Regex in Word XML

I have XML that was extracted from a docx file, in this format:

<w:p w:rsidR="00AE2D8E" w:rsidRPr="00AE2D8E" w:rsidRDefault="00AE2D8E">
    <w:pPr>
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
    </w:pPr>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>amet</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:proofErr w:type="spellStart"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>consecteur</w:t>
    </w:r>
    <w:proofErr w:type="spellEnd"/>
    <w:r w:rsidRPr="00AE2D8E">
        <w:rPr>
            <w:b/>
            <w:lang w:val="en-US"/>
        </w:rPr>
        <w:t>.</w:t>
    </w:r>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
</w:p>

The text in the docx reads "Lorem ipsum dolor sit amet consecteur.", but in the XML it ends up split across several runs because of differences in formatting (font, bold, etc.).

I need to replace the text "Lorem ipsum dolor sit amet consecteur." with some other text.

Does anyone know how to do this with a regex? Is it even possible? If not, what other option would be viable?

EDIT: My goal is to replace the text "Lorem ipsum dolor sit amet consecteur." with another text. The problem is that the docx XML inserts text-formatting tags in the middle of it. The regex I have is:

\bLorem ipsum dolor sit amet consecteur.\b

This regex does not find the phrase because of the markup in the middle. Ideally, the replacement would ignore the tags in between.

  • I don’t understand your question; if you edit it and explain better, I’ll try to help!

  • Okay, I edited it to try to explain better.

  • I noticed that the text is fragmented across several tags. Could I delete all the tags that contain text and keep only one?

  • Here’s what it would look like: https://jsfiddle.net/n7oLhfk3/

  • I don’t think that will work: imagine a large, multi-part document where I need to replace text in several places. In general terms, I need to translate parts of a docx.

  • Do you want to remove the tags, take just the text, and replace it?

1 answer

The best way to capture text in your XML is to use the tag delimiters as the capture boundaries: capture anything that sits outside the tags, starting right after a tag closes with > and stopping where the next tag opens with <.

The next regex does just that:

>([A-zÀ-ÿ.,:?! ]{1,})<|>([ A-zÀ-ÿ.,:?!]{1,})\n

You can see how this regex works here.
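
As a rough illustration, the pattern can also be tried from Python's re module against a shortened copy of the paragraph from the question. The sample string and the helper name extract_fragments are assumptions made for this sketch; only the pattern itself comes from this answer.

```python
import re

# Pattern from the answer, unchanged. Group 1 is text followed by "<",
# group 2 is text followed by a line break.
PATTERN = re.compile(r">([A-zÀ-ÿ.,:?! ]{1,})<|>([ A-zÀ-ÿ.,:?!]{1,})\n")

# Shortened copy of the paragraph from the question (run properties omitted).
SAMPLE = """<w:p>
    <w:r>
        <w:t xml:space="preserve">Lorem ipsum dolor sit </w:t>
    </w:r>
    <w:r>
        <w:t>amet</w:t>
    </w:r>
    <w:r>
        <w:t xml:space="preserve"> </w:t>
    </w:r>
    <w:r>
        <w:t>consecteur</w:t>
    </w:r>
    <w:r>
        <w:t>.</w:t>
    </w:r>
</w:p>"""


def extract_fragments(xml_text):
    """Return every text fragment the pattern captures, in document order."""
    return [m.group(1) or m.group(2) for m in PATTERN.finditer(xml_text)]


fragments = extract_fragments(SAMPLE)
print(fragments)
print("".join(fragments))
```

Joining the captured fragments gives back the sentence as it reads in Word, which is the form a plain-text search or translation step needs.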

Explanation of the regex:

  • >([A-zÀ-ÿ.,:?! ]{1,})< - the match starts at the character >. Then comes the capture group ([A-zÀ-ÿ.,:?! ]{1,}), which captures 1 or more letters (including accented ones), spaces, or the punctuation marks . , : ? !, provided those characters are immediately followed by the opening of another tag, <. Note that digits are not in the class, and that the range A-z also matches the ASCII characters [ \ ] ^ _ ` that sit between Z and a.
  • | - is the OR operator; it indicates that the regex accepts either the pattern before it or the pattern after it.
  • >([ A-zÀ-ÿ.,:?!]{1,})\n - does the same as the first alternative, but its closing delimiter is a line break, for cases where the text is the last thing on the line and the next tag opens on the following line.
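
Capturing the fragments still leaves the replacement itself. One possible way to do it swaps the regex for Python's xml.etree.ElementTree: join the w:t texts of each paragraph, and when the joined text contains the phrase, write the replacement into the first w:t and empty the rest. This is not the method from this answer, only a minimal sketch. The function name replace_in_paragraphs and the wrapper around the sample paragraph are hypothetical, and the replaced text takes on the formatting of the first run.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def replace_in_paragraphs(root, old, new):
    """Replace `old` with `new` in every w:p whose joined run text contains it.

    Hypothetical helper: returns the number of paragraphs changed.
    """
    changed = 0
    for para in root.iter(f"{{{W}}}p"):
        texts = para.findall(".//w:t", NS)
        joined = "".join(t.text or "" for t in texts)
        if not texts or old not in joined:
            continue
        # Put the whole updated text in the first w:t and empty the others,
        # so the formatting of the later runs no longer applies to it.
        texts[0].text = joined.replace(old, new)
        texts[0].set(XML_SPACE, "preserve")
        for t in texts[1:]:
            t.text = ""
        changed += 1
    return changed


# Shortened copy of the paragraph from the question, wrapped in a root
# element that declares the namespace (an assumption for this sketch).
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t xml:space="preserve">Lorem ipsum dolor sit </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>amet</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>consecteur</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

root = ET.fromstring(SAMPLE)
count = replace_in_paragraphs(
    root, "Lorem ipsum dolor sit amet consecteur.", "Hello world."
)
result = "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))
print(count, result)
```

Working per paragraph keeps the search independent of how Word split the sentence into runs, at the cost of flattening the formatting inside any paragraph that is changed.
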
  • Still waiting for whoever downvoted to leave a comment justifying it...

  • Man, I think I can make it work with this.

  • Can you tell me why it didn’t capture "Lorem ipsum dolor sit"?

  • Yes, because I forgot to include the space among the characters to capture. I will update the answer and the regex test link.

  • @Juliano Edited.

  • @Juliano let me know if it worked or if you need more help ;)

  • Sorry... I only had time now. The solution worked, thank you.
