Regex to search for word inside tag with CDATA

Question


Asked 4 years, 11 months ago

Viewed 219 times

5

I have a file that contains the following string possibilities:

1st Case: <text><![CDATA[Casa]]></text>

2nd Case: <text><![CDATA[Qualquer texto que tenha Casa no meio]]></text>

I'm trying to put together a regular expression that replaces the word Casa with Edifício, but I'm having trouble building it. I tried the following:

String text = "<text><![CDATA[Casa]]></text>";
String regex = "(\\<text><![CDATA[\\(\\w+)(/Casa)(\\w+)(\\]]></text>))";

Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);

But this gives me an error when the Pattern is compiled. Is it possible to write a regex that handles these two cases?

If you just want to replace "Casa" with "Edifício", wouldn't a simple .replace("casa", "edificio") solve it?

– Gustavo

2019/12/03 at 13:21
@Gustavo I wish it were that simple, but I have places in the same file that contain the word Casa and can't be changed.

– Adriano Gomes

2019/12/03 at 13:43
I recommend you take a look at the methods String.matches(String) and String.regionMatches(int, String, int, int).

– NinjaTroll

2019/12/04 at 12:34

2 answers

2

I suggest the following regex: (?<=CDATA\[)(.*?)(?=\])

Example of implementation

    String casoUm = "<text><![CDATA[Casa]]></text>";
    String casoDois = "<text><![CDATA[Qualquer texto que tenha Casa no meio]]></text>";
    String regex = "(?<=CDATA\\[)(.*?)(?=\\])";

    Matcher matcher = Pattern.compile(regex).matcher(casoUm);
    if (matcher.find()) {
        System.out.println(casoUm.replaceAll("Casa", "Edifício"));
    }

    matcher = Pattern.compile(regex).matcher(casoDois);
    if (matcher.find()) {
        System.out.println(casoDois.replaceAll("Casa", "Edifício"));
    }

Example: Regex Example

1

Francisco, one question: is it possible to include the <text> tag in the Pattern?

– Adriano Gomes

2019/12/03 at 21:25
Yes, maybe something like this: (?<=<text><!\[CDATA\[)(.*?)(?=\]\]><\/text>) As a Java string: String regex = "(?<=<text><!\\[CDATA\\[)(.*?)(?=\\]\\]><\\/text>)";

– Francisco Martins

2019/12/04 at 09:15


by hkotsubo • **55,826** points · Answer 1 · 2019-12-10T02:37:38+00:00

If you are reading or manipulating XML/HTML, you are better off using a dedicated parser rather than regex (because regex is not the best tool for these cases). In Java, a good alternative is the jsoup library.

I am also going by the comments, in particular the one in which you say that "I have places in the same file that contain the word Casa and can't be changed". So I'm assuming the substitution should be made only when the word "Casa" is inside a CDATA section, which in turn is inside the text tag. In any other case no replacement should be made.

So with jsoup it would look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.CDataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;


String texto = "<text><![CDATA[Casa]]></text><text><![CDATA[Qualquer texto que tenha Casa no meio]]></text>"
        + "<text>Texto com Casa mas não dentro de CDATA</text>"
        + "<text><![CDATA[Casamento]]></text>";
Document doc = Jsoup.parse(texto, "", Parser.xmlParser());

for (Element e : doc.select("text")) { // get all "text" tags
    for (Node node : e.childNodes()) { // check whether there is a CDATA child
        if (node instanceof CDataNode) {
            // swap in another CDATA containing "Edifício" in place of "Casa"
            CDataNode cdata = (CDataNode) node;
            String novoTexto = cdata.getWholeText().replaceAll("\\bCasa\\b", "Edifício");
            cdata.replaceWith(new CDataNode(novoTexto));
        }
    }
}
System.out.println(doc);

Since I only check the text tags that contain a CDATA section, and only change the text in those cases, the result is:

<text><![CDATA[Edifício]]>
</text>
<text><![CDATA[Qualquer texto que tenha Edifício no meio]]>
</text>
<text>
 Texto com Casa mas não dentro de CDATA
</text>
<text><![CDATA[Casamento]]>
</text>

The third case was not replaced because, despite containing the word "Casa", it is not inside a CDATA section (which is the criterion I used). If your criterion is different, it is not hard to use jsoup's features to check the conditions you need, something that can be much harder to check with regex.

Note that the fourth case (a CDATA section containing the word "Casamento") is not replaced. This is a corner case that the other answer missed: the code there would turn "Casamento" into "Edifíciomento".

In my code above this does not happen because I use the \b marker (known as a "word boundary"). Basically, \b matches a position in the string that has a word character (letter, digit, or underscore) on one side and a non-word character, or the start or end of the string, on the other. This ensures that I only replace "Casa" when it is a complete word, and not part of a longer word.
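To make the effect of \b concrete, here is a minimal sketch that compares a whole-word replacement with a plain substring replacement. The class and method names (WordBoundaryDemo, replaceWholeWord, replaceAnywhere) are made up for illustration and are not part of the code above.

```java
public class WordBoundaryDemo {
    // Replaces "Casa" only when it stands alone as a whole word.
    static String replaceWholeWord(String input) {
        return input.replaceAll("\\bCasa\\b", "Edifício");
    }

    // Replaces every occurrence of "Casa", even inside longer words.
    static String replaceAnywhere(String input) {
        return input.replace("Casa", "Edifício");
    }

    public static void main(String[] args) {
        // "Casa" on its own is a whole word, so it is replaced.
        System.out.println(replaceWholeWord("Texto com Casa no meio"));
        // "Casamento" has no boundary after "Casa", so it is left alone.
        System.out.println(replaceWholeWord("Casamento"));
        // Without \b the prefix of "Casamento" is replaced as well.
        System.out.println(replaceAnywhere("Casamento"));
    }
}
```

With \b, "Casamento" comes back unchanged. Without it, the result is "Edifíciomento".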

Another problem with the other answer is that it checks whether there is a CDATA section, but then does the replace over the entire string (including the other parts that you implied should not be replaced). So if you have a CDATA section with "Casa", but also have "Casa" in another part of the string that should not be replaced, both will be replaced.
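If regex had to be used anyway, one way around this particular problem would be to rewrite only the matched spans instead of the whole string. The sketch below is an illustration under assumptions, not a recommendation: it reuses the lookaround pattern suggested in the comments, the class and method names are hypothetical, and it still has the usual regex limitations (it knows nothing about comments, attributes, whitespace inside tags, or a text element that holds more than one CDATA section).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScopedCdataReplace {
    // Matches only the content of a CDATA section that fills a <text> element.
    private static final Pattern CDATA_IN_TEXT = Pattern.compile(
            "(?<=<text><!\\[CDATA\\[)(.*?)(?=\\]\\]></text>)", Pattern.DOTALL);

    static String replaceInsideCdata(String xml) {
        Matcher matcher = CDATA_IN_TEXT.matcher(xml);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Rewrite the matched CDATA content only, never the rest of the input.
            String updated = matcher.group(1).replaceAll("\\bCasa\\b", "Edifício");
            matcher.appendReplacement(result, Matcher.quoteReplacement(updated));
        }
        // Copy whatever follows the last match unchanged.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String xml = "<text><![CDATA[Casa]]></text>"
                + "<p>Casa</p>"
                + "<text><![CDATA[Casamento]]></text>";
        System.out.println(replaceInsideCdata(xml));
    }
}
```

Here only the first "Casa" is rewritten. The one inside the p element and the word "Casamento" are left as they are.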

In any case, using a dedicated library avoids regex, which is not the most suitable tool for this job. Regex handles text without taking its format or semantics into account, and XML/HTML has too many variations (more than a regular expression can reasonably detect). Any small variation in the XML may require a change in the regex, and not always a trivial one (a commonly cited example of this is in HTML, but the same considerations apply to XML).

A typical case is when the tag is inside a comment (i.e., between <!-- and -->). jsoup detects this situation and ignores comments, so it does not replace anything in the commented-out section. The other answer's solution cannot detect this case (because the regex only looks at the CDATA, without taking into account the context in which it appears) and ends up replacing everything. Of course, since it is a comment, it may not make a difference, but this is only one of many cases in which a parser is much better than regex at detecting certain situations and avoiding false positives.
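The same idea can be tried without an external dependency. The sketch below swaps jsoup for the JDK's built-in DOM parser (javax.xml.parsers), which is a different tool from the one used above. It assumes the input is well-formed XML with a single root element (here a made-up root tag), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomCdataReplace {
    // Rewrites "Casa" inside CDATA children of <text> elements
    // and returns the updated CDATA contents in document order.
    static List<String> replaceInCdata(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> updated = new ArrayList<>();
            // Commented-out markup is a comment node, so it never shows up here.
            NodeList texts = doc.getElementsByTagName("text");
            for (int i = 0; i < texts.getLength(); i++) {
                NodeList children = texts.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                        CDATASection cdata = (CDATASection) child;
                        cdata.setData(cdata.getData().replaceAll("\\bCasa\\b", "Edifício"));
                        updated.add(cdata.getData());
                    }
                }
            }
            return updated;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<text><![CDATA[Casa]]></text>"
                + "<!-- <text><![CDATA[Casa]]></text> -->"
                + "<text>Casa</text>"
                + "</root>";
        System.out.println(replaceInCdata(xml));
    }
}
```

Only the first text element contributes a result. The commented-out copy and the text element without CDATA are never visited as CDATA nodes, so no extra condition is needed to skip them.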