If you are reading or manipulating XML/HTML, you are better off using a dedicated parser rather than regex (regex is not the best tool for these cases). In Java, a good alternative is the jsoup library.
I am also going by the comments, in particular this one, in which you say "I have places in the same file that contain the word casa that cannot be changed". So I'm assuming the substitution should be made only when the word "Casa" is inside a CDATA section, which in turn is inside a text tag. In any other case no replacement should be made.
So with jsoup it would look like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.CDataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

// The class name is arbitrary; it only makes the snippet runnable
public class ReplaceInCdata {
    public static void main(String[] args) {
        String texto = "<text><![CDATA[Casa]]></text><text><![CDATA[Qualquer texto que tenha Casa no meio]]></text>"
                + "<text>Texto com Casa mas não dentro de CDATA</text>"
                + "<text><![CDATA[Casamento]]></text>";
        // Parse as XML (not HTML) so that CDATA sections are kept as CDataNode
        Document doc = Jsoup.parse(texto, "", Parser.xmlParser());
        for (Element e : doc.select("text")) { // get all "text" tags
            for (Node node : e.childNodes()) { // check whether there is a CDATA child
                if (node instanceof CDataNode) {
                    // swap it for another CDATA containing "Edifício" in place of "Casa"
                    CDataNode cdata = (CDataNode) node;
                    String novoTexto = cdata.getWholeText().replaceAll("\\bCasa\\b", "Edifício");
                    cdata.replaceWith(new CDataNode(novoTexto));
                }
            }
        }
        System.out.println(doc);
    }
}
Since I only check text tags that contain a CDATA section, and only change the text in those cases, the result is:
<text><![CDATA[Edifício]]>
</text>
<text><![CDATA[Qualquer texto que tenha Edifício no meio]]>
</text>
<text>
Texto com Casa mas não dentro de CDATA
</text>
<text><![CDATA[Casamento]]>
</text>
The third case was not replaced because, despite containing the word "Casa", it is not inside a CDATA section (that is the criterion I used; if your criterion is different, it's not hard to use jsoup's features to check the conditions you need, which can be much harder to do with regex, depending on what you need).
Note that the fourth case (a CDATA section containing the word "Casamento") is not replaced. This is a corner case that the other answer missed (the code there would also replace the "Casa" at the start of "Casamento" - see).
In my code above this does not happen because I use the \b marker (known as a "word boundary" - see more about it here). Basically, \b matches a position in the string that has a word character (letter, digit or _) on one side and a non-word character, or the start or end of the string, on the other. This ensures that I only replace "Casa" when it is a complete word, and not when it is part of another word.
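To see the effect of \b in isolation, here is a minimal sketch using only java.util.regex. The class and method names (WordBoundaryDemo, replaceWholeWord, replaceAnywhere) are made up for this illustration, and the replacement is written with a Unicode escape only to keep the source file ASCII-only:

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares a pattern with and without \b
class WordBoundaryDemo {
    // "Edificio" with an acute accent on the second i, as a Unicode escape
    static final String REPLACEMENT = "Edif\u00edcio";

    // With \b on both sides, "Casa" only matches as a complete word
    private static final Pattern WHOLE_WORD = Pattern.compile("\\bCasa\\b");
    // Without \b, the letters "Casa" match anywhere, even inside a longer word
    private static final Pattern ANYWHERE = Pattern.compile("Casa");

    static String replaceWholeWord(String s) {
        return WHOLE_WORD.matcher(s).replaceAll(REPLACEMENT);
    }

    static String replaceAnywhere(String s) {
        return ANYWHERE.matcher(s).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        // Complete word: replaced
        System.out.println(replaceWholeWord("Qualquer texto que tenha Casa no meio"));
        // "Casa" is only the start of a longer word: left alone
        System.out.println(replaceWholeWord("Casamento"));
        // No boundary check: the start of "Casamento" gets replaced too
        System.out.println(replaceAnywhere("Casamento"));
    }
}
```

Only the last call touches "Casamento", which is exactly the false positive that \b avoids.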
Another problem with the other answer is that it checks whether there is a CDATA section, but then does the replace across the whole string (including the other parts that you indicated should not be replaced - see). So if you have a CDATA section with "Casa", and "Casa" also appears in another part of the string that should not be changed, both will be replaced.
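The following sketch reproduces that pitfall. It is not the other answer's actual code (which is not shown here), only a hypothetical check-then-replace method, naiveReplace, written to behave as described above:

```java
// Hypothetical demo class: checks for a CDATA section, then replaces everywhere
class WholeStringReplaceDemo {
    // "Edificio" with an acute accent on the second i, as a Unicode escape
    static final String REPLACEMENT = "Edif\u00edcio";

    static String naiveReplace(String xml) {
        // The check only says that a CDATA section exists somewhere in the string...
        if (xml.contains("<![CDATA[")) {
            // ...but the replacement is applied to the whole string, not just to the CDATA
            return xml.replaceAll("\\bCasa\\b", REPLACEMENT);
        }
        return xml;
    }

    public static void main(String[] args) {
        String xml = "<text><![CDATA[Casa]]></text><text>Casa</text>";
        // Both occurrences are replaced, including the one outside the CDATA section
        System.out.println(naiveReplace(xml));
    }
}
```

The condition and the replacement operate on different scopes: the first looks for a CDATA section anywhere, the second rewrites everything.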
Anyway, using a dedicated library avoids regex, which is not the most suitable tool for this case. Regex handles text without taking its format or semantics into account, and XML/HTML has too many variations (more than a single regular expression can detect). Any small variation in the XML may require a change to the regex that is not always trivial (as we can see here - the example is in HTML, but the same considerations apply to XML).
A typical case is when the tag is inside a comment (i.e., between <!-- and -->). jsoup detects this situation and ignores comments, that is, it does not replace anything in the commented section. The solution in the other answer cannot detect this case (because the regex only looks at the CDATA, without taking into account the context it is in) and ends up replacing everything. Of course, since it is a comment, it may not make a difference, but this is only one of many possible cases in which a parser is much better than regex at detecting certain situations and avoiding false positives.
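The same idea can be sketched without any external dependency. Note that this example swaps jsoup for the DOM parser that ships with the JDK (javax.xml.parsers), so it is not the jsoup code above. It also assumes the input has a single root element (here a made-up root tag), and the class and method names are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: replaces "Casa" only in CDATA children of <text> elements
class CommentAwareDemo {
    // "Edificio" with an acute accent on the second i, as a Unicode escape
    static final String REPLACEMENT = "Edif\u00edcio";

    static Document replaceInCdata(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        // A <text> tag written inside a comment is comment text, not an element,
        // so it never shows up in this list
        NodeList texts = doc.getElementsByTagName("text");
        for (int i = 0; i < texts.getLength(); i++) {
            NodeList children = texts.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.CDATA_SECTION_NODE) {
                    CharacterData cdata = (CharacterData) child;
                    cdata.setData(cdata.getData().replaceAll("\\bCasa\\b", REPLACEMENT));
                }
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        String xml = "<root>"
                + "<!-- <text><![CDATA[Casa]]></text> -->"
                + "<text><![CDATA[Casa]]></text>"
                + "</root>";
        Document doc = replaceInCdata(xml);
        // The comment keeps its original content
        System.out.println(doc.getDocumentElement().getFirstChild().getNodeValue());
        // The real <text> element has its CDATA content replaced
        System.out.println(doc.getElementsByTagName("text").item(0).getTextContent());
    }
}
```

Because the parser hands the comment over as a separate node type, the code does not need any special handling to leave it alone.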
If you just want to replace "Casa" with "Edifício", wouldn't a simple .replace("casa", "edificio") solve it? – Gustavo
@Gustavo I wish it were that simple; I have places in the same file that contain the word casa that can't be changed. – Adriano Gomes
I recommend you take a look at the methods string.matches(Pattern) and string.regionMatches(int, String, int, int). – NinjaTroll