Remove comments from HTML

Question

Asked 7 years, 1 month ago

I have a TXT taken from an HTML file.

It is full of comments that I need to remove, so I thought of using the replaceAll method of the String class, like this:

public static void main(String[] args) throws IOException {
   FileReader ler = new FileReader("/home/adriano/Desktop/html.txt");
   BufferedReader reader = new BufferedReader(ler);  
   String arquivo = "";
   while( reader.readLine() != null ){
     arquivo += reader.readLine() + "\n";
   }
   System.out.println(arquivo.replaceAll("s/<!--(.|\\s)*?-->//g", ""));
}

It turns out the comments were not removed. Can anyone tell me whether this regex is correct?

I have comments like:

<!-- Assinatura principal -->
<!-- [if gte mso 9]><xml> <o:OfficeDocumentSettings> <o:AllowPNG /> </o:OfficeDocumentSettings> </xml><![endif]-->

Try it like this: System.out.println(arquivo.replaceAll("<!--(.|\\s)*?-->", ""));

– Sorack

2019/02/05 at 13:28
@Sorack using this regex threw an exception: Exception in thread "main" java.lang.StackOverflowError

– Adriano Gomes

2019/02/05 at 13:32
On the replaceAll line?

– Sorack

2019/02/05 at 13:34
The s/ and //g are not part of the regex itself. That syntax is used in other languages (and in some commands, such as sed), but in Java you pass only the expression: replaceAll("<!--(.|\\s)*?-->", ""). That said, since the input is HTML, it might be better to use a dedicated library, such as jsoup. A regex may work in some cases, but there are special cases it will not cover and that a library specialized in HTML handles more easily. Finally, the obligatory link: https://stackoverflow.com/a/1732454 :-)

– hkotsubo

2019/02/05 at 13:44

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-03-21T18:28:59+00:00

The s/ and //g are not part of the regex itself. That syntax is used in other languages (and in some commands, such as sed), but in Java you only need to pass the regular expression as a parameter:

replaceAll("<!--(.|\\s)*?-->", "")
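To illustrate the point, here is a minimal sketch (the class name SedSyntaxDemo and the sample string are made up for this example). With the sed-style wrapper, Java treats "s/" and "//g" as literal text that must appear in the input, so nothing matches and the string comes back unchanged:

```java
public class SedSyntaxDemo {
    // Pattern as written in the question: "s/" and "//g" are matched literally.
    static final String SED_STYLE = "s/<!--(.|\\s)*?-->//g";
    // The same expression without the sed wrapper.
    static final String PLAIN = "<!--(.|\\s)*?-->";

    // Removes every match of the given regex from the input.
    static String strip(String html, String regex) {
        return html.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!-- Assinatura principal --><p>b</p>";
        // No "s/<!--" in the input, so nothing is replaced.
        System.out.println(strip(html, SED_STYLE)); // <p>a</p><!-- Assinatura principal --><p>b</p>
        // Without the wrapper, the comment is removed.
        System.out.println(strip(html, PLAIN));     // <p>a</p><p>b</p>
    }
}
```

A separate detail worth checking in the question's code: the loop calls reader.readLine() once in the while condition and again in the body, so only every second line ends up in arquivo.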

Even so, this expression is not very efficient, and it can even throw a StackOverflowError when applied to very large strings (which seems to be your case).

Basically, the alternation (indicated by |) means both alternatives may have to be checked. Most of the time the first one matches (the dot, meaning "any character", although by default it actually means "any character except line breaks"). Every time the regex reaches a line break, the alternation tests the first option and then the second. The lazy quantifier (*?), although convenient for your case, also has its price. For small strings these details don't matter, but for large strings they start to make a difference (and I'm assuming your file is big enough for that, since you mentioned it threw a StackOverflowError).

You can remove the alternation by making the dot also match line breaks, using the DOTALL option. With the replaceAll method you cannot pass this option as a parameter, but you can enable it by placing (?s) at the beginning of the regex:

arquivo.replaceAll("(?s)<!--.*?-->", "");
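As a small sketch of the two ways to enable this option (the class and method names below are illustrative, not from the question): the inline (?s) flag works with String.replaceAll, while Pattern.compile accepts Pattern.DOTALL as an argument and lets you reuse the compiled pattern. The third method shows what happens without the flag:

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Flag passed to Pattern.compile; the pattern is compiled once and reused.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Inline flag, usable directly with String.replaceAll.
    static String stripInline(String html) {
        return html.replaceAll("(?s)<!--.*?-->", "");
    }

    // Same behavior using the precompiled pattern.
    static String stripCompiled(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    // Without DOTALL the dot stops at line breaks, so multi-line comments survive.
    static String stripWithoutFlag(String html) {
        return html.replaceAll("<!--.*?-->", "");
    }

    public static void main(String[] args) {
        String html = "a<!-- line1\nline2 -->b";
        System.out.println(stripInline(html));   // ab
        System.out.println(stripCompiled(html)); // ab
        System.out.println(stripWithoutFlag(html).equals(html)); // true
    }
}
```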

This should already improve the regex's performance a little. I ran a comparison on regex101.com, and the first version needs more steps to check the string than the second.

Of course, these numbers will vary with the strings and with the language, since each language has an engine with its own implementation details, some of which optimize particular cases. But overall, removing the alternation already improves performance considerably. Since I don't have your full file and couldn't reproduce the StackOverflowError, I'm relying on the regex101.com tests (but I suggest testing with the actual files to be sure).

The other advantage shows up with malformed comments (missing the closing delimiter, for example). In that case the first version needs more than 2700 steps to realize that the comment is never closed, while the second version needs only 140. Even if your file has nothing like that, the overall performance of the second version compared to the first already justifies the change.

Alternative

You can still optimize a little more using the regex below:

arquivo.replaceAll("<!--(?>[^-<>]*)(?>(?!-->)[-<>][^-<>]*)*-->", "");

It uses a technique known as unrolling the loop, which is described in more detail in this book. Basically, it consists of identifying 3 basic elements of the text you want to capture:

the delimiters: in our case, they are <!-- and -->, which appear at the beginning and end of the regex
the "normal": the content that appears most often between the delimiters. In this case, I used [^-<>]* (zero or more characters other than a hyphen, < or >)
the "special": a character that is not normal (less frequent in a comment) and/or that may mean we have reached the closing delimiter (in this case, I used [-<>]: a hyphen, < or >)

The general format of the regex is delim normal* (special normal*)* delim. Atomic groups (indicated by (?>...)) are also used; they stop the engine from backtracking (which happens when it fails to find a match and goes back a few steps to try other combinations of the string, making it take longer). Since "normal" and "special" are mutually exclusive, any backtracking would be wasted effort, so the atomic groups skip those unnecessary steps.

I also use a negative lookahead (the (?!-->) part). It checks that something does not occur ahead; here, that the closing --> of the comment does not come next. If it does not, the regex proceeds to [-<>] (a hyphen, < or >), followed by zero or more characters that are none of those three (and this whole group can repeat several times, because there is another * outside the parentheses). This ensures that hyphens, and even < or >, can appear inside a comment without ending it.
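The behavior described above can be checked with a short sketch. The class name and sample inputs are invented for illustration; only the regex comes from the answer. It covers a comment containing hyphens and angle brackets, a multi-line comment (the negated character class matches line breaks, so no (?s) is needed), and an unclosed comment, which is left untouched:

```java
import java.util.regex.Pattern;

public class UnrolledCommentDemo {
    // delim normal* (special normal*)* delim, with atomic groups and a negative lookahead.
    private static final Pattern COMMENT =
            Pattern.compile("<!--(?>[^-<>]*)(?>(?!-->)[-<>][^-<>]*)*-->");

    // Removes every well-formed HTML comment from the input.
    static String strip(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hyphens, < and > inside the comment do not end it early.
        System.out.println(strip("<p>x</p><!-- a - b <c> --><p>y</p>")); // <p>x</p><p>y</p>
        // A comment spanning two lines is removed as a whole.
        System.out.println(strip("a<!-- line1\nline2 -->b"));             // ab
        // Two separate comments are matched individually, keeping the text between them.
        System.out.println(strip("<!--a-->x<!--b-->"));                   // x
        // An unclosed comment does not match, so the input is returned as is.
        System.out.println(strip("<p>x</p><!-- never closed"));           // <p>x</p><!-- never closed
    }
}
```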

Performance improves somewhat compared to the second version. Again, for small strings the difference will be irrelevant, but for large strings, as yours seems to be, it can make a big difference.

Use an HTML parser

But maybe a regex is not the best solution for your case. Why not use an HTML parser?

I wrote the example below with jsoup version 1.8.3:

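import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

// Recursively removes every comment node below the given node.
public void removerComentarios(Node node) {
    // i only advances when the current child is kept: after remove(),
    // the next sibling moves into position i.
    for (int i = 0; i < node.childNodeSize();) {
        Node child = node.childNode(i);
        if (child.nodeName().equals("#comment")) {
            child.remove();
        } else {
            removerComentarios(child);
            i++;
        }
    }
}

// Usage: arquivo is the String holding the file contents.
Document doc = Jsoup.parse(arquivo);
removerComentarios(doc);
System.out.println(doc.html());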

Of course, there is still the problem that the file may be very large and exhaust the available memory, but this still seems to me a simpler solution than a regex. If jsoup doesn't suit you, you can choose another parser from this list.