Regex capturing it all

2

I’m having trouble with a regex: it doesn’t capture just 1 de 8 as I want, it matches far beyond that. See: https://www.regex101.com/r/eX6bC9/1

That’s the string I’m trying to match:

<span class='pages'>1 de 8</span><span class='current'>1</span><a class="page larger" href="http://megafilmeshd.net/category/lancamentos/page/2/">2</a><a class="page larger" href="http://megafilmeshd.net/category/lancamentos/page/3/">3</a><span class='extend'>...</span>

And the regex:

<span class='pages'>(.*)<\/span>

3 answers

4

To capture only

<span class='pages'>1 de 8</span>

Add a question mark to the regex. It makes the quantifier inside the parentheses lazy, so the group matches as little as possible, overriding the default greedy behavior of (.*), which matches as much as possible:

<span class='pages'>(.*?)<\/span>

2

Your problem is that the quantifier * is greedy, which means it matches as much of the input as possible before stopping. If you want it to match as little as possible, you can use its lazy variant, *?:

<span class='pages'>(.*?)<\/span>
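To make the difference concrete, here is a minimal Python sketch that runs both versions against the string from the question. The variable names are only illustrative, and the \/ escape is dropped because Python does not use / as a regex delimiter.

```python
import re

# The string from the question, split across lines for readability.
html = (
    "<span class='pages'>1 de 8</span>"
    "<span class='current'>1</span>"
    '<a class="page larger" href="http://megafilmeshd.net/category/lancamentos/page/2/">2</a>'
    '<a class="page larger" href="http://megafilmeshd.net/category/lancamentos/page/3/">3</a>'
    "<span class='extend'>...</span>"
)

# Greedy: .* runs to the end of the input, then backtracks to the LAST </span>.
greedy = re.search(r"<span class='pages'>(.*)</span>", html)

# Lazy: .*? stops at the FIRST </span> that lets the match succeed.
lazy = re.search(r"<span class='pages'>(.*?)</span>", html)

print(greedy.group(1))  # everything up to the last </span>
print(lazy.group(1))    # 1 de 8
```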

That said, think twice before using regular expressions to interpret HTML. In some very limited cases they may work, but in general it is better to use a full parser for that language.

1

Complementing the other answers, there is a corner case in their proposed solution. Suppose the HTML has a commented-out span inside the span you want to capture:

<span class='pages'>1 de <!-- <span> comentado </span> --> 8</span><span class='current'>1</span>

The regex <span class='pages'>(.*?)</span> will match only this stretch:

<span class='pages'>1 de <!-- <span> comentado </span>

This leaves out the end of the comment (-->), the 8, and the closing tag (</span>).

This happens because the lazy quantifier *? takes as few characters as possible.

That is, the regex first matches <span class='pages'>, and then picks up as few characters as possible that are followed by a </span>. So the regex only goes up to the </span> that is inside the comment.
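The corner case can be reproduced with a short Python sketch (the variable names are illustrative, not part of the original answer):

```python
import re

# Input with a commented-out span inside the span we want.
html = (
    "<span class='pages'>1 de <!-- <span> comentado </span> --> 8</span>"
    "<span class='current'>1</span>"
)

m = re.search(r"<span class='pages'>(.*?)</span>", html)

# The match ends at the </span> inside the comment, not at the real closing tag.
print(m.group(0))  # <span class='pages'>1 de <!-- <span> comentado </span>
print(m.group(1))  # 1 de <!-- <span> comentado
```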


Cases like this show that regex is not the best tool for parsing HTML. The case above, for example, is easy for any HTML parser (which the vast majority of programming languages have), since a parser already understands the structure of HTML and comments can simply be ignored. With regex, you would have to include in the expression a piece that detects comments (which is not that simple), more or less like this:

<span class='pages'>((?:[^<>]*|(?=<!--).*?-->)*)</span>

Now, instead of just .*?, there are two alternatives (separated by |, which means or):

  1. [^<>]: a negated character class. In this case, it means "any character that is neither < nor >". This ensures that we do not go past the closing of the tag, preventing the regex from "invading" other spans.
  2. (?=<!--).*?-->:
    • first we have a lookahead (the (?=<!--) part), which checks whether something exists ahead. In this case, it checks whether there is a comment opening (<!--)
    • then we have .*? (zero or more characters) with a lazy quantifier, so that it doesn’t invade other comments (the same way the original solution uses .*? to keep the regex from invading other span tags)
    • followed by the comment closing, -->

The whole alternation has the quantifier * (zero or more occurrences), so it’s basically ((?:option 1|option 2)*). That is, inside the span tag I can have several characters that are neither < nor >, or several comments. I use (?: so that the inner parentheses do not form a capture group (otherwise their content would be captured in a separate group).
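As a rough check, the comment-aware regex can be run in Python against the input with the commented-out span. The constant name COMMENT_AWARE and the cleanup step at the end are assumptions added for illustration. They are not part of the original answer.

```python
import re

# The comment-aware regex from above: non-tag characters OR whole comments, repeated.
COMMENT_AWARE = re.compile(
    r"<span class='pages'>((?:[^<>]*|(?=<!--).*?-->)*)</span>"
)

html = (
    "<span class='pages'>1 de <!-- <span> comentado </span> --> 8</span>"
    "<span class='current'>1</span>"
)

m = COMMENT_AWARE.search(html)
captured = m.group(1)
print(captured)  # 1 de <!-- <span> comentado </span> --> 8

# Optional cleanup (an extra step, not in the answer): drop comments, collapse spaces.
text = " ".join(re.sub(r"<!--.*?-->", "", captured).split())
print(text)  # 1 de 8
```

Note that the group now contains the comment itself, so a second pass is still needed if only the visible text is wanted.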


But that still doesn’t solve every case. If the span has another tag inside it:

<span class='pages'>1 de <!-- <span> comentado </span> --> <a href="www.google.com">link</a> 8</span><span class='current'>1</span>

The previous regex does not work here: I used [^<>] so that it would not invade other tags, so it fails because there is an <a> tag inside the span. Then I’d have to switch to something like:

<span class='pages'>((?:[^<>]*|(?=<!--).*?-->|<((?!span\b)[^> ]+)[^>]*>.*?</\2>)*)</span>

I added one more option to the alternation: <((?!span\b)[^> ]+)[^>]*>.*?</\2>. Basically, it matches any tag other than span, but in a very "generic" way, since it does not validate much. For the tag name, for example, I used [^> ] (anything that is not > or a space), which means that if you have tags like <123.-*()x>, the regex will accept them. Of course you can improve it so that it only accepts valid tag names, but look at the regex again: it’s already complicated enough, and it still doesn’t cover all possible cases of valid HTML. Is it worth going on?
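The sketch below, in Python, compares the two longer regexes on the input with the nested <a> tag and also feeds the last one a nonsense tag name. The names SECOND, THIRD, and the sample strings are illustrative assumptions.

```python
import re

# Handles comments only.
SECOND = re.compile(r"<span class='pages'>((?:[^<>]*|(?=<!--).*?-->)*)</span>")

# Handles comments plus any nested non-span tag (tag name captured in group 2).
THIRD = re.compile(
    r"<span class='pages'>"
    r"((?:[^<>]*|(?=<!--).*?-->|<((?!span\b)[^> ]+)[^>]*>.*?</\2>)*)"
    r"</span>"
)

html = (
    "<span class='pages'>1 de <!-- <span> comentado </span> --> "
    '<a href="www.google.com">link</a>'
    " 8</span><span class='current'>1</span>"
)

print(SECOND.search(html))          # None: the <a> tag blocks the match
print(THIRD.search(html).group(1))  # content including the comment and the <a> tag

# The tag-name check is very permissive: this invalid tag name is accepted too.
odd = "<span class='pages'><123.-*()x>odd</123.-*()x></span>"
print(THIRD.search(odd).group(2))   # 123.-*()x
```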

Of course, for a few simpler cases, such as small snippets of HTML and controlled inputs in which you "know" that cases like the ones above will not occur, a simpler regex, as the other answers suggested, can do the job. But make the HTML a little more complicated and it quickly becomes unfeasible. For those cases, a dedicated parser is the most appropriate tool.
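For comparison, here is a minimal sketch of the parser approach using html.parser from Python's standard library. The class PagesSpanText and the helper pages_text are hypothetical names made up for this example. The sketch assumes the class attribute is exactly "pages", and it keeps the text of nested tags.

```python
from html.parser import HTMLParser


class PagesSpanText(HTMLParser):
    """Collects the text inside <span class='pages'>; comments are skipped."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # span nesting level inside the target span (0 = outside)
        self.parts = []  # text chunks found inside the target span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1                # nested span inside the target
        elif ("class", "pages") in attrs:
            self.depth = 1                 # entering the target span

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    # handle_comment is not overridden, so comments are simply ignored.


def pages_text(html):
    parser = PagesSpanText()
    parser.feed(html)
    parser.close()
    # Join the chunks and collapse the whitespace left around removed comments.
    return " ".join("".join(parser.parts).split())


with_comment = (
    "<span class='pages'>1 de <!-- <span> comentado </span> --> 8</span>"
    "<span class='current'>1</span>"
)
with_link = (
    "<span class='pages'>1 de <!-- <span> comentado </span> --> "
    '<a href="www.google.com">link</a>'
    " 8</span><span class='current'>1</span>"
)

print(pages_text(with_comment))  # 1 de 8
print(pages_text(with_link))     # 1 de link 8
```

The commented-out span and the nested link need no special handling here, because the parser already tracks comments and tag boundaries.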
