What is a gluttonous regular expression?

Asked

Viewed 1,499 times

24

  • What is a Expressão Regular Gulosa?

  • What sets her apart from Expressão Regular Não-Gulosa?

  • Expressões regulares Gulosas consume more resources than Não-Gulosas?

  • 1

    Now as to the regular expressions, since you are studying on, recommend reading of this article

  • 1

    Valeu @Guilhermelautert

  • 3

    "Regular Gluttonous Expression" is the way fat people regularly express themselves about their own gluttony.

5 answers

20


Quantifiers

Specify how many times the previous instance should be captured

Quantifier Greedy

In general when talking about greedy quantifier refers to the *, because it represents 0 or infinite times, that is to say he will try to capture as much as possible, but in case there is nothing to capture this ok also.

However a greedy quantifier can also be {0,} which has the same effect.

A few more :

+     // captura o máximo possível, mas deve ocorrer ao menos uma vez
{1,}  // tem o mesmo efeito

Thoughts

1. {1,5}  // é um quantificador que vai de 1 a 5
2. {1,80} // é um quantificador que vai de 1 a 80
3. {60,80} // é um quantificador que deve ter no mínimo 60 e vai ate 80

The 2. would be more greedy than the 1. because he marries up to 80 while the 1. only with 5, already the 3. he says he must have at least 60 can go to 80, which would be even more greedy because it specifies a minimum.

Analogy

Think of a person eating:

  • to 1. says he eats 1 to 5 kg.
  • to 2. says he eats 6 to 80 pounds.
  • to 3. says he wants to eat 60 to 80 kg.

Non-greedy quantifiers

To make the opposite effect on the greedy quantifier is added the ?, so instead of capturing the maximum he tries to capture the minimum.

In general, reference is made to the *?, that house with the maximum, but tries to capture the minimum, in this case the minimum is 0.

A common mistake

`teste .*?` // não faz sentido ter o `.*?`, pois ele não vai capturar nada.

Minimum example

`teste`.macth(/.*?t/) // enquanto o `.*` iria capturar `test`, o mínimo captura apenas `t`

Performance

The greedy quantifier is faster, because he eats until he gives no more, while the non-greedy one is slower because he always checks if it is the minimum possible.

Analogy

Thinking about people eating again:

  • The greedy eat till there’s no more food.

  • The non-greedy eats a little check if it’s enough, so eat a little more and check if it’s enough, and so on.

  • +1 I liked the analogy :p. It is a good parable

16

Greedy: search until the end of the string the last occurrence, ie, take as much as possible.

Ex.:

Regex: maça .*
Input: maça melancia laranja

What the regex will take is: maça melancia laranja


Not greedy search until the first occurrence, ie, take as little as possible, examples: ?, {1}

Ex.:

Regex: maça .*?
Input: maça melancia laranja

What the regex will take is: maça


Expressões regulares Gulosas consume more resources than Não-Gulosas?

Analyzing the quick explanation above, we can conclude that the utilizing of a gluttonous regular expression in long texts will consume more processing and memory.

10

A Regex with a sweet tooth (Greedy) try to match as many times as possible the specified pattern, use the asterisk to set this *.

A non-greedy expression combines the text as little as possible, by finding the first occurrence of the pattern it to.

Some greedy expressions tend to consume more resources because they analyze the text in contrast to the non-greedy ones that analyze part of it.

4

The other answers have already explained what it is, but I think it’s worth explaining as regex internally treats greedy and non-greedy quantifiers, even to clarify what was said in one of the answers, who claims that "the greedy quantifier is faster" (only this is not always true as it depends on the expression and the string being checked).

And to better understand the reason, it is worth explaining in detail how the quantifiers "greedy" (or "greedy" work - so much so that in English they are called Greedy) and its non-greedy versions (or "lazy" - Lazy).


For example, if we have this string:

a,b,c,123,def,ghi,jkl,mno,pqr,stu,vxw,yz

And this regex: ,.*,\d (comma, zero or more characters, another comma and a digit). According to regex101.com, this regex needs 48 steps to find a match (the exact amount may vary from a engine for another, but in general, for this specific example, we will see that the quantifier Greedy takes longer than the Lazy).

Already if we use the non-greedy quantifier/Lazy, regex needs only 9 steps to find a match.

This difference is not because the greedy quantifier tries to take as many characters as possible, but yes because of the way it’s done. And how does he do it? No regex101.com it is possible thresh every step of engine, but basically she does this:

  • regex moves forward in the string until it finds the first comma (in this case, it is the one just after "a")

  • The .* is greedy/Greedy, then it goes to the end of the string (since the point corresponds to any character, and how the * is greedy, so he grabs everything at once, advancing to the end of the string). No regex101 (no link already quoted) you can see this action in step 3

  • Then regex checks the second comma of the expression (the one after the .*). But since the regex went to the end of the string, there is no later comma, so it does the backtracking, I mean, she turns around and tries again. In this case, it returns a position before the end (i.e., to the "z"), and sees if after it has a comma. Since it does not have, it returns one more position (to the "y"), sees if it later has a comma, and since it does not have, it returns one more position and finds the comma that is before the y

    • Just to remember, so far, the regex took everything from the first comma (after the "a") to the last comma (before the "y"), and this all corresponds to the excerpt ,.*,. I mean, we still have to check the \d:

      ,.*,\d
         ↑
         regex verificou até aqui (a segunda vírgula)
      
      a,b,c,123,def,ghi,jkl,mno,pqr,stu,vxw,yz
                                           ↑
                         ela já consumiu a string até este ponto
      
  • But as after the comma has the "y", which does not correspond to \d, then she needs to come back again

  • Then she goes back to the "w", then to the "x", to the "v", and finds another comma (the one before the "v") - in the link to debug of regex101.com, this is step 12

  • Then she sees if after this comma there is a digit. But how it has the letter "v" (which does not correspond to \d), she returns again (for the "u", "t", "s", finds another comma, checks if later has \d, etc....)

This comes and goes is repeated several times, until she finally finds the comma before the "1", and sees that after her has the 1, which corresponds to \d, and finally finds the match, that is ,b,c,1.


It has already been debug a regex with quantifier Lazy, we’ll see why she does in less steps:

  • regex moves forward in the string until it finds the first comma (in this case, it is the one just after "a")
  • Like .*? is lazy/non-greedy, first he tries as little as possible, which in the case is zero (because the quantifier * means "zero or more occurrences"). That is, it sees whether after the first comma it has "zero characters followed by comma". But as after the first comma has a "b", then it consumes that "b" (i.e., it becomes "part" of the .*?) and see what’s next
  • After the "b" has a comma, then the regex checks if it then has a \d. But since you don’t have, the .*? takes into account the entire "b,c" section. Then regex sees that after that it has a comma and a comma \d and finds the match ,b,c,1

Doing a basic Python test, with module timeit:

from timeit import timeit
import re

# executa 1 milhão de vezes cada teste
params = { 'number': 1000000, 'globals': globals() }

s = 'a,b,c,123,def,ghi,jkl,mno,pqr,stu,vxw,yz'
greedy = re.compile(r',.*,\d')
lazy = re.compile(r',.*?,\d')

print(timeit('greedy.search(s)', **params))
print(timeit('lazy.search(s)', **params))

The exact time may vary depending on the hardware, etc, but overall, the greedy regex took longer (on my machine, it was between 60% to 70% longer), and on Ideone.com and in the Repl.it the results were similar.

Javascript-enabled, a regex Lazy also showed faster.

Anyway, "being faster" is not about being greedy or not, it’s about the expression and the string being checked.

Just change the string from the above example to, for example, a,b,c,x23,def,ghi,jkl,mno,pqr,stu,vxw,1z, that the greedy regex becomes faster. This happens because of the way the match is done: in the greedy regex, we have seen that it goes to the end of the string, and then starts to come back until you find a match, as long as the Lazy starts from the beginning of the string and progresses slowly until finding a match.

So if the match is closer to the beginning of the string, regex Lazy find faster, but if the match is closer to the end of the string, the greedy regex will be faster (see here and here as the situation reverses if we change the string).


Finally, it is important to understand the working differences, rather than blindly trusting that the greedy quantifier will always be faster. There are cases and cases, and as already explained in the other answers, even the match found may be different.

Depending on the engine/language/tool, internal optimizations can also be made depending on the circumstances (both the expression and the string being checked), which may or may not influence the amount of steps needed.

2

Although this question was asked several years ago, there are some misconceptions that need to be corrected.

The * is not the only greedy quantifier.
The greedy quantifiers are: ? + {} *
For example, the optional ? box 0 or 1 time the previous item.

Texto      ER         Casou
aa         ...?       aa
aaa        ...?       aaa
See that in the second text the ER married aaa instead of aa.
Greedy quantifiers will always marry as many as he can.
In the first text ER married aa just because there was no third character.

Already the {} it’s just not greedy if you use it in the form {numero}.
As in {3}. In this case he is exact.
In the other ways he’s also greedy.

The non-greedy quantifiers are: ?? *? +? {}?
That is, the greedy quantifiers "glued" with a ? on your right.
Unlike the greedy ones, they always marry as little as possible.

Texto      ER         Casou
aa         ...??      aa
aaa        ...??      aa 

Note that although the second text is larger, the ER still married aa.
Why? Because the ?? is not greedy. He married as little as possible.

A non-greedy quantifier shows its true usefulness when there is an element on your right. It matches the characters of the text, respecting the maximum amount, until you find a character that also matches the element on your right. in this case, he, as a gentleman metacharacter, yields the character to his neighbor, who then marries with this character.

Already the greedy goes marrying everything he can, respecting the maximum amount, until he has nothing else to marry. Then the foma looks to his right, only to see his neighbor staring at him ugly. Under free and spontaneous pressure, he will peel characters, one by one, from right to left, until he finds the character that his grumpy neighbor wants to marry, yielding this character to him.

Although the description says "character" in the singular, the neighboring element does not necessarily need to marry only one character. He could be the metacharacter of the group (), grouping several representative metacharacters and literal characters and/or it being followed by a quantifier, thus representing more than one character.

To demonstrate, see two ER applied in the html code below:

<b>bold</b>
<p>paragraph</p>
<Important Strong></Strong>
<H1>title</H1>

With the ER <.*>:

<b>negrito</b>
<p>parágrafo</p>
<strong>importante</strong>
<h1>título</h1>

And with the ER <.*?>:

<b>bold</b>
<p>paragraph</p>
<strong>important</strong>
<h1>title</h1>

The highlighted and bold part is what the ER married. Observe how the ER with the quantifier * married everything from the first sign of minor(<) to the last sign of major(>). While the ER with the *? only married with the html tags.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.