Regular Expressions: Lazy quantifier function "?"

Asked

Viewed 343 times

13

I have learned about the use of Regular Expressions and read some explanations about the use of the sign ? (called Lazy quantifier), as in this Microsoft documentation:

*? : Corresponds to the previous element zero times or more, but as few times as possible.

In Regexr shows me the following explanation:

inserir a descrição da imagem aqui

If I use that expression so much ^(.*)$ how much ^(.*?)$ or ^(.+?)$ I get the same result:

var string = "abcdef";
var re1 = /^(.*)$/;
var re2 = /^(.*?)$/;
var re3 = /^(.+?)$/;
console.log(string.match(re1));
console.log(string.match(re2));
console.log(string.match(re3));

What wasn’t clear to me was "...but as few times as possible.".

How so, as few times as possible?

In which situation to use and when not to use the ? to quantify the * or the +, since, as in the examples above, using or not, the result is the same?

3 answers

12


You get the same result because the $ force the end of the capture group to be at the end of the string. Thus, the smallest possible number of times still goes necessarily to the end of the string.

The ? picks up the * or + as few times as possible, in contrast to the regex default which is to be "Greedy", that is, take as many characters as possible that fit the pattern. Consider the following:

var string = 'Eu quero pegar uma palavra que comece com "q" seguida por um espaço.';
var re1 = /(q.*) /;
console.log(string.match(re1)); 
// "quero pegar uma palavra que comece com \"q\" seguida por um"

Notice that instead of just taking the "want," the capture group took the quero, which begins with "q", and everything else until the last space found. So the regex is greedy: he takes everything he can.

Let’s see now the same thing with the ? that takes away the greed of the regex:

var string = 'Eu quero pegar uma palavra que comece com "q" seguida por um espaço.';
var re2 = /(q.*?) /;
console.log(string.match(re2));
// "quero"

This is more in line with what we wanted. regex picked up in the capture group my word which starts with "q", and stopped at the first space; I mean, he gave match in as few characters as possible that satisfies the regular expression.

In its original example, the smallest possible number of characters was still the complete string because with the ^ and the $, you forced regex to start at the first character of the string and end at the last. Thus, the smallest string that satisfies the full expression is the whole string.

  • Good explanation, but testing in Regexr he is selecting everything that starts with "q" until the first space: https://regexr.com/3r7lk

  • Ah tah, it was because of the flag g.

  • @dvd not because of flag g; is because regex Engins usually provide two results: the whole match, or just the one in the capture group. Space, in the case of these regexes, is outside the capture group, so what I commented on was the result of the capture groups. If you run in the browser console, you will see that the result is actually an array of the two values (with and without space).

  • Back at Regexr I defused the global flag and only captured the "I want"

  • @funny dvd. The regex101 shows separate groups and full Matches.

  • Because eh, it seems that the two sites show differently the results.

Show 1 more comment

4

Imagine you have a giant HTML page in a string s and you search the regex /<[^>]+>.*<\/[^>]+>/. What will you get? Yes, the html tag, the whole rest of the document and html closes it, because the normal quantifiers are gluttonous.

Now, if you have the same string s and a regex /<[^>]+>.*?<\/[^>]+>/, what will you get? Any tag that opens and closes in the first snippets of the document.

$(function () {

  var s = $('#conteudo').html();
  
  alert('Original: ' + s);
  alert('Eager: ' + s.match(/<[^>]+>.*<\/[^>]+>/gm)[0]);
  alert('Lazy: ' + s.match(/<[^>]+>.*?<\/[^>]+>/gm)[0]);

});
* { font-family: Consolas; }
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id='conteudo'><table><thead><tr><th>Produto</th><th>Preço</th><th>Quantidade</th></tr></thead><tbody><tr><td>Feijão</td><td>R$ 8,75</td><td>1</td></tr><tr><td>Arroz</td><td>R$ 4,99</td><td>2</td></tr></tbody><tfoot><tr><td>Total</td><td></td><td>R$ 18,73</td></tr></tfoot></table></div>

  • 1

    Because eh, I’m still in the first steps of regex, but I’ve learned a lot. I just had this doubt. Obg!

  • I think this site is very good: https://www.regular-expressions.info/

4

Complementing the other answers, the ? as an indicator of lazyness is not limited to being used only with * and +. It can be used with any other quantifier, like the {}, for example.

let s = 'a1234567890';

// sem lazy, pega a maior quantidade possível
console.log(s.match(/a\d*/)); // a1234567890
console.log(s.match(/a\d+/)); // a1234567890
console.log(s.match(/a\d{3,7}/)); // a1234567
console.log(s.match(/a\d{3,}/)); // a1234567890

The above expressions correspond to a letter a followed by a certain number of digits (\d followed by a quantifier).

In the above cases there is no ?, then the quantifiers are in their mode default, that is to be "greedy" (Greedy), that is, they try to take as many characters as possible that satisfy the expression. To * and +, there is no upper limit (the * corresponds to zero or more occurrences, and + corresponds to one or more occurrences), so the "greatest possible amount satisfying the expression" corresponds to "all digits" (1234567890).

Already \d{3,7} means "at least 3, at most 7 digits", so the maximum possible amount that satisfies the expression is 7, so the result is a1234567. And \d{3,} means "at least 3 digits, no maximum limit", so the maximum possible amount that satisfies the expression is "catch all digits", so the result is a1234567890.


Now if we switch all these quantifiers to their mode Lazy:

let s = 'a1234567890';

// com lazy, pega a menor quantidade possível
console.log(s.match(/a\d*?/)); // a
console.log(s.match(/a\d+?/)); // a1
console.log(s.match(/a\d{3,7}?/)); // a123
console.log(s.match(/a\d{3,}?/)); // a123

Now, thanks to the ? soon after, the quantifiers become "lazy" (Lazy), and begin to take the minor possible amount of characters satisfying the expression.

Like * means "zero or more occurrences", the smallest number of digits satisfying the expression is zero. Hence the result of the match is a.

Already the + means "one or more occurrences", so the smallest number of digits satisfying the expression is one. So the result is a1.

Finally, {3,7} means "at least 3, at most 7 occurrences" and {3,} means "at least 3 occurrences, no maximum limit", so for both the smallest number of digits satisfying the expression is three, so the result is a123.


It is worth remembering that the ? is also a quantifier (meaning "zero or an occurrence", which is another way of saying that something is optional), and therefore it can also be Greedy (greedy) or Lazy (lazy):

let s = 'a1234567890';

// sem lazy, pega a maior quantidade possível
console.log(s.match(/a\d?/)); // a1

// com lazy, pega a menor quantidade possível
console.log(s.match(/a\d??/)); // a

In the first case, \d? indicates that the digit is optional. But by default, every quantifier is greedy, so he tries to take as many characters as possible (which in this case is 1), so in the first case the return is a1.

In the second case, the quantifier ?? is lazy (the first ? indicates that the \d is optional, and the second ? makes the quantifier Lazy) and takes as few characters as possible, which in this case is zero. So the return is a.


When to use? Depends

It’s very common for people to use .*? without even stopping to think if this is what they need, simply because it is "easier". This usually happens because they first try with .*, that generates the problems pointed out in the other answers: how the point corresponds to any character (except line breaks, unless set that up), it may end up picking up more characters than it should, if regex finds it necessary. And then end up changing to Lazy and "ready".

Using the example of answer from Peter:

let string = 'Eu quero pegar uma palavra que comece com "q" seguida por um espaço.';
console.log(string.match(/(q.*) /)[1]); // quero pegar uma palavra que comece com "q" seguida por um
console.log(string.match(/(q.*?) /)[1]); // quero

The goal was to pick up a word that starts with "q", followed by a space. But as the * is greedy, by default he tries to pick up as many characters as possible. And as the point corresponds to any character (including space), it ends up taking more spaces. To avoid this problem, the quantifier was used Lazy, so the regex does not advance more than "should" and only picks up the first space.

But in that case you wouldn’t need to use the point, and neither would the quantifier Lazy. If you want to stop at the first space you have after the "q", you better be more specific and say exactly what you want. In this case, you do not want "anything", but "anything other than a space". Therefore, the regex could be:

let string = 'Eu quero pegar uma palavra que comece com "q" seguida por um espaço.';
console.log(string.match(/(q[^ ]*) /)[1]); // quero

I used the character class denied [^ ] (anything that nay is space) - note that there is a space before the ]. So the quantifier doesn’t even have to be Lazy, for the [^ ]* already assures me that regex will only advance until it finds a space. This leaves not only regex clearer as to your intention, but also more efficient: just compare here and here the number of steps each performs and see which option with [^ ] needs fewer steps than .*?.

This happens because, although the quantifier Lazy .*? work, it charges its price. Basically, after finding the first q (in the word "I want"), .*? check if soon after has a space (he is lazy, so first he tries with zero characters after the q). But as soon as there is no space, the regex comes back and the .*? finds the u, and then checks if the next character is a space. As it is not, it comes back and the .*? consumes the lyrics e, and moves on to see if there’s a gap later. Since there isn’t, she comes back and takes the r, and so on, until we find the space. All this back and forth is called backtracking, and it’s a costly process, depending on the regex and the strings she’s checking. The point is that regex will always test all possibilities, until it finds a match (or even realize that there is none). And using the point, the possibilities increase even more, since it corresponds to any character.

Already using [^ ] (or any other expression more restrictive than the point), things change: as the [^ ] was used with a quantifier Greedy, the behavior of regex is to advance at once, as many characters as possible - and in this case, it advances until finding a space, which is exactly what we wanted (and the best: without the backtracking which occurs with .*?). That is, I took advantage of the greedy quantifier behavior to make regex more efficient (if I make it Lazy there will be no gain, for it will be done backtracking, see). Also, the fact that I use a more specific expression (anything other than space) instead of the point (which can be anything), assures me that regex will not go any further than it should.

Obviously that for small or regex strings that will be executed a few times, in situations where performance is not so critical, etc., the difference is not so great, and in any case if efficiency is an important requirement, tests must be carried out to determine whether the regex is indeed the bottleneck. And the exact amount of steps executed by regex may vary depending on the situation, as some languages and Engines can perform certain optimizations depending on the expression and/or strings involved (but the backtracking caused by a quantifier Lazy continues to be a factor that worsens performance, regardless of that).

Anyway, I still think it’s important to know that the indiscriminate use of .*? not always the best solution. Not always you want "anything", often you want "anything, as long as it’s not X, Y or Z".

Note: there is still another difference. The point, by default, does not consider line breaks, but [^ ] yes (see). If you don’t want to catch the line breaks, you can switch to q\S* (the shortcut \S is everything that is not space or line break - see), or something more specific like q[a-z]* (see). Anyway, there are several options for not needing the quantifier Lazy.


Another case where the use of the quantifier Lazy Not so interesting is when you want to catch "everything up to the end of the line". As the point, by default, does not consider line breaks, I could do:

let string = "Vou pegar tudo depois do número 1: esse é o trecho que quero pegar, até o final da linha\nEsse trecho está em outra linha e não quero";
console.log(string.match(/\d: (.*)/)[1]); // esse é o trecho que quero pegar, até o final da linha

In this case, the regex has \d: (a number followed by : and space), and then I have (.*) (zero or more occurrences of anything). But by default the point does not consider the line breaks, so it advances until find the \n. I take advantage of the behavior Greedy to make the regex go where I want without having to backtracking (see here her working).

But if I change the regex to \d: (.*?), it does not take the rest of the line (because it will take the least amount of characters, and in this case it is zero, as the first examples above) - see. In this case I would have to indicate that I want everything up to the line break by changing the regex to \d: (.*?)\n, for example. But there the quantifier Lazy will make the backtracking to check if you have reached the end of the line, making regex less efficient, see.

In short, not always the quantifier Lazy is the best solution (moreover, it is not always the solution).


For your examples, the result is the same because you used the markers ^ and $ (string start and end), then ^.*$ and ^.*?$ will pick up "anything from the beginning to the end of the string", IE, whatever, because you will always find a match (the difference is that .* will give match in empty strings, while .+ requires there to be at least one character). But there is still a difference in performance, see here, here, here and here.

What about the indiscriminate use of .*? and the consequences of forcing the regex to do many backtrackings, worth reading this article.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.