Difference between the greedy quantifiers ?? and *?


I have these strings:

x <- c('ondasffasf', 'ondassn\nlld', 'ondas', 'ond', 'ndasss', 'das')

And this code with ??:

library(stringr)

str_extract(x, regex('ondas??'))

[1] "onda" "onda" "onda" NA     NA     NA 

And also with *?:

str_extract(x, regex('ondas*?'))

[1] "onda" "onda" "onda" NA     NA     NA  

Both return the same result. I tried changing the strings in the vector to see whether the result would change, but it made no difference.

  • What’s the difference between the greedy quantifiers ?? and *??

1 answer



Short answer

For the regex you used, it makes no difference. But there are cases where it does.

Long answer

First, let’s recall what these quantifiers are.

The ? means "zero or one occurrence", which is another way of saying that something is optional. And the * means "zero or more occurrences" (i.e., in addition to being optional, it can repeat any number of times).

By default they are "greedy", that is, they try to match as many characters as possible. But when we put a ? right after them (i.e., ?? and *?), they become "lazy" (or non-greedy) and match as few characters as possible.
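To make the greedy/lazy distinction concrete, here is a minimal sketch that runs all four variants against a single string. The test string "ondasss" is my own choice for illustration; it is not from the question.

```r
library(stringr)

# Illustrative string (an assumption, not taken from the question)
x <- "ondasss"

greedy_optional <- str_extract(x, "ondas?")   # greedy ?: takes one s      -> "ondas"
greedy_star     <- str_extract(x, "ondas*")   # greedy *: takes every s    -> "ondasss"
lazy_optional   <- str_extract(x, "ondas??")  # lazy ??: takes no s        -> "onda"
lazy_star       <- str_extract(x, "ondas*?")  # lazy *?: takes no s either -> "onda"
```

The greedy versions differ from each other, while the lazy ones both stop at "onda" because nothing after the quantifier forces them to take more.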

The point is that the "smallest possible amount" depends on the context. In your case:

  • ondas?? means: the string "onda", optionally followed by an s, taking as few s as possible. In this case, the smallest possible amount is zero
  • ondas*? means: the string "onda", followed by zero or more letters s, taking as few s as possible. In this case, the smallest possible amount is also zero

For both, it doesn’t matter whether there are letters s after "onda". Since the quantifier is lazy, it will always take as few as possible. And in this case, taking zero letters s already satisfies the regex, so the match found in all cases is the string "onda".

- But there’s an s after "onda", grab it!
- Nah, I’m too lazy...

Anyway, that’s what laziness is all about: there may be an s afterwards, but the expression is already satisfied without it, so why take it?


So when does it make a difference?

Using ?? or *? makes a difference when the regex has something after them. For example:

library(stringr)

x <- c('ondaX', 'ondasX', 'ondasssX', 'onda', 'ondas')

str_extract(x, regex('ondas??X'))

str_extract(x, regex('ondas*?X'))

The regex ondas??X searches for "onda", optionally followed by an s, followed by an X. That’s why it only finds a match in the first two strings:

[1] "ondaX"  "ondasX" NA       NA       NA 

The first string ("ondaX") matches because it is "onda", followed by zero letters s, followed by the letter X (in that case, the smallest number of s that has an X right after it is zero). The second string matches because it is "onda", followed by one letter s, followed by the letter X (in that case, the smallest number of s that has an X right after it is one).

The third string does not match because there is more than one s between "onda" and X. Since the quantifier ?? only takes zero or one letter s, it does not cover cases with more than one s (in that string, the smallest number of s that has an X right after it is 3, but since ?? only considers zero or one occurrence, the regex finds no match).

The last two do not match because they don’t have an X.


The regex ondas*?X, on the other hand, searches for "onda", followed by zero or more letters s, followed by an X. That’s why it matches the first 3 strings:

[1] "ondaX"    "ondasX"   "ondasssX" NA         NA    

They all have "onda", then zero or more s, and then an X. Note that the "smallest possible amount" of s that has an X right after it varies: in the first string it is zero, in the second it is 1 and in the third it is 3.

The regex engine tries the possibilities in order until it finds the smallest amount that satisfies the expression (or until it sees that there is no match). First it tries zero occurrences of s. If that fails, it tries one; if that fails, it tries two, and so on, until it finds the smallest number of s that has an X right after it.
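If you want to see how many s the lazy quantifier ended up taking in each string, one option is to wrap it in a capture group and inspect it with str_match. This is just a sketch: the parentheses and the variable names are my additions, not part of the original regex.

```r
library(stringr)

x <- c('ondaX', 'ondasX', 'ondasssX', 'onda', 'ondas')

# Same regex as above, with the lazy quantifier inside a capture group
m <- str_match(x, 'onda(s*?)X')

# Column 1 holds the full match, column 2 the text captured by (s*?)
taken <- m[, 2]

# Number of s the engine had to accept before it could find the X
# (NA where the string did not match at all)
n_taken <- str_length(taken)
n_taken
```

For the first three strings the counts should be 0, 1 and 3, which is the "try zero, then one, then two..." sequence stopping at the first amount that lets the X match.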


Since your regex had nothing after the s, the engine doesn’t need to check what comes next. It can take zero as the smallest amount that satisfies the expression - that is, even if there is an s after "onda", it will not grab it, because laziness wins.

In my examples above, however, the regex needs to check whether there is an X after the s, and it only stops when it finds one (or when it has tested all the possibilities and seen that there is none). Lazy as it is, it always does what is asked.
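A quick way to confirm that the lazy version "always does what is asked" is to compare it with its greedy counterpart on the same strings. In this sketch, the greedy pattern ondas*X is my addition for comparison; it does not appear in the question.

```r
library(stringr)

x <- c('ondaX', 'ondasX', 'ondasssX', 'onda', 'ondas')

greedy <- str_extract(x, 'ondas*X')   # greedy: takes every s, then needs an X
lazy   <- str_extract(x, 'ondas*?X')  # lazy: takes as few s as possible, then needs an X

# Both are forced to consume every s before they can reach the X,
# so the extracted text is the same for these strings
same_result <- identical(greedy, lazy)
same_result
```

Here laziness only changes the order in which the possibilities are tried, not the final match, because the X right after the quantifier leaves a single valid amount of s for each string.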


For more cases involving the "lazy" quantifier, see this answer.
