Difference between metacharacters .* and +

Question

Asked 6 years, 4 months ago

Viewed 135 times

6

Consider this set of strings:

my_names <- c('onda', 'ondas', 'ondass', 'ondassssssss', 'ond', 'on')

Using the R language, I noticed that the metacharacters .* and + return the same result:

library(stringr)

str_extract(my_names, 'ondas.*')
#[1] NA             "ondas"        "ondass"       "ondassssssss" NA            
#[6] NA

str_extract(my_names, 'ondas+')
#[1] NA             "ondas"        "ondass"       "ondassssssss" NA            
#[6] NA

My questions:

What is the difference between the metacharacters .* and +?
When might they produce different results?

2 answers

8

The answer from @LipESprY already explains the differences very well; I would just like to complement it with some details.

The first, and maybe I'm being a bit pedantic, is that .* is two metacharacters: the dot (meaning "any character (except line breaks)") and the asterisk, which means "zero or more occurrences".

The +, in turn, means "one or more occurrences".

So the two expressions you used are not equivalent. You simply tested them with strings that happen to give the same result with both regexes. That was just a coincidence.

ondas.* means: the string "ondas" followed by zero or more characters (any characters except line breaks). That is, a bare "ondas" matches. "ondasabc123 xyz" also matches in full, and so does "ondasssss".

ondas+ means: the string "onda" followed by s+ (one or more occurrences of the letter "s"). It therefore matches "ondas", "ondass" and "ondasssss". In "ondasabc123 xyz", however, it matches only the leading "ondas", not the whole string.

So an example of a case where the two regexes give different results is:

my_names <- c('ondasabc123 xyz', 'ondassss')

library(stringr)

str_extract(my_names, 'ondas.*')

str_extract(my_names, 'ondas+')

The output is:

[1] "ondasabc123 xyz" "ondassss"
[1] "ondas"    "ondassss"

Note that the first regex took the whole string "ondasabc123 xyz", since it does correspond to "ondas" followed by zero or more characters (any characters other than line breaks). Remember also that the quantifier * is greedy by default and tries to grab as many characters as possible, so .* takes everything it can, up to the end of the string.

The second regex took only the "ondas" part of the first string, since that is the part that corresponds to "onda" followed by one or more letters "s". The rest of the string ("abc123 xyz") does not match the regex (s+ only takes occurrences of the letter "s"), so it is left out of the result.
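To make the greedy behaviour concrete, here is a minimal sketch that compares the default greedy quantifier with its lazy form (.*?). The lazy form is not part of the question; it is shown only as a contrast, and the variable names are illustrative.

```r
library(stringr)

x <- 'ondasabc123 xyz'

# Greedy (default): .* grabs as many characters as it can
greedy <- str_extract(x, 'ondas.*')

# Lazy: .*? grabs as few characters as it can (here, none at all)
lazy <- str_extract(x, 'ondas.*?')

print(greedy)  # "ondasabc123 xyz"
print(lazy)    # "ondas"
```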

See this example running on Ideone.com.

As already mentioned, the dot does not match line breaks:

my_names <- c('ondasabc\n123', 'ondassss')

library(stringr)

str_extract(my_names, 'ondas.*')

Note that the first string has a line break (the \n). So the result is:

[1] "ondasabc" "ondassss"

The .* takes all characters after "ondas", but it can only go as far as the line break, since by default the dot does not match line breaks. Therefore the regex only takes "ondasabc". See this example on Ideone.com.

Since the regexes are different, it is up to you to choose the one that corresponds to what you actually need. Ideally, the regex says exactly what you want and what you don't want.

Do you want "ondas" followed by anything? Then use the first option, with .*. Do you only want "onda" followed by one or more letters "s" (and no character other than "s")? Then use the second option, with s+.

Do you also want to match "onda" on its own? Then you can use ondas* ("onda" followed by zero or more letters "s"), or ondas?.* ("onda", followed by an optional "s" (the s? makes the letter "s" optional), followed by "anything"). Again, the choice depends on what you want after "onda" or "ondas": anything, or only letters "s".
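A short sketch of the two alternatives just described, ondas* and ondas?.*, run against a few illustrative strings (the input vector is made up for this example):

```r
library(stringr)

my_names <- c('onda', 'ondas', 'ondassss', 'ondaxyz', 'ond')

# "onda" followed by zero or more letters "s" only
only_s <- str_extract(my_names, 'ondas*')

# "onda", an optional "s", then anything up to a line break
anything <- str_extract(my_names, 'ondas?.*')

print(only_s)    # "onda" "ondas" "ondassss" "onda"    NA
print(anything)  # "onda" "ondas" "ondassss" "ondaxyz" NA
```

The two patterns only disagree on 'ondaxyz', where the characters after "onda" are not letters "s".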

If you also want to match line breaks, you can use ondas(.|\r\n?|\n)* (example on Ideone.com). The expression in parentheses uses alternation (the | character, which means "or") and covers three possibilities: the dot (any character except line breaks), or \r\n? (a carriage return (\r) followed by an optional \n, which covers the line breaks of older Mac OS versions, a lone \r, and of Windows, \r\n), or a lone \n (Unix line breaks).
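As a sketch of how the alternation behaves in R with stringr: the backslashes are doubled because the pattern is written as an R string literal, and the input strings are illustrative.

```r
library(stringr)

# One Unix line break, one Windows line break, and no line break
my_names <- c('ondasabc\n123', 'ondasabc\r\n123', 'ondassss')

# The group accepts any character OR an explicit line break
res <- str_extract(my_names, 'ondas(.|\\r\\n?|\\n)*')

# Each string is matched in full, line breaks included
print(identical(res, my_names))  # TRUE
```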

Another way to make the dot match line breaks is to use the dotall option (see it running on Ideone.com):

my_names <- c('ondasabc\n123', 'ondassss')

library(stringr)

str_extract(my_names, regex('ondas.*', dotall=TRUE))

The difference is that, with dotall enabled, every dot in the regex (if there is more than one, in different parts of the expression) is affected and matches line breaks (example). With (.|\r\n?|\n) (and dotall disabled), only that part matches line breaks, while dots in other parts of the regex still do not (example).

In any case, choose the one that best suits your needs. The regexes are not equivalent, so evaluate whether the difference matters for the strings you are testing. Check that they match what you need and also that they do not match what you don't need (while weighing whether some errors are acceptable, and whether it is worth making the regex more complicated to be more accurate).
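The contrast can be sketched with a pattern that has two dots. The input string and the pattern on.das are made up for illustration and are not taken from the linked examples.

```r
library(stringr)

# Hypothetical input with two line breaks: one inside "on\ndas", one later
x <- 'on\ndasabc\n123'

# dotall = TRUE: every dot in the pattern matches line breaks
with_dotall <- str_extract(x, regex('on.das.*', dotall = TRUE))

# Alternation only: the first dot still rejects "\n", so nothing matches
with_group <- str_extract(x, 'on.das(.|\\r\\n?|\\n)*')

print(identical(with_dotall, x))  # TRUE
print(is.na(with_group))          # TRUE
```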

1

Look at him! kkkkkk

– LipESprY

2019/03/31 at 13:28
2

@Lipespry I arrived a little late, but I managed to give my contribution. Thank you for recommending my answers, a sign that my studies are working :-)

– hkotsubo

2019/03/31 at 13:29
2

Your answers outshine a lot of documentation. kk

– LipESprY

2019/03/31 at 13:29


by LipESprY • **4,525** points · Answer 1 · 2019-03-31T08:05:40+00:00

What is the difference between the metacharacters .* and +?

The asterisk and the plus sign are quantifiers, where:

* matches zero or more occurrences;
+ matches one or more occurrences;

The dot (.), in turn, when outside a character class ([]), matches any character except a newline. This behavior may vary according to the flags (and/or the language). When escaped (\.), it matches a literal dot.
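A minimal sketch of the three cases in R with stringr (the test strings are illustrative; note that the backslash is doubled inside an R string literal):

```r
library(stringr)

x <- c('a.b', 'axb')

# Unescaped dot: any character between "a" and "b"
any_char <- str_detect(x, 'a.b')

# Escaped dot: only a literal "."
escaped_dot <- str_detect(x, 'a\\.b')

# Dot inside a character class: also a literal "."
dot_in_class <- str_detect(x, 'a[.]b')

print(any_char)      # TRUE  TRUE
print(escaped_dot)   # TRUE FALSE
print(dot_in_class)  # TRUE FALSE
```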

Given the examples:

ondas.*

Matches ondas followed by zero or more occurrences of any character (.) except a newline;

ondas+

Matches onda followed by one or more occurrences of s. That is why your expression also found ondass and ondassssssss, but not onda. Quantifiers apply to the expression immediately before them, hence one of the uses of groups: (...). See an example:

With the expression (ondas)+ in the text:

onda
ondas
ondass
ondassssssss
ond
on
ondasondasondasondas

One or more consecutive occurrences of ondas will be matched:

onda
'ondas'
'ondas's
'ondas'sssssss
ond
on
'ondasondasondasondas'
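The same comparison can be sketched in R with stringr, using the strings listed above; the ungrouped pattern is included only to show what the quantifier applies to in each case.

```r
library(stringr)

my_names <- c('onda', 'ondas', 'ondass', 'ondassssssss', 'ond', 'on',
              'ondasondasondasondas')

# + applies to the whole group: one or more repetitions of "ondas"
grouped <- str_extract(my_names, '(ondas)+')

# + applies only to the preceding "s"
ungrouped <- str_extract(my_names, 'ondas+')

print(grouped)
# NA "ondas" "ondas" "ondas" NA NA "ondasondasondasondas"
print(ungrouped)
# NA "ondas" "ondass" "ondassssssss" NA NA "ondas"
```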

If you want to delve deeper into regex, consider reading the related answers by @hkotsubo. xD