How to reduce the capture of groups in a regular expression?

4

I have the following expression:

/((segunda|terça|quarta|quinta|sexta|sábado|domingo)+(((-feira)?)+(.)+\(([0-9].*)\))?)/im

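It gives me:

Espetáculo Trueque se apresenta neste sábado no teatro da CDL (Show Trueque performs this Saturday at the CDL theater)

RESULT: sábado

Bandas de rock se apresentam nesta sexta (5) no espaço Marcus Moraes (Rock bands perform this Friday (5) at the Marcus Moraes space)

RESULT: sexta (5), sexta, 5, etc.

Loja de decoração inaugura novo espaço nesta terça-feira (15) (Decoration shop opens a new space this Tuesday (15))

RESULT: terça-feira (15), terça, -feira (15), 15 etc.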

How do I make this expression less 'greedy'?

  • I tested it here and it looked normal: http://www.regexr.com/ How are you running it?

  • How do you read the results?

  • I am running it with preg_match. I just thought maybe I could make it lighter, since I am using many groups '()'. I've read that the number of groups is inversely proportional to performance.

  • 3

    When you say "less 'greedy'", do you mean it figuratively or literally? There is a regex concept called greedy (in Portuguese, "guloso"), and whether or not you use it would affect your results (e.g., greedy: "terça-feira (15)"; lazy: "terça"). From what I understand from your previous comment, that is not what you mean, so I suggest changing the question title to an alternative expression ("lighter" or "more concise" seems good) to avoid ambiguity.

  • 1

    I suggest you edit your question to add an important detail: what result do you expect from the regex in every possible case? With or without examples, this is not so clear in your question. For the sentence "Bandas de rock se apresentam nesta sexta (5) no espaço Marcus Moraes", do you want ONLY "sexta (5)" as the result?

  • He showed all combinations and results - it’s in the question

  • @mgibsonbr I know it's been a long time, but I had never noticed this question, perhaps because of its title. I changed the title; anyone who uses regex knows what a group is, so "reduce the capture of groups" seems appropriate to me for people arriving from search engines, but if you have another idea it would help a lot.

3 answers

4

I believe this is the best (and simplest) expression for your case. Remember that a regular expression has to be not only efficient but also easy to understand.

/(((segunda|terça|quarta|quinta|sexta)(-feira)?)|sábado|domingo)(\s*\(\d+\))?/g

Works with:

segunda
terça
quarta
quinta
sexta
sábado
domingo
segunda-feira
terça-feira
quarta-feira
quinta-feira
sexta-feira
segunda (25)
terça (26)
quarta (27)
quinta (28)
sexta (29)
sábado (30)
domingo (31)
segunda-feira (1)
terça-feira (2)
quarta-feira (3)
quinta-feira (4)
sexta-feira (5)

Doesn’t work with:

sábado-feira
domingo-feira
sábado-feira (6)
domingo-feira (7)
-feira
-feira (8)

Link in regexr: http://regexr.com/39e5i
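
The two lists above can be checked with a short script. This is only a sketch: the helper name matchesWhole is hypothetical, and the pattern is wrapped in ^(?:...)$ (an addition, not part of the answer's regex) so that the whole string has to match. Without the anchors, "sábado-feira" would still produce a partial match on "sábado".

```php
<?php
// Pattern from the answer, wrapped in ^(?:...)$ so the ENTIRE string must match.
// The anchors are an assumption added for this check; the answer's regex has none.
$pattern = '/^(?:(((segunda|terça|quarta|quinta|sexta)(-feira)?)|sábado|domingo)(\s*\(\d+\))?)$/';

// Hypothetical helper: true when the whole text is accepted by the pattern.
function matchesWhole(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

// A few entries taken from the "works" and "doesn't work" lists.
$accepted = ['segunda', 'sexta-feira', 'sábado (30)', 'terça-feira (2)'];
$rejected = ['sábado-feira', 'domingo-feira (7)', '-feira', '-feira (8)'];

foreach (array_merge($accepted, $rejected) as $text) {
    echo $text, ' => ', matchesWhole($pattern, $text) ? 'match' : 'no match', PHP_EOL;
}
```

Every entry in $accepted should print "match" and every entry in $rejected "no match".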

3

In the comments you said you're using preg_match. If we look at the documentation, it says that the captured groups are also returned in the results.

In this case, the capture groups are created by the parentheses, and you use many of them in your regex. There is one group just for the weekday without the "-feira" suffix, another just for the "-feira" suffix, another for the numbers in parentheses, another around the whole expression, etc.

If you don't want so many groups and only need the whole match, just turn the parentheses into non-capturing groups by starting them with (?:. The parentheses around the entire expression are also unnecessary, since the section matching the whole regex is always returned. It would then look like this:

/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?/i

In PHP, the code would look something like:

// Three sample headlines, one per line
$str = 'Espetáculo Trueque se apresenta neste sábado no teatro da CDL'.PHP_EOL.
       'Bandas de rock se apresentam nesta sexta (5) no espaço Marcus Moraes'.PHP_EOL.
       'Loja de decoração inaugura novo espaço nesta terça-feira (15)';

// $matches[0] holds each full match; there are no capture groups to return
preg_match_all('/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?/i', $str, $matches);
foreach ($matches[0] as $m) {
    echo $m.PHP_EOL;
}

Output:

sábado
sexta (5)
terça-feira (15)

Now every parenthesis starts with (?:, so none of them is a capture group anymore. The corresponding sections are therefore no longer returned separately in the results; only the section matching the whole regex is.
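
To see the difference in what preg_match returns, here is a small comparison. It is a sketch: the $capturing pattern is just the same regex written with plain parentheses for illustration, and the variable names are arbitrary.

```php
<?php
$text = 'Loja de decoração inaugura novo espaço nesta terça-feira (15)';

// Same regex with plain (capturing) parentheses, written only for comparison.
$capturing = '/((segunda|terça|quarta|quinta|sexta)(-feira)?|sábado|domingo)(\s+\([0-9]+\))?/i';
// Non-capturing version shown in this answer.
$nonCapturing = '/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?/i';

preg_match($capturing, $text, $withGroups);
preg_match($nonCapturing, $text, $withoutGroups);

// Index 0 is always the whole match; every capturing group adds one more entry.
print_r($withGroups);    // whole match plus four groups
print_r($withoutGroups); // whole match only
```

Both calls find the same text at index 0; only the number of extra entries changes.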

Other improvements:

  • (?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo): the alternation says that only segunda through sexta (Monday to Friday) can be followed by "-feira" (the ? makes the "-feira" suffix optional). sábado and domingo (Saturday and Sunday) cannot take the "-feira" suffix.
  • \s+: the shortcut \s matches spaces and line breaks, among other characters (the exact list varies by language), and the quantifier + means "one or more occurrences". That is, there may be one or more spaces.
  • [0-9]+: one or more digits from 0 to 9. Here you could even use something like (?:3[01]|[12][0-9]|0?[1-9]) to accept only values between 1 and 31 (the valid values for a day of the month; days below 10 may have a leading zero, so both 1 and 01 are accepted). That is at your discretion.
  • (?:\s+\([0-9]+\))?: the entire "spaces + number in parentheses" section has a ? right after it, which makes the section optional.
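
As a sketch of the stricter day-of-month variant mentioned in the list, the following swaps [0-9]+ for (?:3[01]|[12][0-9]|0?[1-9]). The helper name findDay is hypothetical. Because the parenthesized part is optional, an out-of-range number does not make the match fail; the regex falls back to matching the weekday alone.

```php
<?php
// Hypothetical helper: returns the matched text, or null when nothing matches.
function findDay(string $text): ?string
{
    // Day of month limited to 1..31, with an optional leading zero below 10.
    $pattern = '/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)'
             . '(?:\s+\((?:3[01]|[12][0-9]|0?[1-9])\))?/i';

    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

foreach (['sexta (5)', 'sexta (05)', 'sexta (31)', 'sexta (32)', 'sexta (0)'] as $text) {
    // Out-of-range numbers such as (32) or (0) leave only the weekday matched.
    echo $text, ' => ', var_export(findDay($text), true), PHP_EOL;
}
```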

In your regex you were using the flags i and m (the /im at the end). The flag i makes the regex case insensitive, so if the string contains "Segunda" or "DOMINGO", it finds those too. The flag m changes the behavior of the anchors ^ and $ (they normally match the beginning and end of the string, but with the flag m they match the beginning and end of each line). Since you do not use these anchors, I removed the flag m from the regex.
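
The effect of the two flags can be checked with a quick experiment. This is a sketch with made-up sample text; it only uses weekday names written in ASCII letters, to keep the case-insensitive comparison simple.

```php
<?php
// Regex body without delimiters, so the flags can be varied below.
$base = '(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?';

// Made-up two-line sample with upper-case and capitalized weekday names.
$text = "Feira aberta no DOMINGO (7)\nShow na Sexta-feira (5)";

preg_match_all('/' . $base . '/', $text, $caseSensitive);    // no flags
preg_match_all('/' . $base . '/i', $text, $caseInsensitive); // flag i only
preg_match_all('/' . $base . '/im', $text, $withMultiline);  // flags i and m

print_r($caseSensitive[0]);   // nothing found: the names are not lower case
print_r($caseInsensitive[0]); // both weekdays found
print_r($withMultiline[0]);   // same as with i alone, since the regex has no ^ or $
```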


In the example above I used preg_match_all, which returns every occurrence in the string, but if you only want the first occurrence, use preg_match:

preg_match('/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?/i', $str, $matches);
foreach ($matches as $m) {
    echo $m;
}
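
For comparison, here is a sketch of what each function fills in, using a made-up sentence that contains three weekdays. Note that the shape of the result differs: with preg_match, index 0 is a string; with preg_match_all, index 0 is an array of strings.

```php
<?php
$pattern = '/(?:(?:segunda|terça|quarta|quinta|sexta)(?:-feira)?|sábado|domingo)(?:\s+\([0-9]+\))?/i';

// Made-up sample sentence with three weekdays.
$text = 'Show neste sábado, bandas nesta sexta (5) e loja nesta terça-feira (15)';

// preg_match stops at the first occurrence and returns 1 (or 0 when nothing is found).
$firstCount = preg_match($pattern, $text, $first);

// preg_match_all keeps going and returns the number of occurrences found.
$allCount = preg_match_all($pattern, $text, $all);

var_dump($firstCount, $first);
var_dump($allCount, $all[0]);
```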
  • 1

    This deserves a bounty, not for the whole answer, but mainly because you mentioned ?: (non-capturing). +1 (bounty coming soon, in 7 days)

0

I've simplified your expression a little to keep only the inner groups. I don't know if the goal is to get one result per sentence, but what I came up with was this:

(segunda|terça|quarta|quinta|sexta|sábado|domingo)+((-feira)?(.)+\(([0-9]{1,2})\))?
  • Note that this expression produces different results from the one Iago put in the question (perhaps more correct ones), because a . is missing. Test it with sexta (5sddf) and you will see.

  • True. I've adjusted it.
