Optional group is not being captured

I have a huge file of records that I want to turn into a table. The file looks like this:

********************
SemprePresente1=09/2019
SemprePresente2=987456
Um monte de coisas  
Que não preciso
Opcional=698,00
Mais coisas que não preciso
********************
SemprePresente1=06/2019
SemprePresente2=123658
Um monte de coisas 
Que não preciso
********************
SemprePresente1=09/2019
SemprePresente2=987699
Um monte de coisas
Opcional=9999,00
Mais coisas que não preciso

I can capture the first two groups, but not the third, which is optional, using the following regex:

^[\*].+?SemprePresente1=(\d\d\/\d\d\d\d).+?SemprePresente2=(\d{6}).+?((:?Opcional=)[\d,]+)?[^\*]+

I need something like this:

09/2019;987456;698,00
06/2019;123658;
09/2019;987699;9999,00

However, using the replacement pattern \1;\2;\4\n in Notepad++, I only get this:

09/2019;987456;
06/2019;123658;
09/2019;987699;

Why am I not able to capture the optional group? Apparently, the pattern matches the entire record without running into the next one.

1 answer

You can use this regex:

^[\*].+?SemprePresente1=(\d{2}\/\d{4}).+?SemprePresente2=(\d{6})(?:(?!Opcional=)[^\*])+(Opcional=(\d+,\d+))?[^\*]+

You can see it working on regex101.com.


I made some changes to your regex. First, I used the quantifiers {2} and {4}, which mean "exactly 2 occurrences" and "exactly 4 occurrences" respectively. That is, \d{4} is the same as \d\d\d\d.

Another detail is that :? means the character : is optional, that is, the string may or may not contain a : at that point. I think you actually meant to use (?: to create a non-capturing group.
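To make the difference between (:? and (?: concrete, here is a minimal sketch in Python. Using Python's re module is my assumption for illustration only; Notepad++ has its own regex engine, although these two constructs should behave the same way there.

```python
import re

# (:?Opcional=) is an ordinary capturing group whose first token is an optional colon.
typo = re.compile(r"(:?Opcional=)")
# (?:Opcional=) is a non-capturing group: it groups without creating a numbered capture.
non_capturing = re.compile(r"(?:Opcional=)")

typo_groups = typo.groups                    # number of capturing groups in the pattern
non_capturing_groups = non_capturing.groups

# The typo version also accepts a leading colon, which was never intended.
colon_match = typo.fullmatch(":Opcional=")

print(typo_groups, non_capturing_groups, colon_match is not None)
```

This is also why \4 exists at all in your original pattern: the group you meant to be non-capturing was being counted.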

Anyway, your regex does not work because the optional group (the one containing "Opcional=") is skipped. The engine first reaches the .+? right after "SemprePresente2", and because this quantifier is "lazy", it tries to consume as few characters as possible. So the regex has not yet advanced far enough to reach "Opcional" (it is still just after the number that follows "SemprePresente2"), but it already tries to match the optional group at that position.

Since this group is optional, failing to match it there is not an error: the engine finds no "Opcional=" at that position, treats the group as absent, and moves on to the expression that comes after it.

After the optional group we have [^\*]+. Quantifiers are "greedy" by default and try to consume as many characters as possible, so this one consumes every character that is not *, advancing until it finds a *. The overall match succeeds at that point, so the engine never goes back to try the optional group again.

You can see this behavior here: use the arrow keys on the keyboard to follow the steps the regex engine takes (start from step 68 of the first match).
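The same behavior can be reproduced outside Notepad++. The sketch below runs your original pattern in Python on a shortened copy of the first record; the re.MULTILINE and re.DOTALL flags are my assumption for mimicking Notepad++ with the dot matching line breaks.

```python
import re

# Shortened version of the first record from the question.
text = (
    "********************\n"
    "SemprePresente1=09/2019\n"
    "SemprePresente2=987456\n"
    "Um monte de coisas\n"
    "Opcional=698,00\n"
    "Mais coisas que não preciso\n"
)

# The original pattern from the question, unchanged.
original = re.compile(
    r"^[\*].+?SemprePresente1=(\d\d\/\d\d\d\d).+?SemprePresente2=(\d{6})"
    r".+?((:?Opcional=)[\d,]+)?[^\*]+",
    re.MULTILINE | re.DOTALL,
)

m = original.search(text)

# The optional group did not participate in the match...
print(m.group(1), m.group(2), m.group(3))

# ...yet the "Opcional=" line is inside the overall match,
# because the greedy [^\*]+ at the end consumed it.
whole_record_matched = "Opcional=698,00" in m.group(0)
print(whole_record_matched)
```

Group 3 comes back empty even though the text it should have captured is part of the match, which is exactly the symptom you described.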


The solution is to put a negative lookahead (the part inside (?!...)) before the optional group. It checks that something is not ahead. The "trick" of a lookahead is that it only looks at what is ahead, then returns to where it was and continues evaluating the rest of the regex.

In this case, the relevant part is (?:(?!Opcional=)[^\*])+: first I check that "Opcional=" does not start at the current position, and then I check that the next character is not *. This is repeated over and over (thanks to the + quantifier). It ensures that the optional group is only tried once we have actually reached it (or, if it does not exist, the regex ends up reaching a * and stops).
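To see this part in isolation, here is a small Python sketch. The two input strings are hypothetical fragments modeled on your data, starting right after the SemprePresente2 number.

```python
import re

# Each repetition first asserts that "Opcional=" does not start here,
# then consumes one character that is not an asterisk.
tempered = re.compile(r"(?:(?!Opcional=)[^\*])+")

# Fragment of a record that has the optional field.
tail_with = "\nUm monte de coisas\nOpcional=698,00\nMais coisas\n********************"
# Fragment of a record that does not have it.
tail_without = "\nQue não preciso\n********************"

with_optional = tempered.match(tail_with).group(0)
without_optional = tempered.match(tail_without).group(0)

print(repr(with_optional))     # stops right before "Opcional="
print(repr(without_optional))  # stops right before the first "*"
```

In the first case the loop stops exactly where the optional group can match; in the second it stops at the record separator.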

After the lookahead comes Opcional=(\d+,\d+), which captures the number you need. Note that I changed this expression too: you were using [\d,]+, which works for your data but also accepts invalid things like ,,,, and 2,2,3,4.
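A quick way to compare the two number expressions is to test them against whole values. This is only a sketch in Python; I use fullmatch so that each sample must match from start to end.

```python
import re

loose = re.compile(r"[\d,]+")    # any mix of digits and commas
strict = re.compile(r"\d+,\d+")  # digits, exactly one comma, digits

samples = ["698,00", "9999,00", ",,,,", "2,2,3,4"]

loose_ok = [s for s in samples if loose.fullmatch(s)]
strict_ok = [s for s in samples if strict.fullmatch(s)]

print(loose_ok)   # every sample is accepted
print(strict_ok)  # only the two well-formed values are accepted
```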

Doing the replacement with \1;\2;\4\n in Notepad++, I got:

09/2019;987456;698,00
06/2019;123658;
09/2019;987699;9999,00
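If the file is too large to handle comfortably in an editor, the same replacement can be sketched in Python. This assumes Python 3.5 or later, where a group that did not take part in the match is replaced by an empty string, and it uses re.MULTILINE and re.DOTALL as stand-ins for the Notepad++ settings. The variable names are mine.

```python
import re

records = """\
********************
SemprePresente1=09/2019
SemprePresente2=987456
Um monte de coisas
Que não preciso
Opcional=698,00
Mais coisas que não preciso
********************
SemprePresente1=06/2019
SemprePresente2=123658
Um monte de coisas
Que não preciso
********************
SemprePresente1=09/2019
SemprePresente2=987699
Um monte de coisas
Opcional=9999,00
Mais coisas que não preciso
"""

fixed = re.compile(
    r"^[\*].+?SemprePresente1=(\d{2}\/\d{4}).+?SemprePresente2=(\d{6})"
    r"(?:(?!Opcional=)[^\*])+(Opcional=(\d+,\d+))?[^\*]+",
    re.MULTILINE | re.DOTALL,
)

# Group 4 is the number inside the optional group; it is empty when absent.
table = fixed.sub(r"\1;\2;\4\n", records)
print(table, end="")
```

On this sample it prints the same three lines as the Notepad++ result above.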

For more details on the "lazy" and "greedy" quantifiers, see this answer.

Remember also that, by default, the dot does not match line breaks, so you must enable the ". matches newline" option in Notepad++ for the regex to work (since it uses .+ to match stretches that span several lines).
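The effect of that option can be illustrated in Python, where the equivalent setting is the re.DOTALL flag (an analogy on my part, not the Notepad++ implementation):

```python
import re

two_lines = "SemprePresente1=09/2019\nSemprePresente2=987456"
pattern = r"SemprePresente1=.+?SemprePresente2="

# Without DOTALL the dot stops at the line break, so the pattern cannot
# get from the first line to the second.
default_match = re.search(pattern, two_lines)

# With DOTALL the dot also matches "\n" and the pattern spans both lines.
dotall_match = re.search(pattern, two_lines, re.DOTALL)

print(default_match is None, dotall_match is not None)
```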

    Thank you very much! Not only did it solve my problem, but the detailed explanation helped me understand what I was doing wrong.
