Regex - set 2 limits and pick up all content inside

Question

Asked 5 years, 12 months ago

Viewed 319 times

5

In the string below, I need to get the text content from ENERGIA ELETRICA CONSUMO up to the percentage symbol %. That is to say:

ENERGIA ELETRICA CONSUMO kWh 370 0,787324 291,31 291,31 29,00%

Is it possible to do this through regex? I haven’t found any tutorial that explains how to set 2 limits and pick up all the content between them.

"5758 6128 370 kWh 1 370 kWh 11,93 kWh 12/07/2019 01/08/2019
Mês kWh Dt.Pgto. Valor
06/2019 355 17/06/2019 343,79
NOTA FISCAL/CONTA DE ENERGIA ELÉTRICA N° 085.976.711 - SÉRIE B
Emitida em 03/07/2019
05/2019 332 14/05/2019 306,84
Produto Valor Valor Base Aliq.
04/2019 415 29/05/2019 378,67
Descrição Un. Consumo Unitário Total Cálc. ICMS
03/2019 308 14/05/2019 269,81
ENERGIA ELETRICA CONSUMO kWh 370 0,787324 291,31 291,31 29,00%
02/2019 347 26/04/2019 305,09
ENERGIA CONS. B.AMARELA kWh 0,54 0,54 29,00%
01/2019 341 11/03/2019 324,39
12/2018 301 26/12/2018 287,83"

4 answers

5

Given that the terminating delimiter can appear only once and that it is a single character (in your example, the %), I made a small adjustment to nullptr's answer for that purpose:

ENERGIA ELETRICA CONSUMO[^%]*%
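
A minimal sketch of this pattern applied with Python's re module. The variable names and the shortened sample text are assumptions made for illustration; the regex itself is the one above.

```python
import re

# Shortened excerpt of the bill text from the question (illustrative).
text = (
    "03/2019 308 14/05/2019 269,81\n"
    "ENERGIA ELETRICA CONSUMO kWh 370 0,787324 291,31 291,31 29,00%\n"
    "02/2019 347 26/04/2019 305,09\n"
    "ENERGIA CONS. B.AMARELA kWh 0,54 0,54 29,00%\n"
)

# [^%]* consumes every character that is not '%',
# so the match cannot run past the first '%'.
pattern = re.compile(r"ENERGIA ELETRICA CONSUMO[^%]*%")

match = pattern.search(text)
result = match.group(0) if match else None
print(result)
```

Because the negated class can never consume a %, the match ends at the first % after the initiator, even though a later line also contains one.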

Now, in general terms:

For any initiator sequence INIT, which may be repeated within the matched text, and a single-character terminator @ that cannot appear anywhere except at the end of the sequence:

INIT[^@]*@

For a single-character initiator £ that cannot be repeated inside, and a single-character terminator @ that also cannot be repeated:

£[^£@]*@
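
A small sketch comparing the two generic patterns in Python. The sample strings and variable names are made up for illustration.

```python
import re

# INIT may reappear inside the match; '@' may not.
repeatable_init = re.compile(r"INIT[^@]*@")
# Neither '£' nor '@' may reappear inside the match.
unique_init = re.compile(r"£[^£@]*@")

first = repeatable_init.search("xx INIT a INIT b@ c@").group(0)
second = unique_init.search("£ab£cd@ef@").group(0)

print(first)   # INIT a INIT b@
print(second)  # £cd@
```

In the second case the attempt starting at the first £ fails when it reaches the second £, so the engine retries from there and the match starts at the later initiator.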

If you want to accept an escape character (let's pretend it is #), where the escape itself can be escaped and the initiator £ or the terminator @ may appear in the middle of the sequence only when escaped (never unescaped):

£([^#£@]|#.)*@

This one is worth an explanation:

it begins with £, as expected from an initiator
it ends with @, as expected from a terminator
in between, it may contain any number of repetitions of g1 or g2
g1 is any character that is not the escape, the initiator or the terminator
g2 is the escape followed by any character, including (but not limited to) the initiator, the escape and the terminator

As a consequence of how g1 is built, an escape can only appear in g2. And g2 always begins with an escape and consumes one more character; therefore, there are no dangling escapes: each one is always followed by something.
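
The behaviour of g1 and g2 can be checked with a short Python sketch. The sample strings are assumptions chosen to exercise an escaped terminator, an escaped escape and a dangling escape.

```python
import re

# Initiator '£', terminator '@', escape character '#'.
pattern = re.compile(r"£([^#£@]|#.)*@")

# '#@' is an escaped terminator and '##' is an escaped escape,
# so neither of them ends the sequence.
escaped = pattern.search("before £ab#@cd##@ after @").group(0)
print(escaped)  # £ab#@cd##@

# Without escapes, the first '@' ends the sequence.
plain = pattern.search("£ab@cd@").group(0)
print(plain)  # £ab@

# A dangling escape at the end has no character to pair with: no match.
dangling = pattern.search("£ab#")
print(dangling)  # None
```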

If the sequence that cannot be repeated in the middle is longer than 1 character, things get a little more complicated and uglier to write.

Note that I am using only "pure" regex here, the kind that can be represented by a finite-state automaton; therefore, backreferences are outside the scope of what I write below.

Take, for example, the sequence @€ as the terminator, which must not appear in the middle. Assume that the initiator £ may be repeated and that there are no escapes.

A lone @ does not mean that the sequence has ended: it is ordinary content as long as the next character is not €. It may also happen that an @ sits right before the terminator @€. So, at the core of the regex, I need to accept:

an @ followed by anything that is not €
a € itself, as long as the repetition has not just matched something that ends in @
possibly a partial prefix of the terminator

I can split the core into a repeatable part and an incomplete part. The incomplete part, for a 2-character terminator only, is (@+)? (I will explain the Kleene plus shortly). And the part that can be repeated?

([^@]|@+[^€])*

The intent is that an arbitrary sequence matched by that repetition never ends with a pending @. The repetition is composed of 2 groups: g1, which is anything but the first character of the terminator, and g2, which is the first character of the terminator (possibly repeated) necessarily followed by something that breaks the terminator sequence.

Then, to also allow a partial start of the terminator right before the real one, I optionally allow its first character to be repeated any number of times after this sequence, which leaves the entire regex as:

£([^@]|@+[^€])*(@+)?@€
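
A hedged sketch of how this regex behaves in Python. The sample strings are made up. The last part probes one edge case I noticed while testing: since [^€] can itself match an @, a doubled @ directly before a € inside the content lets the match run past the first @€. The stricter variant shown is my own assumption, not part of the answer.

```python
import re

# Initiator '£', two-character terminator '@€', no escapes (regex from the text above).
pattern = re.compile(r"£([^@]|@+[^€])*(@+)?@€")

# Lone '@' characters inside the content do not end the match.
text = "x £abc@def@@ghi@€ tail @€"
result = pattern.search(text).group(0)
print(result)  # £abc@def@@ghi@€

# Edge case: a doubled '@' directly before '€' inside the content.
edge = "£a@@€b@€"
loose = pattern.search(edge).group(0)

# Hypothetical stricter variant (an assumption, not from the answer):
# the class after '@+' also excludes '@'.
strict_pattern = re.compile(r"£([^@]|@+[^@€])*(@+)?@€")
strict = strict_pattern.search(edge).group(0)

print(loose)   # £a@@€b@€
print(strict)  # £a@@€
```

Whether the stricter class is needed depends on whether a run of @ followed by € can occur in the middle of your data.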

What if the terminator was 3 characters long? Like @€¥?

Well, the idea is similar, but you will need to negate the first character, the second character, and the third character of the sequence. How do we do this?

the negation of the first character is direct: [^@]
the negation of the second character onwards must assume the presence of the first character, possibly repeated, so I will put them all in a group preceded by @*
the negation of the second character (given that the repetition of the first character is handled outside) starts by assuming that the first character matched and then negates the second: @[^€]
the negation of the third character (again assuming the repetition of the first character has already been handled) needs to assume that the first two characters matched: @€[^¥]

So that leaves the repeatable part:

([^@]|@*(@[^€]|@€[^¥]))*

For the incomplete-sequence part, just remove the negated classes and the final Kleene star, replacing the star with optionality. I will write '' for the zero-length string just for the sake of visualization, and eliminate it afterwards:

(''|@*(@''|@€''))?

Since concatenating the empty string adds nothing to the desired regex:

(''|@*(@|@€))?

Since an alternative between nothing and something else makes no sense here (the group is already optional), only the something else remains:

(@*(@|@€))?

I could go on with just that, but if you pay attention, it can be replaced by something more expressive:

(@*(@(€)?)?)?

Here, each parenthesis after the repetition of the first character of the terminator indicates that the rest of the subsequence is optional. Note also that matching this group only makes sense if it contains at least part of the sequence, which necessarily starts with the first character. So it can be rewritten like this:

(@+(€)?)?

Dropping the redundant parentheses:

(@+€?)?

Note also that this same algorithm could be used for an arbitrary string such as abcdefX:
(a+(b(c(d(ef?)?)?)?)?)?
That expression matches any proper prefix of abcdefX (with the leading a possibly repeated).
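
A quick way to check the prefix expression in Python; the candidate strings are chosen for illustration only.

```python
import re

# Nested optional groups: each level adds one more character of "abcdef".
prefix = re.compile(r"(a+(b(c(d(ef?)?)?)?)?)?")

candidates_ok = ["", "a", "ab", "abc", "abcd", "abcde", "abcdef", "aaab"]
candidates_bad = ["b", "ac", "abd", "abcdefX"]

# fullmatch requires the whole string to be consumed by the pattern.
accepted = [s for s in candidates_ok if prefix.fullmatch(s)]
wrongly_accepted = [s for s in candidates_bad if prefix.fullmatch(s)]

print(accepted)
print(wrongly_accepted)
```

The empty string is accepted as well, because the outermost group is optional.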

And what would the whole expression look like?

£([^@]|@*(@[^€]|@€[^¥]))*(@+€?)?@€¥
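
A minimal sketch exercising the three-character terminator in Python. The sample string is an assumption: it contains a lone @ and an incomplete @€ in the middle, and it does not cover the repeated-character cases mentioned below.

```python
import re

# Initiator '£', three-character terminator '@€¥', no escapes.
pattern = re.compile(r"£([^@]|@*(@[^€]|@€[^¥]))*(@+€?)?@€¥")

# '@t' and '@€t' are broken terminators, so they count as ordinary content.
text = "£one@two@€three@€¥ rest @€¥"
match = pattern.search(text)
result = match.group(0) if match else None
print(result)  # £one@two@€three@€¥
```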

You can extend that logic to more characters, but I have to admit it is quite laborious and the size increases exponentially. I also did not take into account the possibility of repeated characters within the terminator sequence; that may eventually produce a complex case that has not been handled properly.

2

Very good (I already voted yesterday, but I only came to comment now)! About £([^#£@]|#.)*@, I think it is worth mentioning that in these cases it is possible to apply the "unroll the loop" technique. Compare here and here the difference in the number of steps of each.

– hkotsubo

2019/07/30 at 12:22
I don't know this "unroll the loop" technique; I will study the subject and update the answer as soon as I can.

– Jefferson Quesado

2019/07/30 at 12:23
Excellent explanation; the development of the reasoning up to the answer helps to clarify the final result.

– Paz

2019/07/30 at 12:26
@hkotsubo, is there any question here on SOpt about unrolling the loop? I'm beginning to understand the technique, but I must admit my understanding is still very hazy.

– Jefferson Quesado

2019/07/30 at 20:07
I did a quick search and couldn't find any. Maybe it exists under some other name or related terms (like "regex performance", "regex slow", etc.), but I didn't search that deeply...

– hkotsubo

2019/07/30 at 20:10
@hkotsubo, I found an answer from you that mentions unroll, haha: https://answall.com/a/383091/64969

– Jefferson Quesado

2019/07/30 at 20:14
Yes, it's the only reference I found... :-) I remember seeing old questions about regex performance, but I think they were not specific to this technique, just more generic. Maybe there isn't a question just about that...

– hkotsubo

2019/07/30 at 20:15
By the way, in that answer you found, I mention the book that explains the technique in detail. It is a dense read (I have reread it a few times and still don't understand 100%), but very interesting :-)

– hkotsubo

2019/07/30 at 20:25

by hkotsubo • **55,826** points · Answer 1 · 2019-07-30T12:12:23+00:00

Just to supplement the other answers: the best option for your specific case would in fact be the first regex from @Jefferson Quesado's answer:

ENERGIA ELETRICA CONSUMO[^%]*%

It uses a negated character class, [^%], which means "any character that is not %". There is just one detail: if the text has something like:

ENERGIA ELETRICA CONSUMO%

the regex will also find a match. This happens because the quantifier * means "zero or more occurrences". That is, there may be nothing between "ENERGIA ELETRICA CONSUMO" and the %.

To force at least one character between the text "ENERGIA ELETRICA CONSUMO" and the %, just change the * to + (one or more occurrences). You can also use custom quantifiers, with well-defined quantities. For example:

[^%]{5,20}: at least 5 and at most 20 characters that are not %
[^%]{20}: exactly 20 characters that are not %
[^%]{5,}: at least 5 characters that are not % (no upper limit)

Adjust the quantifier according to your needs. Of course, if you know that cases like ENERGIA ELETRICA CONSUMO% do not occur in your text, it makes no difference which one you use; but if you want to be more specific to avoid false positives, just choose the most appropriate option.
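
The difference between the quantifiers can be seen in a short Python sketch. The {5,50} bound is an arbitrary value picked for this illustration (the sample line has 37 characters between the initiator and the %), and the variable names are assumptions.

```python
import re

prefix = "ENERGIA ELETRICA CONSUMO"
empty_case = prefix + "%"
full_case = prefix + " kWh 370 0,787324 291,31 291,31 29,00%"

star = re.compile(prefix + r"[^%]*%")         # zero or more characters
plus = re.compile(prefix + r"[^%]+%")         # one or more characters
bounded = re.compile(prefix + r"[^%]{5,50}%")  # between 5 and 50 characters

def hits(regex):
    # Returns whether the regex matches the empty case and the full case.
    return [bool(regex.search(empty_case)), bool(regex.search(full_case))]

star_hits = hits(star)
plus_hits = hits(plus)
bounded_hits = hits(bounded)

print(star_hits)     # [True, True]
print(plus_hits)     # [False, True]
print(bounded_hits)  # [False, True]
```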

If the row format always follows the pattern indicated in the text (with that same amount of columns with numbers, for example), you can be even more specific:

ENERGIA ELETRICA CONSUMO kWh \d+( \d+,\d+){4}%

Now I have added \d+ (one or more digits) and ( \d+,\d+){4} (space, digits, comma, digits, repeated 4 times), so the regex only finds text in that specific format. If the format varies, you would need to adjust it to catch all the variations.
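
A brief Python check of the stricter regex. The "malformed" line is a made-up example with fewer numeric columns, used only to show that the format is being validated.

```python
import re

pattern = re.compile(r"ENERGIA ELETRICA CONSUMO kWh \d+( \d+,\d+){4}%")

good = "ENERGIA ELETRICA CONSUMO kWh 370 0,787324 291,31 291,31 29,00%"
other = "ENERGIA CONS. B.AMARELA kWh 0,54 0,54 29,00%"
# Hypothetical line with only two decimal columns instead of four.
malformed = "ENERGIA ELETRICA CONSUMO kWh 370 291,31 29,00%"

good_match = pattern.search(good)
other_match = pattern.search(other)
malformed_match = pattern.search(malformed)

print(bool(good_match), bool(other_match), bool(malformed_match))
```

With the simpler [^%]* version, the malformed line would be accepted as well; the stricter version trades flexibility for validation.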

Here you should decide whether the regex will validate the format of the information, or whether it will just take "anything" until it finds the first %. It is always a trade-off: a more complex regex can validate the format and the information, but it becomes harder to maintain and understand. A simpler regex finds what it needs, with the possibility of bringing in extra things (which would then need to be validated outside the regex).

by Paz • **3,062** points · Answer 2 · 2019-07-30T12:22:25+00:00

The answer by user nullptr is almost correct, but it will capture from the first occurrence of ENERGIA ELETRICA CONSUMO until the last occurrence of the character %.
So if there is more than one percentage sign and the text is not separated by line breaks, there will be unwanted captures.

I recommend using this regex: (ENERGIA ELETRICA CONSUMO(.|\n)*?%)

How it works:

It captures from the first occurrence of the sequence ENERGIA ELETRICA CONSUMO until the next occurrence of %; there may be any characters, including line breaks, between the two.
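
A small Python comparison of the greedy and lazy versions. The single-line and multi-line sample texts are assumptions built to contain more than one %.

```python
import re

# Two items on the same line, so there is more than one '%'.
text = "ENERGIA ELETRICA CONSUMO kWh 370 29,00% ENERGIA CONS. B.AMARELA 0,54 29,00%"

greedy = re.search(r"ENERGIA ELETRICA CONSUMO.*%", text).group(0)
lazy = re.search(r"(ENERGIA ELETRICA CONSUMO(.|\n)*?%)", text).group(1)

print(greedy == text)  # True: runs until the last '%'
print(lazy)            # ENERGIA ELETRICA CONSUMO kWh 370 29,00%

# (.|\n) also lets the lazy version cross a line break.
multiline = "ENERGIA ELETRICA CONSUMO kWh\n370 29,00% extra 5%"
lazy_multiline = re.search(r"(ENERGIA ELETRICA CONSUMO(.|\n)*?%)", multiline).group(1)
print(repr(lazy_multiline))
```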

You can check the operation of this regex here

by nullptr • **3,925** points · Answer 3 · 2019-07-30T00:56:55+00:00

1

Simple as that: ENERGIA ELETRICA CONSUMO.*%

EDIT

Removing the %: ENERGIA ELETRICA CONSUMO.*(?=%)

Delimiters can be excluded from the match by using lookaround.
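
A minimal Python sketch of the two variants on the target line alone; the variable names are illustrative.

```python
import re

line = "ENERGIA ELETRICA CONSUMO kWh 370 0,787324 291,31 291,31 29,00%"

# The '%' is consumed and therefore part of the match.
with_percent = re.search(r"ENERGIA ELETRICA CONSUMO.*%", line).group(0)

# The lookahead (?=%) only asserts that '%' comes next, without consuming it.
without_percent = re.search(r"ENERGIA ELETRICA CONSUMO.*(?=%)", line).group(0)

print(with_percent)
print(without_percent)
```

Both versions use a greedy .*, so on a line with more than one % they extend to the last one.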

See working here

Almost perfect! I need it to go only to the first %. Is it possible? Anyway, it has helped me a lot!

– milho

2019/07/30 at 01:05
1

@milho you, sir, are taking advantage of my goodwill -.- Good thing I have plenty :D

– nullptr

2019/07/30 at 01:29
3

Lookahead does not eliminate the problem of picking up extra things if there is more than one % (besides, the % is no longer part of the match). In this case, one solution would be to use .*?. Not to mention that lookahead is less efficient: compare the number of steps here, here and here. Of course, for small strings it doesn't make much difference, but I don't think it is necessary to use it in this case anyway.

– hkotsubo

2019/07/30 at 11:29