Since the terminal delimiter can appear only once and is a single character (% in your example), I made a small adjustment to nullptr's answer to handle such an ending:
ENERGIA ELETRICA CONSUMO[^%]*%
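A quick way to try this pattern is Python's `re` module. Only the pattern comes from the answer; the sample line below is invented for illustration.

```python
import re

# Pattern from the answer: fixed prefix, any run of non-'%' characters, then one '%'.
PATTERN = re.compile(r"ENERGIA ELETRICA CONSUMO[^%]*%")

# Hypothetical input line, made up for this sketch.
sample = "ENERGIA ELETRICA CONSUMO 120 kWh 18% outros dados 5%"

match = PATTERN.search(sample)
# The negated class cannot cross a '%', so the match stops at the first one.
first_hit = match.group(0) if match else None
print(first_hit)
```

Because `[^%]*` can never consume a `%`, the match ends at the first terminator rather than the last.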
Now, in general terms:
For any initiating sequence INIT that may be repeated within the sequence, and a single-character terminator @ that cannot appear anywhere except at the end of the sequence:
INIT[^@]*@
For a single-character initiator £ that cannot be repeated inside the sequence, and a single-character terminator that cannot be repeated either:
£[^£@]*@
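The difference between the two general forms shows up when the initiator occurs twice. A small sketch in Python, with made-up input strings:

```python
import re

# INIT may recur inside the body; '@' may only appear as the last character.
repeatable_init = re.compile(r"INIT[^@]*@")
# Neither the initiator '£' nor the terminator '@' may recur inside the body.
single_init = re.compile(r"£[^£@]*@")

# Hypothetical inputs, invented for this sketch.
m1 = repeatable_init.search("INIT a INIT b@ tail@")
init_hit = m1.group(0) if m1 else None

m2 = single_init.search("£x£y@z")
single_hit = m2.group(0) if m2 else None

print(init_hit)
print(single_hit)
```

In the second case the search cannot start at the first `£`, because the body would have to cross another `£`; the match therefore begins at the last initiator before the terminator.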
If it is acceptable to have an escape character (say, #) that can itself be escaped, so that the initiator £ and the terminator @ may appear in the middle of the sequence only when escaped (never unescaped):
£([^#£@]|#.)*@
This deserves an explanation:
- it begins with £, as expected of an initiator
- it ends with @, as expected of a terminator
- in between, it may contain any number of repetitions of g1 or g2
  - g1 is any character that is not the escape, the initiator or the terminator
  - g2 is the escape followed by any character, including (but not limited to) the initiator, the escape and the terminator
Because of how g1 is built, an escape can only appear in g2. And g2 begins with a guaranteed escape followed by one more character; therefore there are no dangling escapes: each one is always followed by something.
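The behaviour of g1 and g2 can be checked with a few hand-made strings. This is only a sketch: the helper name `is_sequence` and the inputs are assumptions for illustration, and `fullmatch` is used so that the whole string must be one sequence.

```python
import re

# '#' escapes whatever character follows it, including '#', '£' and '@'.
ESCAPED = re.compile(r"£([^#£@]|#.)*@")

def is_sequence(s: str) -> bool:
    """True when the whole string is initiator + body + terminator."""
    return ESCAPED.fullmatch(s) is not None

for candidate in ["£abc@", "£a#@b@", "£a##@", "£a#£b@", "£a#@", "£a£b@"]:
    print(candidate, is_sequence(candidate))
```

`£a#@` is rejected because its only `@` is escaped, so no free terminator remains; `£a£b@` is rejected because of the unescaped `£` in the middle.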
If the sequence that cannot be repeated in the middle is longer than one character, things get a little more complicated and uglier to write.
Note that I am using only "pure" regex here, the kind that can be represented by a finite-state automaton; backreferences are therefore outside the scope of what I write below.
Take, for example, the sequence @€ as the terminator, which must not be repeated. Assume the initiator £ may be repeated, and that there are no escapes.
A lone @ does not mean that the sequence has ended: as long as the next character is not €, the sequence goes on. It may also happen that an @ sits right before the terminator @€. So, in the core of the regex, I need to accept:
- an @ followed by anything that is not €
- the € itself, provided that whatever was matched before it inside the repetition does not end in @
- possibly a fraction of the terminator
I can split the core into a repeatable part and an incompleteness part. The incompleteness part, for a 2-character terminator only, is (@+)?
(I will explain the Kleene plus in a moment). And the part that can be repeated?
([^@]|@+[^@€])*
So I can have an arbitrary sequence within that repetition that is guaranteed not to end with @. The repetition is made of 2 groups: g1, which is anything but the first character of the terminator, and g2, which is a run of the terminator's first character followed by something that necessarily breaks the terminator sequence. That something cannot be € (which would complete the terminator) and cannot be @ either: the run @+ already takes care of repeated @, and letting the class match @ would allow @@€ to slip through in the middle.
So, to also allow the initial part of the terminator, I let its first character be repeated any number of times, optionally, after this sequence, which gives the entire regex:
£([^@]|@+[^@€])*(@+)?@€
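To see why it matters which characters the class after the run of @ excludes, the sketch below compares two spellings of the middle group in Python: one whose class excludes both @ and €, and a looser one that excludes only €. The helper name and the test strings are made up for illustration.

```python
import re

# Class after the run of '@' excludes both '@' and '€'.
STRICT = re.compile(r"£([^@]|@+[^@€])*(@+)?@€")
# Looser spelling for comparison: only '€' is excluded after the run of '@'.
LOOSE = re.compile(r"£([^@]|@+[^€])*(@+)?@€")

def accepts(pattern: "re.Pattern[str]", s: str) -> bool:
    """True when the whole string is matched by the pattern."""
    return pattern.fullmatch(s) is not None

for s in ["£abc@€", "£a@b@€", "£a@@€", "£a@€b@€", "£@@€x@€"]:
    print(s, accepts(STRICT, s), accepts(LOOSE, s))
```

The two spellings agree on the first four strings. With the looser class, `£@@€x@€` is accepted even though @€ occurs twice: the middle group reads `@@` as "a run of @ followed by a non-€", and the following € then counts as an ordinary character.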
What if the terminator were 3 characters long, like @€¥?
Well, the idea is similar, but you need to negate the first character, the second character and the third character of the sequence. How do we do this?
- the negation of the first character is direct:
[^@]
- the negation of the second character onwards must assume the presence of the first character, possibly repeated, so I will put them all in a group preceded by
@*
- the negation of the second character (given that the repetition of the first character is handled outside the group) starts by assuming a successful match of the first character and then negates the second:
@[^€]
- the negation of the third character (again assuming the repetition of the first character has already been handled) needs to assume that the first two characters matched:
@€[^¥]
So that leaves the repeatable part:
([^@]|@*(@[^€]|@€[^¥]))*
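The steps above are mechanical enough to be written as a small generator. The function below is a hypothetical helper, not part of the answer; it simply follows the recipe as listed, so it inherits whatever limitations the recipe has.

```python
import re

def repeatable_part(terminator: str) -> str:
    """Build the repeatable group for a terminator of two or more characters
    by following the listed steps (hypothetical helper for illustration)."""
    if len(terminator) < 2:
        raise ValueError("terminator must have at least two characters")
    chars = [re.escape(c) for c in terminator]
    first = chars[0]
    # One alternative per position k >= 1: the first k characters matched,
    # followed by anything that is not character k.
    alternatives = [
        "".join(chars[:k]) + f"[^{chars[k]}]" for k in range(1, len(chars))
    ]
    return f"([^{first}]|{first}*({'|'.join(alternatives)}))*"

print(repeatable_part("abc"))
core = re.compile(repeatable_part("@€¥"))
print(core.fullmatch("x@y@€z") is not None)
print(core.fullmatch("x@€¥") is not None)
```

For the terminator `abc` this produces `([^a]|a*(a[^b]|ab[^c]))*`, the same shape as the expression above with other characters.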
For the incomplete-sequence part, just remove the negated classes and the final Kleene star, replacing the star with optionality. I will write '' for the zero-length string, just to make it visible, and then eliminate it:
(''|@*(@''|@€''))?
Since concatenating the empty string adds nothing to the regex:
(''|@*(@|@€))?
Since an alternation between nothing and something else makes no sense here (the whole group is already optional), only the other alternative remains:
(@*(@|@€))?
I could stop there, but if you look closely, it can be replaced by something more expressive:
(@*(@(€)?)?)?
Here each parenthesis after the repetition of the terminator's first character marks the rest of the subsequence as optional. Note also that matching this group only makes sense if at least part of the terminator is present, and that part must start with the first character. So it can be rewritten like this:
(@+(€)?)?
Dropping the redundant parentheses:
(@+€?)?
Note, however, that this same algorithm could be used for an arbitrary terminator such as abcdefX:
(a+(b(c(d(ef?)?)?)?)?)?
That expression matches any proper prefix of abcdefX, with the first character possibly repeated.
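The nesting can be generated from the terminator itself. The sketch below is a hypothetical helper written for this illustration; it drops the terminator's last character and wraps the remaining ones from the inside out.

```python
import re

def partial_prefix_pattern(terminator: str) -> str:
    """Build the optional 'incomplete terminator' group for a terminator of
    two or more characters (hypothetical helper for illustration)."""
    if len(terminator) < 2:
        raise ValueError("terminator must have at least two characters")
    proper = [re.escape(c) for c in terminator[:-1]]  # last character is left out
    first, rest = proper[0], proper[1:]
    inner = ""
    # Wrap from the innermost optional character outwards.
    for ch in reversed(rest):
        inner = f"({ch}{inner})?" if inner else f"{ch}?"
    return f"({first}+{inner})?"

print(partial_prefix_pattern("abcdefX"))
prefix = re.compile(partial_prefix_pattern("abcdefX"))
print([s for s in ["", "a", "aaabc", "abcdef", "abcdefX", "abd", "b"]
       if prefix.fullmatch(s)])
```

For `abcdefX` the helper returns exactly `(a+(b(c(d(ef?)?)?)?)?)?`. The full terminator is deliberately not matched, since that is the job of the literal terminator that follows this group in the complete expression.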
And what would the whole expression look like?
£([^@]|@*(@[^€]|@€[^¥]))*(@+€?)?@€¥
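A few hand-picked inputs show the complete expression at work. This is a sketch using Python's `re`; the helper name and the strings are invented for illustration and only exercise simple cases.

```python
import re

FULL = re.compile(r"£([^@]|@*(@[^€]|@€[^¥]))*(@+€?)?@€¥")

def accepts(s: str) -> bool:
    """True when the whole string is matched by the expression."""
    return FULL.fullmatch(s) is not None

for s in ["£abc@€¥", "£a@b@€c@€¥", "£a@€@€¥", "£abc@€", "£a@€¥b@€¥"]:
    print(s, accepts(s))

# When searching inside a longer text, the match stops at the first terminator.
hit = FULL.search("£a@€¥b@€¥")
first_sequence = hit.group(0) if hit else None
print(first_sequence)
```

`£a@€@€¥` is accepted thanks to the incompleteness part: the first `@€` is read as a fraction of the terminator, followed by the real terminator.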
You can extend that logic to more characters, but I have to admit it is quite laborious and the expression grows quickly with the terminator's length. I also did not take into account the possibility of repeated characters in the terminating sequence; this may produce a complex case that has not been handled properly.
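One way to look for cases that are not handled properly is brute force: enumerate short strings and compare the expression with a plain-Python statement of the goal. The `intended` function below is my reading of the requirement (starts with £, ends with the terminator, terminator occurs exactly once), so treat it as an assumption; the alphabet and length limit are arbitrary choices for the sketch.

```python
import re
from itertools import product

TERMINATOR = "@€¥"
FULL = re.compile(r"£([^@]|@*(@[^€]|@€[^¥]))*(@+€?)?@€¥")

def intended(s: str) -> bool:
    """Assumed goal: starts with '£', ends with the terminator,
    and the terminator occurs exactly once."""
    return (s.startswith("£") and s.endswith(TERMINATOR)
            and s.count(TERMINATOR) == 1)

def disagreements(max_body_len: int, alphabet: str = "@€¥x"):
    """Yield strings on which the regex and the plain check disagree."""
    for n in range(max_body_len + 1):
        for body in product(alphabet, repeat=n):
            s = "£" + "".join(body)
            if (FULL.fullmatch(s) is not None) != intended(s):
                yield s

found = list(disagreements(7))
print(found[:5])
```

Among the strings this reports is `£@@€¥@€¥`: the doubled @ lets the repeatable part consume `@@`, after which the first `€¥` is treated as ordinary text and the match runs on to a second terminator.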
It also did not occur to me that an empty core might not be valid. Thank you for the remark
– Jefferson Quesado
The capture group with the negated character "%" is genius, excellent answer.
– Paz
@Jeffersonquesado It was not mentioned in the question, so the oversight is forgivable :-)
– hkotsubo
@Paz Credits to Jefferson, who put this option first in his answer :-)
– hkotsubo