I don't know exactly which software you are using, but since you are testing on regexr.com, I'll assume PCRE features are available. (There are several regex "flavors", with a lot of variation between languages and tools, so I leave a more general solution below, which should work for your case.)
I do not understand why you consider your regex "vulnerable". Maybe it is not the most efficient, since it has several alternations (the character |, which makes the engine try each alternative until one matches) and many of them are somewhat redundant (you test several different combinations of numbers separated by colons and commas).
The number you want to capture seems to be a monetary value, always using a period as the thousands separator and a comma as the decimal separator. So you can use:
\d{1,3}(?:\.\d{3})*,\d{2}
You were using \d+, which means "one or more digits", so \d+,\d+ also accepts values such as 12345,12345. With more specific quantifiers, it is possible to restrict the match to exactly the quantities we need.
Here, \d{1,3} is "at least 1 and at most 3 digits". Next comes the group in parentheses, (?:\.\d{3}), which is "a period followed by exactly 3 digits". Right after the parentheses comes *, which means "zero or more occurrences". So this whole "period followed by 3 digits" can be repeated several times (or not at all). This ensures that values such as 1,23, 123,12, 1.212,21 and 8.571.962,06 are accepted.
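To check these claims quickly, here is a minimal sketch in Python (my choice of language for illustration only, since the question does not say which tool is in use). It uses re.fullmatch, so the whole string has to fit the pattern, and the sample values are the ones mentioned above.

```python
import re

# Pattern from this answer: 1-3 digits, optional ".ddd" groups, a comma, 2 digits.
MONEY = re.compile(r"\d{1,3}(?:\.\d{3})*,\d{2}")
# The looser pattern discussed above, for comparison.
LOOSE = re.compile(r"\d+,\d+")

samples = ["1,23", "123,12", "1.212,21", "8.571.962,06", "12345,12345"]

for value in samples:
    strict_ok = MONEY.fullmatch(value) is not None
    loose_ok = LOOSE.fullmatch(value) is not None
    print(f"{value!r}: strict={strict_ok} loose={loose_ok}")
```

Only the last sample should be rejected by the stricter pattern while still being accepted by the looser one.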
You then used \s, which does match line breaks, but also matches spaces and the TAB character (among others; the exact list varies depending on the language/tool/engine). In this case the value you want (8.571.962,06) has no spaces, so it "works", but if you want to restrict the match to line breaks only, you can use just \n.
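The difference is easy to see on a small string. The sketch below is in Python, used here only for illustration, and the sample string is made up:

```python
import re

text = "a b\tc\nd"  # contains a space, a TAB and a line break

whitespace = re.findall(r"\s", text)   # every whitespace character
line_breaks = re.findall(r"\n", text)  # line feeds only

print(whitespace)
print(line_breaks)
```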
If you want to be even more specific, you can use (?:\r\n?|\n): it matches a lone \r (line breaks on classic Mac OS, before OS X), a \r followed by \n (Windows line breaks) or just a \n (Unix line breaks). The regex then becomes:
\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))
This returns only the line before "OUTRAS INFORMAÇÕES" ("OTHER INFORMATION"), and only if that line ends with a monetary value in the given format (if, for example, the comma is missing or the number of digits after it is wrong, the regex finds nothing).
Note that for the "OUTRAS INFORMAÇÕES" line I used a lookahead (indicated by (?=). The idea of a lookahead is that it only checks whether something comes next, but that something is not included in the match. That is why the regex finds only the previous line, leaving the "OUTRAS INFORMAÇÕES" line itself out of the result.
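As an illustration of the lookahead, here is a small Python sketch. The sample text is hypothetical (only the "OUTRAS INFORMAÇÕES" marker and the value come from the question), and Python is just the language I picked for the demonstration.

```python
import re

# Hypothetical sample text; only the marker line and the value come from the question.
text = "TOTAL\n8.571.962,06\nOUTRAS INFORMAÇÕES\nend\n"

pattern = re.compile(
    r"\d{1,3}(?:\.\d{3})*,\d{2}"                       # the monetary value
    r"(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"  # must be followed by the marker line
)

match = pattern.search(text)
value = match.group(0) if match else None
print(value)
```

The match contains only the value: the line break and the marker line are checked but not consumed.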
If the previous line has any other text before the monetary value, the regex above ignores it and takes only the value.
If you want to be even more specific and only take the value when it is the only thing on the line, you can add ^ at the beginning of the regex, with one caveat: usually ^ means "beginning of the string", but several languages and tools have an option that changes its meaning to "beginning of the line". On regexr.com this option is called "multiline" (enable it with the "flags" button in the upper right corner). Each tool has its own way of configuring this, so check how it is done in the one you are using. The regex is almost the same as the previous one, only with a ^ at the start:
-- Only works with the "multiline" option enabled, since ^ then means "beginning of the line"
^\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))
This way the regex only finds a match if the line contains nothing but the monetary value.
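In Python, for example, the "multiline" option is the re.MULTILINE flag. The sketch below uses made-up input strings and shows the three cases: value alone on its line with the flag, the same input without the flag, and a line that has extra text before the value.

```python
import re

PATTERN = r"^\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"

# Hypothetical inputs: in the second one the value shares its line with other text.
alone = "header\n8.571.962,06\nOUTRAS INFORMAÇÕES\n"
with_text = "header\nTotal: 8.571.962,06\nOUTRAS INFORMAÇÕES\n"

with_flag = re.search(PATTERN, alone, flags=re.MULTILINE)      # ^ means "start of a line"
without_flag = re.search(PATTERN, alone)                       # ^ means "start of the string"
extra_text = re.search(PATTERN, with_text, flags=re.MULTILINE)

print(with_flag.group(0) if with_flag else None)
print(without_flag)
print(extra_text)
```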
I noticed that in your text there is a space at the end of each line. In that case, just add \s* (zero or more whitespace characters) to the regex, both after the monetary value and after "OUTRAS INFORMAÇÕES":
\d{1,3}(?:\.\d{3})*,\d{2}(?=\s*(?:\r\n?|\n)OUTRAS INFORMAÇÕES\s*(?:\r\n?|\n))
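A quick Python sketch of the effect, with a hypothetical sample in which every line ends with a space. The names strict and tolerant are mine, used only to compare the version without \s* against the one with it.

```python
import re

# Version without \s*: the line break must come right after the value.
strict = re.compile(
    r"\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"
)
# Version with \s*: trailing whitespace is tolerated on both lines.
tolerant = re.compile(
    r"\d{1,3}(?:\.\d{3})*,\d{2}(?=\s*(?:\r\n?|\n)OUTRAS INFORMAÇÕES\s*(?:\r\n?|\n))"
)

# Hypothetical sample: each line ends with a trailing space.
text = "8.571.962,06 \nOUTRAS INFORMAÇÕES \nmore text \n"

strict_match = strict.search(text)
tolerant_match = tolerant.search(text)
value = tolerant_match.group(0) if tolerant_match else None
print(strict_match, value)
```

Since \s also matches line breaks, \s* can skip over blank lines as well. If only spaces and TABs should be tolerated, [ \t]* is a stricter alternative.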
Finally, some engines have the shortcut \R, which matches a line break, so the regex could also be:
^\d{1,3}(?:\.\d{3})*,\d{2}(?=\ROUTRAS INFORMAÇÕES\R)
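Not every engine has \R. Python's built-in re module, for instance, does not support it, so an explicit alternation can stand in for it there. The sketch below assumes the usual meaning of \R in engines such as PCRE and Java (a \r\n pair as one unit, or any single line-break character). The name LINE_BREAK and the sample text are mine.

```python
import re

# Stand-in for \R: "\r\n" as one unit, otherwise any single line-break character.
LINE_BREAK = r"(?:\r\n|[\n\x0b\f\r\x85\u2028\u2029])"

pattern = re.compile(
    r"^\d{1,3}(?:\.\d{3})*,\d{2}(?=" + LINE_BREAK + r"OUTRAS INFORMAÇÕES" + LINE_BREAK + r")",
    flags=re.MULTILINE,  # needed so that ^ means "start of a line"
)

# Hypothetical sample using Windows line breaks.
text = "header\r\n8.571.962,06\r\nOUTRAS INFORMAÇÕES\r\nfooter\r\n"

match = pattern.search(text)
value = match.group(0) if match else None
print(value)
```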
I tested on the Regexr website with the expression [ d. ,]*(?= W in other INFO) and managed to get only the value. I used Positive Lookahead and Character set.
– SylvioT
Please click on [Edit] and add the language/tool you are using, because each one implements regex in its own way and what works in one may not work in another. I also didn't understand "vulnerable". Anyway, wouldn't it be simpler to read the lines one by one (keeping a reference to the previous one) and, when the line is "OTHER INFORMATION", take the previous one and exit the loop? Or do you have several lines with "OTHER INFORMATION" and only want the one that has a specific value above it (in this case, a monetary value)?
– hkotsubo
Yes, I close the loop. I get the first value before the word "OTHER INFORMATION" (in this case a line above).
– user2254936