Regex: get the value from the previous line

I have the text below and would like to capture the value on the line above the words "OUTRAS INFORMAÇÕES", in this case the value 8.571.962,06.

I did it this way, but I’m finding it very vulnerable:

^(.*?)\s*(?<valor>\d+.\d+.\d+,\d+|\d+.\d+.\d+,\d+|\d+.\d+,\d+|\d+,\d+))\s*OUTRAS INFORMA.*?.*?ES

Text:

NOME: TESTE DE SILVA SAURO
CPF: 785.981.970-84
DECLARAÇÃO DE AJUSTE ANUAL
IMPOSTO SOBRE A RENDA - PESSOA FÍSICA
EXERCICIO 2018 ANO-CALENDÁRIO 2017
EVOLUÇÃO PATRIMONIAL
Bens e direitos em 31/12/2016
Bens e direitos em 31/12/2017
Dividas conus rcais em 31/12/2016
Divisas e ônus reais em 31/12/2017
100.580.873.91
100.329. 110,32
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis

I'm using the site regexr.com and the program Rad Software Regular Expression Designer.

  • I tested on the Regexr website with the expression [\d\.\,]*(?=\WOUTRAS INFO) and managed to get only the value. I used a positive lookahead and a character set.

  • Please edit the question and add the language/tool you are using, because each one implements regex in its own way and what works in one may not work in another. Also, I didn't understand "vulnerable". In any case, wouldn't it be simpler to read the lines one by one (keeping a reference to the previous one) and, when the line is "OUTRAS INFORMAÇÕES", take the previous one and break out of the loop? Or do you have several "OUTRAS INFORMAÇÕES" lines and only want the one with a specific kind of value above it (in this case, a monetary value)?

  • Yes, I break out of the loop. I take the first value before the words "OUTRAS INFORMAÇÕES" (in this case, one line above).
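
The line-by-line approach suggested in the comment above could look roughly like this in Python. This is only a sketch: the question never says which language is in use, and the function name value_before and the shortened sample text are made up for illustration.

```python
def value_before(lines, marker="OUTRAS INFORMAÇÕES"):
    """Return the line right before the first line equal to `marker`, or None."""
    previous = None
    for line in lines:
        current = line.strip()  # ignore leading/trailing spaces
        if current == marker:
            return previous     # stop at the first marker found
        previous = current
    return None                 # marker never appeared


# Shortened, hypothetical version of the text in the question
sample = """Divisas e ônus reais em 31/12/2017
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis"""

result = value_before(sample.splitlines())
print(result)  # 8.571.962,06
```

Unlike the regex answers, this does not check that the previous line is a monetary value; that check would have to be added separately if it matters.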

2 answers

2


I don't know the software you are using in detail, but since you are testing on regexr.com, I'll assume PCRE features are available. (There are several different regex "flavors", with a lot of variation between languages and tools, so below I give a more general solution that should work for your case.)


I don't understand why you find your regex "vulnerable". Maybe it is not the most efficient, since it has several alternations (the | character, which makes the engine test each alternative until one matches) and many of them are somewhat redundant (you test several different combinations of numbers separated by dots and commas).

The number you want seems to be a monetary value, always using a dot as the thousands separator and a comma before the cents. So you can use:

\d{1,3}(?:\.\d{3})*,\d{2}

You were using \d+, which means "one or more digits", so \d+,\d+ accepts values such as 12345,12345. With more specific quantifiers it is possible to require exactly the quantities we need.

Here, \d{1,3} is "at least 1 and at most 3 digits". Next comes the part in parentheses, (?:\.\d{3}), which is "a dot followed by exactly 3 digits". Right after the parentheses there is a *, which means "zero or more occurrences". So this whole "dot followed by 3 digits" group can repeat several times (or not appear at all). This ensures that values such as 1,23, 123,12, 1.212,21 and 8.571.962,06 are all accepted.
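
As a quick check of the pattern above, here is a small Python sketch. Python and the list of samples are my assumptions, not something taken from the question. It uses re.fullmatch, so each string has to fit the pattern from start to end.

```python
import re

# Pattern from the answer: 1-3 digits, zero or more ".ddd" groups, then ",dd"
MONEY = re.compile(r"\d{1,3}(?:\.\d{3})*,\d{2}")

# Hypothetical samples: the first four are the values listed above,
# the last three are malformed on purpose.
samples = ["1,23", "123,12", "1.212,21", "8.571.962,06",
           "12345,12345", "100.580.873.91", "1.23,45"]

for s in samples:
    # fullmatch requires the whole string to fit the pattern
    print(f"{s!r}: {bool(MONEY.fullmatch(s))}")
```

The first four samples print True and the last three print False, including 12345,12345, which the original \d+,\d+ would have accepted.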

You also used \s, which does match line breaks, but also matches spaces and the TAB character (among others; the exact list varies depending on the language/tool/engine). In this case the amount you want (8.571.962,06) has no spaces, so it "works", but if you want to restrict the match to line breaks, you can use just \n.

If you want to be even more specific, you can use (?:\r\n?|\n), which accepts a lone \r (line breaks on classic Mac OS, before OS X), a \r followed by \n (Windows line breaks) or just a \n (Unix line breaks). The regex then becomes:

\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))

This returns only the value on the line before "OUTRAS INFORMAÇÕES", and only if that line ends with a monetary value in the given format (if the comma is missing, or the number of digits after it is wrong, the regex finds nothing).

Note that for the "OUTRAS INFORMAÇÕES" line I used a lookahead (indicated by (?= ... )). The idea of a lookahead is that it only checks whether something exists, but that something is not included in the match. That is why the regex returns only the value from the previous line, leaving "OUTRAS INFORMAÇÕES" itself out of the result.
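
To illustrate, here is a minimal Python sketch of the lookahead version. Python is an assumption on my part, and the text is a shortened, hypothetical copy of the one in the question.

```python
import re

# Shortened, hypothetical version of the text in the question
text = (
    "Divisas e ônus reais em 31/12/2017\n"
    "9135,456,07\n"
    "8.571.962,06\n"
    "OUTRAS INFORMAÇÕES\n"
    "Rendimentos isentos e não tributáveis\n"
)

pattern = re.compile(
    r"\d{1,3}(?:\.\d{3})*,\d{2}"                       # the monetary value
    r"(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"  # lookahead: checked, not consumed
)

match = pattern.search(text)
value = match.group(0) if match else None
print(value)  # 8.571.962,06

# The same pattern also copes with Windows-style line endings
match_crlf = pattern.search(text.replace("\n", "\r\n"))
print(match_crlf.group(0) if match_crlf else None)  # 8.571.962,06
```

Because the marker line sits inside the lookahead, group(0) holds only the value and no extra capture group is needed.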


If the previous line has any other text before the monetary value, the regex above ignores it and takes only the value.

If you want to be even more specific and take the value only when it is the only thing on the line, you can put ^ at the beginning of the regex, with one caveat: ^ usually means "beginning of the string", but several languages and tools have an option that changes its meaning to "beginning of the line". On regexr.com this option is called "multiline" (enable it under the "flags" button in the upper right corner). Each tool has its own way of configuring this, so check how it is done in the one you are using. The regex is almost the same as the previous one, only with a ^ at the start:

-- Only works with the "multiline" option enabled, since ^ then means "beginning of the line"
^\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))

This way the regex only finds a match if the line contains nothing but the monetary value.
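
In Python (an assumption, since the question does not name a language) the "multiline" option is the re.MULTILINE flag. The sketch below uses two made-up texts to show the effect of ^ and of the flag.

```python
import re

VALUE_THEN_MARKER = (
    r"\d{1,3}(?:\.\d{3})*,\d{2}"
    r"(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"
)

# Hypothetical text 1: a label comes before the value on the same line
labelled = "Total: 8.571.962,06\nOUTRAS INFORMAÇÕES\n"
# Hypothetical text 2: the value is alone on the second line
alone = "Cabeçalho\n8.571.962,06\nOUTRAS INFORMAÇÕES\n"

# Without ^ the value is found even when a label precedes it
unanchored = re.findall(VALUE_THEN_MARKER, labelled)
# With ^ and MULTILINE the value must start its line
anchored_labelled = re.findall("^" + VALUE_THEN_MARKER, labelled, flags=re.MULTILINE)
anchored_alone = re.findall("^" + VALUE_THEN_MARKER, alone, flags=re.MULTILINE)
# With ^ but no flag, ^ only matches at the very start of the string
no_flag = re.findall("^" + VALUE_THEN_MARKER, alone)

print(unanchored)         # ['8.571.962,06']
print(anchored_labelled)  # []
print(anchored_alone)     # ['8.571.962,06']
print(no_flag)            # []
```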


I noticed that in your text there is a space at the end of each line. In that case, just add \s* (zero or more whitespace characters) to the regex, both after the monetary value and after "INFORMAÇÕES":

\d{1,3}(?:\.\d{3})*,\d{2}(?=\s*(?:\r\n?|\n)OUTRAS INFORMAÇÕES\s*(?:\r\n?|\n))
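
A small Python sketch of the difference, assuming Python's re module and a made-up text in which every line ends with a space:

```python
import re

# Hypothetical text: note the space before each line break
text = (
    "8.571.962,06 \n"
    "OUTRAS INFORMAÇÕES \n"
    "Rendimentos isentos e não tributáveis \n"
)

strict = r"\d{1,3}(?:\.\d{3})*,\d{2}(?=(?:\r\n?|\n)OUTRAS INFORMAÇÕES(?:\r\n?|\n))"
tolerant = r"\d{1,3}(?:\.\d{3})*,\d{2}(?=\s*(?:\r\n?|\n)OUTRAS INFORMAÇÕES\s*(?:\r\n?|\n))"

strict_hits = re.findall(strict, text)      # the trailing space breaks the match
tolerant_hits = re.findall(tolerant, text)  # \s* absorbs the trailing space

print(strict_hits)    # []
print(tolerant_hits)  # ['8.571.962,06']
```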

Finally, some engines have the shortcut \R, which matches a line break, so the regex could also be:

^\d{1,3}(?:\.\d{3})*,\d{2}(?=\ROUTRAS INFORMAÇÕES\R)

  • How would it look to get the value 3 lines above? 100.329.110,32

  • @user2254936 Just add (?:(?:\r\n?|\n).*){2} to skip the 2 lines before "OUTRAS INFORMAÇÕES": https://regexr.com/4qaj1 - just one tip: if you have a different question, ideally post it as a new question (after searching to check that it doesn't already exist on the site, of course), because then it is visible to everyone on the main page and you have a better chance of getting an answer. With a comment here, fewer people see it (probably just me), and lately I haven't had much time to dedicate to the site, i.e. you are less likely to get a quick response...
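
A possible Python version of the suggestion in the comment above. This is a sketch under assumptions: the helper name value_above is hypothetical, and the sample is a cleaned-up text where every value is well formed and alone on its line, unlike the OCR-damaged lines in the question.

```python
import re

# Hypothetical cleaned-up sample: each value sits alone on its own line
text = (
    "100.580.873,91\n"
    "100.329.110,32\n"
    "9.135.456,07\n"
    "8.571.962,06\n"
    "OUTRAS INFORMAÇÕES\n"
    "Rendimentos isentos e não tributáveis\n"
)

NEWLINE = r"(?:\r\n?|\n)"
MONEY = r"\d{1,3}(?:\.\d{3})*,\d{2}"


def value_above(text, lines_between):
    """Find the value that has `lines_between` full lines between it and the marker."""
    pattern = (
        "^" + MONEY
        + "(?=(?:" + NEWLINE + ".*){" + str(lines_between) + "}"  # lines to skip
        + NEWLINE + "OUTRAS INFORMAÇÕES" + NEWLINE + ")"
    )
    return re.findall(pattern, text, flags=re.MULTILINE)


print(value_above(text, 2))  # ['100.329.110,32']
print(value_above(text, 0))  # ['8.571.962,06']
```

Passing 0 reproduces the original case (the line immediately above the marker), and 2 gives the value three lines above it.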

0

I tested it on regexr and this seemed to work: (.+)\nOUTRAS INFORMAÇÕES

In Python:

import re

# Sample text from the question (named `text` to avoid shadowing the built-in str)
text = """
NOME: TESTE DE SILVA SAURO
CPF: 785.981.970-84
DECLARAÇÃO DE AJUSTE ANUAL
IMPOSTO SOBRE A RENDA - PESSOA FÍSICA
EXERCICIO 2018 ANO-CALENDÁRIO 2017
EVOLUÇÃO PATRIMONIAL
Bens e direitos em 31/12/2016
Bens e direitos em 31/12/2017
Dividas conus rcais em 31/12/2016
Divisas e ônus reais em 31/12/2017
100.580.873.91
100.329. 110,32
9135,456,07
8.571.962,06
OUTRAS INFORMAÇÕES
Rendimentos isentos e não tributáveis 
"""

# .+ does not cross line breaks, so the group captures the whole line before the marker
program = re.compile(r'(.+)\nOUTRAS INFORMAÇÕES')
print(program.findall(text))  # ['8.571.962,06']
