Get value 3 lines after given word

I have the OCR output below and I need to extract the value 254.878,00, but the regex I wrote picks up the value 8.571.962,06 instead. Anchoring on the text OUTRAS INFORMAÇÕES would be ideal, because it is printed in bold and always comes out of the OCR extraction with good quality.

9135,456,07  
8.571.962,06  
OUTRAS INFORMAÇÕES  
Rendimentos isentos e não tributáveis  
0.00  
254.878,00  
0,00  
Rendimentos sujeitos a tributação exclusiva/definitiva  
Rendimentos tributáveis - imposto con exigibilidade suspensa

REGEX:

(?<valor>\d{1,3}(?:\.\d{3})*,\d{2})(?=\s*(?:\r\n?|\n)OUTRAS INFORMA.*?.*?ES\s*(?:\r\n?|\n))

1 answer

In your regex you placed OUTRAS INFORMAÇÕES after the number, but the number you want comes after that text, so you must invert the order and put OUTRAS INFORMAÇÕES first:

OUTRAS INFORMA..ES.*(?:\r\n?|\n)(?:.+(?:\r\n?|\n)){2}(\d{1,3}(?:\.\d{3}),\d{2})

After OUTRAS INFORMA..ES, I put .*(?:\r\n?|\n) (zero or more characters, then the line break).

Then we have (?:.+(?:\r\n?|\n)){2}:

  • .+ is "one or more characters". Since by default the dot does not match line breaks, this guarantees that it only goes to the end of the line (if you want to be more explicit, you can replace it with [^\n\r]+, meaning anything that is neither \n nor \r).
  • then I match the line break: a \r optionally followed by a \n, or a \n alone (which covers the line breaks of Windows, classic Mac OS and Unix)
  • all of this repeats twice, ensuring that two lines are skipped after OUTRAS INFORMAÇÕES (the quantifier {2} says that the sequence "one or more characters + line break" repeats exactly twice, so here it sets the number of lines to skip)

Then I match the number. Since it is in parentheses, it forms a capture group, and it will be the first capture group (the other parentheses start with (?: and therefore do not form capture groups).
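
As a quick check, here is a minimal Python sketch that applies the regex above to the OCR text from the question. The variable names (ocr_text, pattern, value) are my own, and it assumes the accented letters arrive as single characters, so that .. covers ÇÕ.

```python
import re

# OCR text from the question (trailing spaces kept on purpose).
ocr_text = (
    "9135,456,07  \n"
    "8.571.962,06  \n"
    "OUTRAS INFORMAÇÕES  \n"
    "Rendimentos isentos e não tributáveis  \n"
    "0.00  \n"
    "254.878,00  \n"
    "0,00  \n"
)

pattern = re.compile(
    r"OUTRAS INFORMA..ES.*(?:\r\n?|\n)"   # anchor line, up to and including its line break
    r"(?:.+(?:\r\n?|\n)){2}"              # skip the next two lines
    r"(\d{1,3}(?:\.\d{3}),\d{2})"         # capture group 1: the value
)

match = pattern.search(ocr_text)
value = match.group(1) if match else None
print(value)
```

The value is read from group 1 because the other groups are non-capturing.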


Some engines support the shortcut \R, which matches a line break (either \n or \r alone, or the sequence \r\n, among others; the complete list varies by language). So the regex could also be:

OUTRAS INFORMA..ES.*\R(?:.+\R){2}(\d{1,3}(?:\.\d{3}),\d{2})
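
Python's built-in re module is one of the engines without \R, so there the explicit alternation is still needed. Below is a small sketch of one way to keep it readable: the line break lives in a single fragment, and the number of skipped lines is a parameter. The helper name value_after, the constant NL and the sample strings are hypothetical, and the sketch uses the more explicit [^\r\n] instead of the dot so that a lone \r is also treated as a line break.

```python
import re

NL = r"(?:\r\n?|\n)"  # stand-in for \R: \r\n, \r alone, or \n alone

def value_after(text, skip=2):
    """Return the value found `skip` lines after the anchor line, or None."""
    pattern = (
        r"OUTRAS INFORMA..ES[^\r\n]*" + NL          # anchor line and its line break
        + r"(?:[^\r\n]+" + NL + r"){" + str(skip) + r"}"  # lines to skip
        + r"(\d{1,3}(?:\.\d{3}),\d{2})"             # capture group 1: the value
    )
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Hypothetical sample with Windows line breaks.
sample = "OUTRAS INFORMAÇÕES\r\nlinha um\r\nlinha dois\r\n254.878,00\r\n"
print(value_after(sample))
```

Changing skip is the same as changing the number inside {2} in the regex above.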
  • Excellent, it worked here. Thank you very much

  • If I wanted to get the value 0,00 instead, how would I skip one more line? @hkotsubo

  • @user2254936 In the excerpt (?:.+(?:\r\n?|\n)){2}, the number 2 indicates that two lines will be skipped. So just change it to the number of lines you want to skip

  • It did not work :( I am testing on https://regex101.com/

  • @user2254936 Ah, that is because the part of the regex that matches the number requires \.\d{3} to be present. Just make it optional, like this: (\d{1,3}(?:\.\d{3})*,\d{2})

  • Now yes, thank you very much
