Regex for numbers of different sizes

I need to read a .png image containing a table of financial transactions, like the one below:

[image: table of financial transactions]

From this table I need to extract the columns Saldo Bruto (gross balance), Disponibilidades (cash available), Aplicacao/Resgate (investment/redemption) and Saldo Bruto Final (final gross balance). That is, I need to identify the inflows and outflows of each account. I am using the tesseract package to run the OCR and isolate only the investment funds.

library(tesseract)
library(stringr)
library(dplyr)
library(tabulizer)
library(tidyverse)
library(purrr)

caminho <- "...\\04 - APR - Processo SEI\\2020\\10 - Outubro\\15-10"

# keep only the file(s) whose name contains "FP", with the full path
arquivo <- list.files(caminho, pattern = "FP", full.names = TRUE)

txt <- ocr(arquivo)

cnpj <- "\\d{2}[.]\\d{3}[.]\\d{3}[/]\\d{4}[-]\\d{2}"

empilhadoi <- txt %>%
  map(as_tibble) %>%
  # stack the rows
  bind_rows()

empilhado <- as.data.frame(str_split(txt, fixed("\n"))[[1]])

empilhado <- as.data.frame(empilhado[-1,])

My challenge is to separate the values of the columns Saldo Bruto, Disponibilidades, Aplicacao/Resgate and Saldo Bruto Final. I wrote the following regex: str_extract_all(empilhado$`empilhado[-1, ]`, "-?R\\$\\d{0,3}[.?,?]\\d{0,3}[.?,?]\\d{0,3}"), but it only works for numbers in the format R$999.999.99. However, my numbers can range from 0 to 999.999.99, and they can also be negative. For example, this table contains -R$23,991.56, which is not detected. I am using RStudio with R.

1 answer

Just make the thousands part optional:

str_extract_all(empilhado$`empilhado[-1, ]`, "-?R\\$\\d{1,3}([.,]\\d{3})*[.,]\\d{2}")

That is, the part ([.,]\\d{3}) (a dot or comma followed by exactly 3 digits) can repeat zero or more times; the repetition is indicated by the quantifier *. This way the regex also matches values below 1000. I kept {3} so that there are exactly 3 digits, because {0,3} accepts between zero and 3 digits, and as I understand it there are always 3 digits between thousands separators.
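As a quick check of the pattern above, here is a minimal sketch run on made-up lines (the vector `linhas` and its contents are hypothetical; the real input would be the OCR output):

```r
library(stringr)

# Hypothetical sample lines standing in for the OCR text
linhas <- c(
  "FUNDO A R$999.999,99 R$0,00 -R$23,991.56",
  "FUNDO B R$5,00 R$1.234.567,89"
)

# 1 to 3 digits, then zero or more "separator + 3 digits" groups,
# then a separator and exactly 2 digits for the cents
padrao <- "-?R\\$\\d{1,3}([.,]\\d{3})*[.,]\\d{2}"

# One character vector of matches per input line
valores <- str_extract_all(linhas, padrao)
print(valores)
```

Values below 1000 such as R$0,00 and R$5,00 are matched because the thousands group is allowed to occur zero times.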

And since the greatest value is 999.999.999, you could replace the * with {0,2} (since the sequence "dot or comma + 3 digits" can repeat at most 2 times).
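One caveat with the upper bound, sketched below on hypothetical inputs: on its own, {0,2} does not reject a number that is too long, it just matches a shorter prefix of it. Appending \\b at the end (an assumption for this sketch) turns that partial match into no match.

```r
library(stringr)

ilimitado <- "-?R\\$\\d{1,3}([.,]\\d{3})*[.,]\\d{2}"
limitado  <- "-?R\\$\\d{1,3}([.,]\\d{3}){0,2}[.,]\\d{2}"

# Hypothetical inputs: the second one has three thousands groups
x <- c("R$999.999.999,99", "R$1.234.567.890,12")

sem_limite   <- str_extract(x, ilimitado)
com_limite   <- str_extract(x, limitado)
# A trailing word boundary forbids stopping in the middle of a run of digits
com_fronteira <- str_extract(x, paste0(limitado, "\\b"))

print(sem_limite)    # both values matched in full
print(com_limite)    # second value is cut short at "R$1.234.567.89"
print(com_fronteira) # second value is NA
```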

And for the digits right after the "R$" I changed to \\d{1,3} (between 1 and 3 digits), because I think there should be at least one.

For the cents I switched to \\d{2}, because as I understand it they always have exactly 2 digits.

I replaced the "comma or dot" part with just [.,], because placing the question mark inside the brackets makes the regex also match the literal character ?. If the idea was to make the dot or comma optional, just switch to [.,]?.
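The difference between the three forms can be seen in a small sketch (the test strings are made up for illustration):

```r
library(stringr)

# Hypothetical test strings: dot, comma, literal "?", and no separator
x <- c("1.5", "1,5", "1?5", "15")

# Inside brackets, "?" is a literal character, not a quantifier
com_interrogacao <- str_detect(x, "^1[.?,?]5$")
# Exactly one dot or comma
so_separador     <- str_detect(x, "^1[.,]5$")
# Dot or comma is optional: the quantifier goes outside the brackets
opcional         <- str_detect(x, "^1[.,]?5$")

print(com_interrogacao) # TRUE TRUE TRUE FALSE
print(so_separador)     # TRUE TRUE FALSE FALSE
print(opcional)         # TRUE TRUE FALSE TRUE
```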


Another detail is that this expression accepts values such as "R$000,000.00". If you do not want to accept such cases, you can switch to:

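str_extract_all(x, "-?\\bR\\$(0|[1-9]\\d{0,2}([.,]\\d{3}){0,2})[.,]\\d{2}\\b")

I use alternation (the | character, meaning "or"). Thus, before the cents there can be either a single zero, or a digit from 1 to 9 ([1-9]) followed by zero to 2 digits (so the first group ranges from 1 to 999), optionally followed by up to two thousands groups.

I also added \\b (word boundary) to ensure that there are no other alphanumeric characters right before the R or right after the cents (i.e., to ensure that the values are "isolated" in the text). The first \\b comes after the optional minus sign: there is no word boundary between a space and the -, so with \\b in front of -? the sign would be left out of the match. Read here to better understand.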
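A point worth checking with \\b is where it sits relative to the optional minus sign. The sketch below, on a made-up string, compares the two placements; the names `antes` and `depois` are just labels for this example.

```r
library(stringr)

# Hypothetical text with a negative value, a zero, and two values with leading zeros
texto <- "Saldo -R$23,991.56 e R$0,00 mas R$000,000.00 e R$012,00"

# \\b before the optional minus sign
antes  <- "\\b-?R\\$(0|[1-9]\\d{0,2}([.,]\\d{3}){0,2})[.,]\\d{2}\\b"
# \\b after the optional minus sign
depois <- "-?\\bR\\$(0|[1-9]\\d{0,2}([.,]\\d{3}){0,2})[.,]\\d{2}\\b"

res_antes  <- str_extract_all(texto, antes)[[1]]
res_depois <- str_extract_all(texto, depois)[[1]]

print(res_antes)  # "R$23,991.56" "R$0,00"  (minus sign lost)
print(res_depois) # "-R$23,991.56" "R$0,00"
```

In both cases the values with leading zeros are skipped. When the - follows a space, only the second placement keeps the sign, because a word boundary needs a word character on one side and neither the space nor the - is one.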

  • Thanks, I'll study your regex to see if I understand what you did!

  • @Flaviosilva I changed the cents part and also added another option to restrict the values further
