I need to read a .png image containing a table of financial transactions, like the one below:
From this table I need to extract the columns Saldo Bruto (gross balance), Disponibilidades (availabilities), Aplicacao/Resgate (application/redemption) and Saldo Bruto Final (final gross balance). That is, I need to identify the inflows and outflows of each account. I am using the tesseract package to perform the OCR and then keep only the investment funds.
library(tesseract)
library(stringr)
library(dplyr)
library(tabulizer)
library(tidyverse)
library(purrr)
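caminho <- "...\\04 - APR - Processo SEI\\2020\\10 - Outubro\\15-10"
# Full paths of the files in the folder whose name contains "FP"
arquivo <- list.files(caminho, pattern = "FP", full.names = TRUE)
# Run the OCR; the result is one character string per image
txt <- ocr(arquivo)
# Pattern for a CNPJ, e.g. 12.345.678/0001-90
cnpj <- "\\d{2}[.]\\d{3}[.]\\d{3}[/]\\d{4}[-]\\d{2}"
# Stack the OCR output into a single tibble
empilhadoi <- txt %>% map(as_tibble) %>%
  bind_rows()
# Split the text into one row per line, then drop the first line
empilhado <- as.data.frame(str_split(txt, fixed("\n"))[[1]])
empilhado <- as.data.frame(empilhado[-1,])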
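To make the "keep only the investment funds" step concrete, here is a minimal sketch of how the `cnpj` pattern could be used to filter the OCR lines. It assumes that only fund rows carry a CNPJ, which may not hold for the real table. The text in `txt_exemplo` is made up for illustration (it is not the actual OCR output), and the names `txt_exemplo`, `linhas` and `fundos` are hypothetical.

```r
library(stringr)

# Hypothetical OCR output; the real text would come from ocr(arquivo)
txt_exemplo <- paste(
  "Fundo Saldo Bruto Disponibilidades",
  "FUNDO A 12.345.678/0001-90 R$1,000.00 R$0.00",
  "Conta corrente R$50.00 R$0.00",
  "FUNDO B 98.765.432/0001-10 R$2,500.75 -R$23,991.56",
  sep = "\n"
)

cnpj <- "\\d{2}[.]\\d{3}[.]\\d{3}[/]\\d{4}[-]\\d{2}"

# One element per line of OCR text
linhas <- str_split(txt_exemplo, fixed("\n"))[[1]]

# Assumption: a line belongs to an investment fund if it contains a CNPJ
fundos <- linhas[str_detect(linhas, cnpj)]
```

With this sample, `fundos` keeps the two "FUNDO" lines and drops the header and the current-account line.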
My challenge is to separate the values of the columns Saldo Bruto, Disponibilidades, Aplicacao/Resgate and Saldo Bruto Final. I tried the following regex:
str_extract_all(empilhado$`empilhado[-1, ]`, "-?R\\$\\d{0,3}[.?,?]\\d{0,3}[.?,?]\\d{0,3}")
but it only works for numbers in the format R$999.999.99. My values can range from 0 to 999.999.99 and can also be negative. For example, this table contains -R$23,991.56, which is not detected. I am using R in RStudio.
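One possible direction, as a sketch rather than a tested solution for the real image: make the thousands groups and the cents optional so that values of any length match, and allow the minus sign before or after the `R$`. The sample line, the names `padrao`, `valores` and `para_numero`, and the assumption that commas separate thousands and a dot separates the cents (as in -R$23,991.56) are mine, not taken from the actual table.

```r
library(stringr)

# Optional "-" (with an optional space), "R$", optional space and "-",
# 1 to 3 digits, any number of 3-digit groups, and optional 2-digit cents
padrao <- "(?:-\\s?)?R\\$\\s?-?\\d{1,3}(?:[.,]\\d{3})*(?:[.,]\\d{2})?"

# Hypothetical OCR line, for illustration only
linha <- "FUNDO B 98.765.432/0001-10 R$2,500.75 -R$23,991.56"

valores <- str_extract_all(linha, padrao)[[1]]
# "R$2,500.75" "-R$23,991.56"

# Assumption: "," separates thousands and "." separates the cents
para_numero <- function(x) {
  as.numeric(str_remove_all(x, "[R\\$\\s,]"))
}

para_numero(valores)
# 2500.75 and -23991.56
```

If the OCR output uses "." for thousands and "," for the cents instead, the cleanup in `para_numero` would need to be swapped accordingly.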
Thanks, I'll study your regex to see if I understand what you did!
– Flavio Silva
@Flaviosilva I changed how the cents are handled and also added another option to restrict the values more tightly
– hkotsubo