Using the tabulizer package, I extracted the information from just the first page as a test:
library(tabulizer)
library(dplyr)
library(stringi)
url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'
d <- extract_tables(url, encoding = "UTF-8", pages = 1)
Then I turned the list into a data frame, converted every column to character, named the variables, and removed the first row (which actually holds the variable names):
d <- as.data.frame(d)

# Convert every column to character
d <- d %>%
  mutate_all(funs(as.character(.)))

# Use the first row as the variable names, then drop it
names(d) <- d[1, ]
d <- d[-1, ]
Then the values need to be cleaned up: in the PDF, . is the thousands separator and , is the decimal separator, so both have to be dealt with before converting the columns to numeric:
# Replace the "-" placeholder cells with NA
d <- d %>%
  mutate_all(funs(gsub("-", NA, .)))

# Strip the thousands separator, turn the decimal comma into a
# period, and convert the salary columns to numeric
d <- d %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(gsub("\\.", "", .))) %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(as.numeric(gsub(",", "\\.", .))))
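If you want a quick sanity check that the conversion worked (not part of the original pipeline), str() shows the resulting column types:

# The salary columns should now be numeric
str(d[, c("VENCIMENTO", "TOTAL LÍQUIDO")])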
If you omit the pages argument of extract_tables, it will extract every page of the PDF into a single list. To merge the pages into one table, I think do.call(rbind, d) will do the job; see the sketch below.
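A minimal sketch of that, assuming every page yields a table with the same number of columns (tabs is just a name introduced here; the header row will then repeat on each page and needs the same treatment as above):

# Omitting `pages` extracts a table from every page of the PDF
tabs <- extract_tables(url, encoding = "UTF-8")

# Stack the per-page matrices and rebuild the data frame
d <- do.call(rbind, tabs)
d <- as.data.frame(d, stringsAsFactors = FALSE)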
Can you add the code you used to try to extract the information?
– Willian Vieira