Reading a PDF in R


  • Could you add the code you used to try to extract the information?

1 answer



Using the tabulizer package, I extracted the tables from the first page only, as a test:

library(tabulizer)
library(dplyr)
library(stringi)
url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'
d <- extract_tables(url, encoding = "UTF-8", pages = 1)

Then I converted the list to a data frame, converted every column to character, used the first row as the column names, and removed that row (it actually holds the variable names):

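# extract_tables() returns a list; convert it to a data frame
d <- as.data.frame(d)
# Convert every column to character (funs() is deprecated; across() needs dplyr >= 1.0)
d <- d %>% 
  mutate(across(everything(), as.character))
# Use the first row as the column names, then drop it
names(d) <- unlist(d[1, ])
d <- d[-1, ]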

Next, the values need cleaning: remove the thousands separator (.), replace the decimal separator, which in the PDF is a comma, with a period, and convert the columns to numeric:

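# gsub() with an NA replacement returns NA for every element that matches,
# so any cell containing "-" becomes NA
d <- d %>% 
  mutate(across(everything(), ~ gsub("-", NA, .x)))
# Drop the thousands separator ".", then turn the decimal "," into "." and convert
d <- d %>% 
  mutate(across(VENCIMENTO:`TOTAL LÍQUIDO`, ~ gsub("\\.", "", .x))) %>% 
  mutate(across(VENCIMENTO:`TOTAL LÍQUIDO`, ~ as.numeric(gsub(",", "\\.", .x))))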
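
To see the cleaning rule in isolation, here is a minimal base R sketch on a toy vector. The helper name parse_br_number is hypothetical (it is not part of tabulizer or dplyr), and the sample strings are invented for illustration. Unlike the pipeline above, it only treats a cell as empty when the cell is exactly "-", which may or may not match what the PDF contains.

```r
# Hypothetical helper: parse numbers written with "." as the thousands
# separator and "," as the decimal separator (e.g. "1.234,56").
parse_br_number <- function(x) {
  x <- trimws(x)
  x[x %in% "-"] <- NA_character_         # a lone dash marks an empty cell
  x <- gsub(".", "", x, fixed = TRUE)    # drop thousands separators
  x <- gsub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(x)
}

# Invented sample values, not taken from the PDF
values <- parse_br_number(c("1.234,56", "-", "789,10", "12.345.678,90"))
print(values)
```

The order matters: the periods have to be removed before the comma is replaced, otherwise the new decimal point would be stripped along with the thousands separators.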

If you omit the pages argument of extract_tables, it extracts every page of the PDF and returns the tables in a single list. To merge them into a single table, I think do.call(rbind, d) will work.
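
A rough sketch of that merge, using two small invented matrices in place of the real extract_tables() output so it runs without the PDF. It assumes every page repeats the header in its first row and has the same columns; if a page comes back with a different number of columns, rbind will fail and that page needs to be handled separately. The name to_df is hypothetical.

```r
# Invented stand-in for the list returned by extract_tables() without `pages`:
# one character matrix per page, header in the first row (an assumption).
pages <- list(
  matrix(c("NOME", "VENCIMENTO", "A", "1.234,56"), ncol = 2, byrow = TRUE),
  matrix(c("NOME", "VENCIMENTO", "B", "789,10"),   ncol = 2, byrow = TRUE)
)

# Hypothetical helper: promote the first row to column names and drop it
to_df <- function(m) {
  df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- m[1, ]
  df
}

# Convert each page, then stack them into one table
all_pages <- do.call(rbind, lapply(pages, to_df))
print(all_pages)
```

Converting each page before binding keeps the repeated header rows out of the combined table.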

  • When I tried to install tabulizer I got the following error: > devtools::install_github("leeper/tabulizer") Installation failed: Failed to connect to raw.githubusercontent.com port 443: Connection refused. I'm on a corporate network with a firewall and proxy, and I'm not sure I could ask for an exception. What bothers me is that even when I download the .zip and ask RStudio to install it, I get no error (or success) message, and I cannot load the package: R says it does not exist.

  • @Flaviosilva a while ago I had a similar problem and, searching the internet, I found this solution: set_config(config(ssl_verifypeer = 0L)). See if it works for you.

  • The solution worked perfectly. I gave up trying to do it at work and did it on my home PC!
