Using the tabulizer package, I extracted the information from just the first page as a test:
library(tabulizer)
library(dplyr)
library(stringi)
url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'
d <- extract_tables(url, encoding = "UTF-8", pages = 1)
Then I turned the list into a data frame, converted every column to character, named the variables, and removed the first row (which actually holds the variable names):
d <- as.data.frame(d)

# Convert every column to character
d <- d %>%
  mutate_all(funs(as.character(.)))

# Use the first row as the variable names, then drop it
names(d) <- d[1, ]
d <- d[-1, ]
Then the values need to be cleaned up: in the PDF, . is the thousands separator and , is the decimal separator, so both have to be dealt with before converting the columns to numeric:
# Replace the "-" placeholder cells with NA
d <- d %>%
  mutate_all(funs(gsub("-", NA, .)))

# Strip the thousands separator, turn the decimal comma into a
# period, and convert the salary columns to numeric
d <- d %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(gsub("\\.", "", .))) %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(as.numeric(gsub(",", "\\.", .))))
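If you want a quick sanity check that the conversion worked (not part of the original pipeline), str() shows the resulting column types:

# The salary columns should now be numeric
str(d[, c("VENCIMENTO", "TOTAL LÍQUIDO")])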
If you omit the pages argument of extract_tables, it will extract every page of the PDF into a single list. To merge the pages into one table, I think do.call(rbind, d) will do the job; see the sketch below.
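A minimal sketch of that, assuming every page yields a table with the same number of columns (tabs is just a name introduced here; the header row will then repeat on each page and needs the same treatment as above):

# Omitting `pages` extracts a table from every page of the PDF
tabs <- extract_tables(url, encoding = "UTF-8")

# Stack the per-page matrices and rebuild the data frame
d <- do.call(rbind, tabs)
d <- as.data.frame(d, stringsAsFactors = FALSE)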
Can you add the code you used to try to extract the information?
– Willian Vieira