Extracting tables that span more than one page from PDF files in R

Hello,

I have a PDF containing a table, and I want to extract that table so I can analyze it in R. I am using tabulizer::extract_tables().

Because the table spans more than one page, the function returns a list (out) containing 12 elements. In theory, each element is a piece of a table. The table I want is in elements out[[1]] through out[[4]].

The problem is that the table does not repeat its header on every page and, I imagine, the document header also gets in the way, so the function cannot work out where the columns are. Element out[[1]] has 4 columns, out[[2]] and out[[3]] have 2 columns, and out[[4]] has only one column. Is there any way to at least get all four elements back with four columns?

Code:

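library(tabulizer)

# PDF whose table spans several pages
arquivo <- "1236_Pombos_PE.pdf"
# Returns a list with one data frame per table fragment that was detected
out <- extract_tables(arquivo, output = "data.frame", encoding = "UTF-8")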
  • Have you tried controlling it with the pages argument? It may improve how the tables are parsed.

  • Hi Tomás, I would like to use this function in a loop with 3 thousand files, so I was looking for a more automated solution.

  • The problem seems to be the quality of the PDF, and that is where things get complicated. You can check how many pages the document has and then use that information to parse each page separately (I don’t know whether this will improve the quality).

  • But the problem is how to turn the 2-column tables into 4-column tables. If you know how, apply a function(x) if (ncol(x) == 4) return(x) else transformacao_necessaria(x) to each element. After that, it is just a matter of stacking the results.

  • You’re right about the quality of the PDF. I’m going through some documents by hand to see whether I can find a pattern in the errors, but even within the same PDF each element of my list comes out differently; there doesn’t seem to be much of a pattern =/

  • Too bad. PDFs are really tricky.

  • A rant: I still see institutions claim to support "open data" just because they publish PDFs...
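
The two suggestions in the comments (parse each page separately, then bring every piece to four columns and stack) could be sketched roughly as follows. This is only a sketch under assumptions: extract_by_page(), normalize_piece() and stack_pieces() are hypothetical helper names, and the padding step merely stands in for transformacao_necessaria(). It fills the missing columns with NA so the pieces can be stacked; it does not split cells that were merged during extraction, which depends on the layout of each PDF. The toy list imitates the 4/2/1 column shapes described in the question and is not real data.

```r
# Hypothetical helper: extract each page on its own, so that every list
# element maps to exactly one page. Defined here but not run, because it
# needs the tabulizer package (and Java) plus the actual PDF file.
extract_by_page <- function(path) {
  n_pages <- tabulizer::get_n_pages(path)
  lapply(seq_len(n_pages), function(p) {
    tabulizer::extract_tables(path, pages = p, output = "data.frame",
                              encoding = "UTF-8")
  })
}

# Stand-in for transformacao_necessaria(): bring one piece to the expected
# width by padding with NA columns. This only equalizes the shapes; it does
# not recover the original cell boundaries.
normalize_piece <- function(x, col_names) {
  width <- length(col_names)
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  if (ncol(x) > width) stop("piece has more columns than expected")
  # Store everything as character so pieces with mixed types stack cleanly
  x[] <- lapply(x, as.character)
  for (i in seq_len(width - ncol(x))) {
    x[[paste0("pad", i)]] <- rep(NA_character_, nrow(x))
  }
  names(x) <- col_names
  x
}

# Normalize every piece, then stack them into a single data frame
stack_pieces <- function(pieces, col_names) {
  do.call(rbind, lapply(pieces, normalize_piece, col_names = col_names))
}

# Toy pieces with 4, 2 and 1 columns (made-up values)
toy <- list(
  data.frame(a = c("x1", "x2"), b = c("1", "2"), c = c("3", "4"), d = c("5", "6")),
  data.frame(V1 = "x3 7", V2 = "8 9"),
  data.frame(V1 = "x4 10 11 12")
)
stacked <- stack_pieces(toy, c("item", "v1", "v2", "v3"))
print(stacked)
```

With the object from the question, the call would presumably look like stack_pieces(out[1:4], names(out[[1]])), assuming out[[1]] carries the correct header. Rows that come from the narrower pieces would still hold several values in one cell, so a PDF-specific splitting step would be needed before the result is usable.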

No answers
