Extracting tables that span more than one page from PDF files in R

Question

Asked 6 years, 5 months ago

Viewed 122 times

2

Hello,

I have a PDF containing a table, and I want to extract this table so I can analyze it in R. I am using tabulizer::extract_tables().

Because the table spans more than one page, the function returns a list (out) with 12 elements. In theory, each element is one part of the table. The table I want is in elements out[[1]] through out[[4]].

The problem is that the table header is not repeated on every page and, I imagine, the document header also gets in the way, so the function cannot delimit the columns. Element out[[1]] has 4 columns, out[[2]] and out[[3]] have 2 columns, and out[[4]] has only one column. Is there any way to at least get these four elements back with four columns each?

Code:

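library(tabulizer)

# Path to the PDF that holds the multi-page table
arquivo <- "1236_Pombos_PE.pdf"
# Returns a list with one data frame per table fragment detected
out <- extract_tables(arquivo, output = "data.frame", encoding = "UTF-8")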

Have you tried controlling this with the pages argument? It may improve the parsing of the tables.

– Tomás Barcellos

2019/02/27 at 20:05
Hi Tomás, I would like to use this function in a loop over 3,000 files, so I was looking for a more automated solution.

– Jessica Voigt

2019/02/28 at 14:32
1

The problem seems to be the quality of the PDF, and that is where things get complicated. You can check how many pages the document has and then use this information to parse each page separately (I don't know whether this will help with the quality).

– Tomás Barcellos

2019/02/28 at 16:09
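
The page-by-page idea could look roughly like the sketch below. It assumes get_n_pages() and the pages argument of extract_tables() from tabulizer. The helper name extract_by_page and its count_pages / extract arguments are hypothetical, passed in only so the logic can be tried without a PDF at hand. Whether this actually improves the column detection is not something the sketch can promise.

```r
# Hypothetical helper: extract tables one page at a time and keep track of
# which page each piece came from. `count_pages` and `extract` are supplied
# by the caller (tabulizer functions, or stand-ins for testing).
extract_by_page <- function(file, count_pages, extract) {
  n <- count_pages(file)
  pieces <- lapply(seq_len(n), function(p) {
    # A page that fails to parse yields an empty list instead of stopping,
    # which matters when looping over thousands of files
    tryCatch(extract(file, pages = p), error = function(e) list())
  })
  names(pieces) <- paste0("page_", seq_len(n))
  pieces
}

# With tabulizer installed, the call would be something like:
# out <- extract_by_page(
#   arquivo,
#   count_pages = tabulizer::get_n_pages,
#   extract = function(f, pages) tabulizer::extract_tables(
#     f, pages = pages, output = "data.frame", encoding = "UTF-8")
# )
```

Each element of the result is the list of tables found on that page, so pages whose pieces come back with the wrong number of columns can be inspected or re-parsed individually.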
1

But the problem is how to transform the 2-column tables into 4-column tables. If you know how, apply a function(x) if(ncol(x) == 4) return(x) else transformacao_necessaria(x). After that it's just a matter of stacking the pieces.

– Tomás Barcellos

2019/02/28 at 16:11
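
A minimal sketch of that "fix, then stack" step in base R. The function to_n_cols is a hypothetical stand-in for transformacao_necessaria(), and it rests on a strong assumption that may not hold for these PDFs: that the merged cells still separate the original columns with two or more spaces. The names stack_tables and V1..V4 are also made up for illustration.

```r
# Hypothetical stand-in for transformacao_necessaria(): glue each row back
# into one string, re-split it on runs of two or more spaces, and pad with NA.
# ASSUMPTION: original columns are separated by at least two spaces.
to_n_cols <- function(x, n = 4) {
  rows <- unname(apply(as.matrix(x), 1, function(r) {
    paste(r[!is.na(r)], collapse = "  ")
  }))
  parts <- strsplit(trimws(rows), " {2,}")
  # Keep the first n pieces of each row; missing pieces become NA,
  # extra pieces are dropped
  fixed <- t(vapply(parts,
                    function(p) c(p, rep(NA_character_, n))[seq_len(n)],
                    character(n)))
  as.data.frame(fixed, stringsAsFactors = FALSE)
}

# Leave n-column tables alone, repair the others, then stack everything.
stack_tables <- function(tables, n = 4) {
  fixed <- lapply(tables, function(x) {
    x[] <- lapply(x, as.character)      # same type everywhere so rbind is safe
    res <- if (ncol(x) == n) x else to_n_cols(x, n)
    names(res) <- paste0("V", seq_len(n))  # placeholder column names
    res
  })
  do.call(rbind, fixed)
}

# Usage with the list from the question would be: stack_tables(out[1:4])
```

Because cells are refilled from the left, a row with a genuinely empty column would end up shifted, so the result still needs a sanity check per file.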
You're right about the quality of the PDF. I'm going through some documents "by hand" to see whether I can standardize the type of error, but within the same PDF each item in my list comes out differently; there doesn't seem to be much of a pattern =/

– Jessica Voigt

2019/02/28 at 17:58
Too bad. PDFs are really tricky.

– Tomás Barcellos

2019/02/28 at 20:18
2

A rant: I still see institutions claim to be supporters of "open data" just because they publish PDFs...

– Ailton Andrade de Oliveira

2019/03/01 at 12:12

No answers