Hello,
I have a PDF containing a table, and I want to extract that table so I can analyze it in R. I am using tabulizer::extract_tables().
Because the table spans more than one page, the function returns a list (out) with 12 elements. In theory, each element is a part of the table. The table I want is in elements out[[1]] through out[[4]].
The problem is that, since the table does not have a header on every page and, I imagine, also because of the document header, the function cannot delimit the columns. out[[1]] has 4 columns, out[[2]] and out[[3]] have 2 columns, and out[[4]] has only one column. Is there any way I can at least get these four elements back with four columns each?
Code:
library(tabulizer)
arquivo <- "1236_Pombos_PE.pdf"
out <- extract_tables(arquivo, output = "data.frame", encoding = "UTF-8")
Have you tried controlling this with the pages argument? It may improve how the tables are parsed.
– Tomás Barcellos
Hi Tomás, I would like to use this function in a loop with 3 thousand files, so I was looking for a more automated solution.
– Jessica Voigt
The problem seems to be the quality of the PDF, and that is where things get complicated. You can check how many pages the document has and then use that information to parse each page separately (I don't know whether this will improve the quality).
– Tomás Barcellos
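A minimal sketch of the page-by-page idea, assuming the tabulizer package and its Java dependency are installed. The helper name extract_by_page is hypothetical; get_n_pages() and the pages argument come from tabulizer. Whether this improves column detection for this particular PDF is untested.

```r
# Sketch only: extract each page in its own call, so the layout of one page
# does not influence how columns are guessed on another.
# extract_by_page is a hypothetical helper name, not part of tabulizer.
extract_by_page <- function(file) {
  n <- tabulizer::get_n_pages(file)
  lapply(seq_len(n), function(p) {
    tabulizer::extract_tables(file, pages = p, output = "data.frame",
                              encoding = "UTF-8")
  })
}

# Usage (not run here): one list element per page
# por_pagina <- extract_by_page("1236_Pombos_PE.pdf")
# sapply(por_pagina, length)  # number of tables detected on each page
```

For a loop over thousands of files, wrapping the call in tryCatch() would keep one unreadable PDF from stopping the whole run.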
But the problem is how to turn the 2-column tables into 4-column tables. If you know how, apply a
function(x) if(ncol(x) == 4) return(x) else transformacao_necessaria(x)
to each element. After that, you just stack them.
– Tomás Barcellos
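A rough sketch of the "fix, then stack" step in base R. pad_columns is a hypothetical stand-in for transformacao_necessaria(): it only pads with NA columns so the pieces can be stacked, and it does not recover columns that were merged during extraction. The toy list below stands in for the real out and is made up for illustration.

```r
# Hypothetical stand-in for transformacao_necessaria(): pad a table with
# NA columns up to n_cols and give every piece the same column names.
pad_columns <- function(x, n_cols = 4) {
  x <- as.data.frame(x, stringsAsFactors = FALSE)
  if (ncol(x) > n_cols) stop("table has more than ", n_cols, " columns")
  for (j in seq_len(n_cols - ncol(x))) {
    x[[paste0("pad", j)]] <- rep(NA_character_, nrow(x))
  }
  names(x) <- paste0("V", seq_len(n_cols))
  x
}

# Normalize every piece, then stack them row-wise.
stack_tables <- function(tables, n_cols = 4) {
  do.call(rbind, lapply(tables, pad_columns, n_cols = n_cols))
}

# Toy data (invented) with 4, 2 and 1 columns, like the pieces described
out <- list(
  data.frame(a = "1", b = "2", c = "3", d = "4", stringsAsFactors = FALSE),
  data.frame(a = c("x", "y"), b = c("z", "w"), stringsAsFactors = FALSE),
  data.frame(a = "only", stringsAsFactors = FALSE)
)
tabela <- stack_tables(out)
```

The real work is still deciding how a 2-column piece maps onto the 4 true columns; that logic would replace the padding inside pad_columns.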
You’re right about the quality of the PDF. I’m going through some documents "by hand" to see whether I can find a pattern in the errors, but within the same PDF each item in my list comes out differently; there doesn’t seem to be much of a pattern =/
– Jessica Voigt
Too bad. PDFs are really tricky.
– Tomás Barcellos
A rant: I still see institutions claiming to support "open data" just because they publish PDFs...
– Ailton Andrade de Oliveira