I am scraping a site to extract .pdf files, and I need their content as organized text, because each line of text in the file spans 3 different columns.
For example, in this file you can see the 3 columns in question.
I can read the file as .txt with the following code:
library("rvest")
library("pdftools")
pdf_link <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=3&data=03/04/2017&captchafield=firistAccess"
# Start a session and access the .pdf
s <- html_session(pdf_link) %>%
  jump_to(pdf_link)
# Save the response as a .pdf file, then read its text
tmp <- tempfile(fileext = ".pdf")
writeBin(s$response$content, tmp)
doc <- pdf_text(tmp)
The problem is that each row of the resulting text contains the 3 columns separated by spaces, and each row (with its 3 columns) is terminated by a \r\n.
What I’d like to do is separate the columns so the text makes sense.
The idea I had is:
- Split the text into lines at each \r\n
- Split each line into columns based on the number of spaces (for example: treat a run of 5 consecutive spaces as a column boundary).
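A minimal sketch of those two steps in base R, to make the idea concrete. The function name `split_page` and the parameters `min_gap` and `n_cols` are hypothetical, and the 5-space threshold is only the guess from the list above, so it would need tuning against real pages. Rows are padded to a fixed number of cells so that rows containing only spaces are kept instead of dropped.

```r
# Split one page of text (as returned by pdf_text()) into rows and columns.
# `min_gap` and `n_cols` are hypothetical parameters for this sketch.
split_page <- function(page, min_gap = 5, n_cols = 3) {
  # Step 1: one element per row; "\r" is optional so "\n"-only text also works.
  rows <- strsplit(page, "\r?\n")[[1]]

  # Step 2: split each row on runs of at least `min_gap` spaces.
  gap_regex <- sprintf(" {%d,}", min_gap)
  cells <- strsplit(rows, gap_regex)

  # Pad (or truncate) every row to `n_cols` cells, so blank rows survive
  # as rows of empty strings and the result is a rectangular matrix.
  padded <- vapply(cells, function(x) {
    x <- trimws(x)
    length(x) <- n_cols      # pads with NA when the row has fewer cells
    x[is.na(x)] <- ""
    x
  }, character(n_cols))

  t(padded)                  # one matrix row per text row
}

# Made-up example text, not taken from the real file.
example_page <- "Title A     Title B     Title C\r\n          \r\nfoo bar     baz     qux\r\n"
res <- split_page(example_page)
```

One known limitation of this approach: if the middle column is empty on some row, the gap between the first and third columns is a single long run of spaces, so the third column's text lands in the second cell. Splitting by fixed character positions instead of by gap length may be more robust in that case.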
I’ve never worked with strings and regex, so I’m having a hard time. I will also need to automate this for multiple files, which could produce many errors because the number of spaces or the column layout may vary.
If there is any other solution based on the specifics of the .pdf format, that would also be very interesting.
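One PDF-specific possibility, sketched here under assumptions: `pdftools::pdf_data()` returns, for each page, a data frame of individual words with their `x` and `y` positions, so words can be assigned to columns by position rather than by counting spaces. The helper `words_to_columns` below is hypothetical. It assumes the columns have equal width and that words on the same line share the same `y` value, neither of which is guaranteed for a real page.

```r
# Assign each word to one of `n_cols` equal-width columns by its x position.
# `words` is assumed to look like one element of pdftools::pdf_data():
# a data frame with (at least) the columns x, y and text.
words_to_columns <- function(words, page_width, n_cols = 3) {
  col_width <- page_width / n_cols
  col <- pmin(floor(words$x / col_width) + 1, n_cols)

  # Within each column, sort by line (y) then by position (x),
  # and paste the words of each line back together.
  lapply(split(words, col), function(w) {
    w <- w[order(w$y, w$x), ]
    as.vector(tapply(w$text, w$y, paste, collapse = " "))
  })
}

# Made-up word positions standing in for real pdf_data() output.
words <- data.frame(
  x    = c(10, 40, 210, 410, 10, 210),
  y    = c(100, 100, 100, 100, 112, 112),
  text = c("Title", "A", "Title-B", "Title-C", "foo", "bar"),
  stringsAsFactors = FALSE
)
cols <- words_to_columns(words, page_width = 600)

# With a real file, the inputs would come from something like:
#   words      <- pdftools::pdf_data(tmp)[[1]]
#   page_width <- pdftools::pdf_pagesize(tmp)$width[1]
```

The result is a list with one character vector per column, each holding that column's lines from top to bottom. Columns with no words are simply absent from the list.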
Yes. I ran the tests this morning and there were some glitches. The problem is that some lines contain only spaces (under a title, for example), and the algorithm discards those lines, so the columns end up misaligned when they are joined. I found your idea very interesting; I will test it manually here and then check whether this solution exists in R or Python. Thanks.
– TheBiro
The Python package pyWin32 may help.
– Tomás Barcellos
Thanks! I tested it here in Word and it works very well; it was a great idea.
– TheBiro
I edited my answer to use tabulizer. I tested with page 2 and it worked. Also, I simplified the code you used for scraping.
– José