PDF to text organizing columns

Question

Asked 8 years, 3 months ago

Viewed 604 times

7

I am scraping a site to download .pdf files, and I need their contents as organized text, because each line of text in a file spans 3 different columns.

For example, in this file you can see the 3 columns in question. I can read the file as text with the following code:

library("rvest")
library("pdftools")

pdf_link <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=3&data=03/04/2017&captchafield=firistAccess"

# Start a session and access the .pdf
s <- html_session(pdf_link) %>%
  jump_to(pdf_link)

# Save the file as a pdf and then read it
tmp <- tempfile(fileext = '.pdf')
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)

The problem is that each row of the extracted text contains all 3 columns separated by spaces, and each row (with its 3 columns) ends with a \r\n.

What I’d like to do is separate the columns so the text makes sense.

My idea is:

Split the text into lines at each \r\n.
Split each line into columns based on the number of spaces (for example, treat a run of 5 consecutive spaces as a column boundary).

I have never worked with strings and regex, so I’m having a hard time. I will also need to automate this for multiple files, which can cause many errors because the number of spaces and the arrangement of the columns vary.

Any other solution that takes advantage of the specifics of the .pdf format would also be very welcome.

2 answers

5

See if this helps:

doc1<-unlist(stringr::str_split(doc,"\\s{5,}|\n"))
c1<-paste0(doc1[seq(5,length(doc1),3)],collapse = " ")
c2<-paste0(doc1[seq(6,length(doc1),3)],collapse = " ")
c3<-paste0(doc1[seq(7,length(doc1),3)],collapse = " ")
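
To see what this split does, here is a minimal sketch on toy input, using only base R. The `page` string is made up for illustration and stands in for one element of `doc`; it has no header pieces to skip, so the sequences start at 1, 2 and 3 instead of 5, 6 and 7.

```r
# Toy page standing in for one element of `doc` (hypothetical data):
# two rows, three columns, 6 spaces between columns
gap <- strrep(" ", 6)
page <- paste(
  paste("Col A row 1", "Col B row 1", "Col C row 1", sep = gap),
  paste("Col A row 2", "Col B row 2", "Col C row 2", sep = gap),
  sep = "\n"
)

# Same pattern as above: break on runs of 5+ whitespace characters or on a newline
parts <- unlist(strsplit(page, "\\s{5,}|\n"))

# Column k is every third piece, starting at piece k
c1 <- paste(parts[seq(1, length(parts), 3)], collapse = " ")
c2 <- paste(parts[seq(2, length(parts), 3)], collapse = " ")
c3 <- paste(parts[seq(3, length(parts), 3)], collapse = " ")
```

This only works while every row yields exactly three pieces. A row with an empty column shifts every piece after it into the wrong column.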

You can also try the tabulizer package. It apparently handles columns of different sizes:

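library(tabulizer)
tmp <- tempfile()

url <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=2&data=03/04/2017&captchafield=firistAccess"

# Download the pdf to the temporary file (write_disk comes from httr, which is not attached)
httr::GET(url, httr::write_disk(tmp))

# Extract the text of the pdf
doc <- extract_text(tmp)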

Very good, thank you

– TheBiro

2017/04/25 at 02:11
1

Apparently extract_text already does all the work. Thank you very much

– TheBiro

2017/04/25 at 16:57

by Tomás Barcellos • **5,562** points · Answer 1 · 2017-04-25T13:33:21+00:00

@José’s answer works great for the page in question. But try that algorithm on page 2 or 10 and you’ll see that things get a little out of hand.

This happens because not all columns in the DOU (Diário Oficial da União) have the same length, which @José’s answer assumes. On page 2, for example, the first column has fewer than 40 lines and the rest of the text is divided equally between the other two columns. It also happens because the number of elements of doc1 that must be "skipped" (doc1[1:4] in that answer) varies.
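
A small illustration of the failure on toy rows, followed by one possible workaround. This is only a sketch: it assumes the extracted text keeps the page layout, so each column starts at about the same character position on every line. I have not verified that for the DOU pages. The `starts` values are made up and would need tuning per page.

```r
# Toy rows (hypothetical data): the first column is empty on the second row
rows <- sprintf("%-10s%-10s%s", c("A1", ""), c("B1", "B2"), c("C1", "C2"))

# Splitting on runs of spaces yields 3 pieces for row 1 but only 2 for row 2,
# so taking every third piece of the flattened vector mixes the columns up
pieces <- strsplit(trimws(rows), " {5,}")
lengths(pieces)

# Workaround sketch: cut every row at assumed column start positions
starts <- c(1, 11, 21)                        # assumed offsets, not measured on a real page
ends <- c(starts[-1] - 1, max(nchar(rows)))   # each column ends where the next one starts
by_pos <- t(sapply(
  rows,
  function(r) trimws(substring(r, starts, ends)),
  USE.NAMES = FALSE
))

# One row per line, one column per position; empty cells stay in place
by_pos
```

Cutting by position keeps the empty cell in the first column instead of letting "B2" slide into it. In exchange, you have to find the offsets for each page layout.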

My approach to this problem so far has been:

Open the DOU *.pdf in Word and save it as *.txt (this can be automated in many ways, but I don’t know of any in R).
Read the *.txt with readLines(). In the *.txt created by Word, the columns (whether one, two or three) are "stacked", which makes the text easier to work with.
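
The reading step could look roughly like the sketch below. The temporary file and its contents are invented as a stand-in for a file saved from Word, and the encoding is an assumption that depends on the option chosen when saving.

```r
# Stand-in for a *.txt saved from Word (hypothetical content written to a temp file)
txt_path <- tempfile(fileext = ".txt")
writeLines(
  c("First column, line 1", "", "First column, line 2", "Second column, line 1"),
  txt_path
)

# Read the stacked text line by line; the encoding here is an assumption
lines <- readLines(txt_path, encoding = "UTF-8", warn = FALSE)

# Trim stray whitespace and drop empty lines
lines <- trimws(lines)
lines <- lines[nzchar(lines)]
```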

The advantage and the disadvantage of this approach are the same: you rely on Microsoft’s algorithm to handle the pdf. It is much better than anything we could write quickly, but it is outside our control.