PDF to text organizing columns

I am scraping .pdf files, and I need to turn them into organized text, since each line of text in the file contains 3 different columns.

For example, in this file you can see the 3 columns in question. I can read the file as .txt with the following code:

library("rvest")
library("pdftools")

pdf_link <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=3&data=03/04/2017&captchafield=firistAccess"

# Start a session and access the .pdf
s <- html_session(pdf_link) %>%
  jump_to(pdf_link)

# Save the file as .pdf, then read its text
tmp <- tempfile(fileext = '.pdf')
writeBin(s$response$content, tmp)
doc <- pdf_text(tmp)

The problem is that in each row of the text file the 3 columns appear side by side, separated by spaces, and each row (with its 3 columns) ends in a \r\n.

What I’d like to do is separate the columns so the text makes sense.

The idea I got is:

  • Split the text into lines at each \r\n
  • Split each line into columns based on the number of spaces (for example: treat a run of 5 or more consecutive spaces as a column break), as in the sketch below.
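In rough R, something like this minimal sketch is what I have in mind (assuming doc holds a single page as returned by pdf_text(); the 5-space threshold is just a guess):

lines <- unlist(strsplit(doc, "\r\n|\n"))  # one element per line of the page
cols <- strsplit(lines, " {5,}")           # split each line on 5+ consecutive spaces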

I’ve never worked with strings and regex, so I’m having a hard time. I will also need to automate this for many files, which can lead to errors because the number of spaces and the layout of the columns vary.

If there is some other solution based on the specifics of the .pdf format, that would also be very interesting.

2 answers


@José's answer works well for the page in question. But try running that algorithm on page 2 or 10 and you'll see that things get a little out of hand.

This is because not all columns in the DOU have the same size (an assumption in @José's answer). On page 2, for instance, the first column has fewer than 40 lines and the remaining text is split evenly between the two other columns; the number of elements of doc1 that must be "skipped" (doc1[1:4]) also varies from page to page.

My approach to this problem so far has been:

  1. Open the DOU *.pdf in Word and save it as *.txt (this can be automated in many ways, though I don't know of any in R).

  2. Read the *.txt with readLines(). In the *.txt created by Word the columns (be they one, two or three) come "stacked", so you can work with the text much more easily; see the sketch below.
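A minimal sketch of step 2 (the file name is hypothetical, and you may need to adjust the encoding):

txt <- readLines("dou_pagina.txt", encoding = "UTF-8")
head(txt)  # the columns now come stacked, one block after the other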

The advantage/disadvantage of this approach is that you rely on Microsoft's algorithm to handle the pdf. It is much better than anything one could write quickly, but it is outside our control.

  • Yeah. I ran some tests this morning and there were a few glitches. The problem is that some lines contain only spaces (under a title, for example), and the algorithm discards those lines, which makes the result wrong when the columns are joined back together. I found your idea very interesting; I will test it manually here and then look for such a solution in R or Python. Thanks.

  • The Python package pyWin32 can help.

  • Thanks! I tested it in Word and it works very well; it was a great idea.

  • I edited my answer to use tabulizer. I tested it with page 2 and it worked. Also, I simplified the code you used for scraping.



See if this helps:

# Split the page on newlines or on runs of 5+ whitespace characters,
# then rebuild each column by taking every 3rd element; doc1[1:4] is
# the page header, which is skipped
doc1 <- unlist(stringr::str_split(doc, "\\s{5,}|\n"))
c1 <- paste0(doc1[seq(5, length(doc1), 3)], collapse = " ")
c2 <- paste0(doc1[seq(6, length(doc1), 3)], collapse = " ")
c3 <- paste0(doc1[seq(7, length(doc1), 3)], collapse = " ")

You can also try the tabulizer package. It apparently overcomes the limitation of columns with different sizes:

library(tabulizer)

tmp <- tempfile(fileext = ".pdf")

url <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=2&data=03/04/2017&captchafield=firistAccess"

# Download the .pdf straight to disk, then extract its text
httr::GET(url, httr::write_disk(tmp))

doc <- extract_text(tmp)
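extract_text() returns the whole page as a single string; as a minimal follow-up sketch (assuming you then want one element per line), you can split it back up:

linhas <- unlist(strsplit(doc, "\n"))  # one element per line, columns already merged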
  • Very good, thank you

  • extract_text apparently already does the whole job. Thank you very much.
