Reading PDF in R

I am working on a college assignment and would like to get the revenue and attendance for each match of the Brazilian championship in recent years. CBF publishes this information across a series of links; one example is the borderô (match financial report) used in the code below. For similar problems I have used the tabulizer package, as in the code below:

library(tabulizer)
url <- 'https://conteudo.cbf.com.br/sumulas/2014/1421b.pdf'
d <- extract_tables(url, encoding = "UTF-8")

For tables created natively in the PDF this works perfectly, but for this type of PDF (which was probably printed, scanned, and then saved as a PDF) it does not work: the code returns a list with 0 elements. Any ideas, or packages I can use?

  • @Flavio Silva, in this case the problem is not extracting data from a PDF but extracting data from an image. Note that this PDF has no structure, only an image. You need a program that extracts text from images.

1 answer

The table in the PDF is an image. tabulizer looks for text elements, and it returns an empty list precisely because there is no text in the file. You need a technique that recognizes text in images. I suggest looking into OCR (optical character recognition), a process that extracts text from an image.
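One way to confirm that a PDF has no text layer is to read its text directly and see whether anything comes back. The sketch below assumes the pdftools package, which the question does not use, and `has_text_layer` is a hypothetical helper name chosen for illustration.

```r
# Hypothetical helper: TRUE if any page contains non-whitespace text.
# `pages` is a character vector with one element per page, which is
# the shape pdftools::pdf_text() returns.
has_text_layer <- function(pages) {
  any(nzchar(trimws(pages)))
}

# Usage (assumes pdftools is installed and the URL is reachable):
# tmp <- tempfile(fileext = ".pdf")
# download.file("https://conteudo.cbf.com.br/sumulas/2014/1421b.pdf", tmp, mode = "wb")
# has_text_layer(pdftools::pdf_text(tmp))
```

If the helper returns FALSE for a file, table extractors that rely on embedded text will likely find nothing in it, and OCR is the usual next step.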

In R, the tesseract package performs this operation. Here is a link to the package's tutorial, which shows how to extract text from images:

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html

This part of the tutorial shows how to extract text from a PDF:

https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html#read_from_pdf_files
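Along the lines of that tutorial, a minimal sketch of the approach could look like the code below. It assumes the tesseract and pdftools packages are installed. The function names `ocr_pdf` and `ocr_lines`, the choice of Portuguese ("por") as the OCR language, and the 300 dpi setting are illustrative assumptions, not something taken from the question.

```r
# Sketch: run OCR on every page of a scanned PDF stored locally.
# Assumes the tesseract and pdftools packages are installed.
ocr_pdf <- function(pdf_file, language = "por", dpi = 300) {
  # Download the language data if it is not available yet
  if (!language %in% tesseract::tesseract_info()$available) {
    tesseract::tesseract_download(language)
  }
  engine <- tesseract::tesseract(language)

  # Render each page to a PNG file (written to the working directory)
  images <- pdftools::pdf_convert(pdf_file, format = "png", dpi = dpi)
  on.exit(unlink(images), add = TRUE)  # remove the PNG files afterwards

  # One string of recognized text per page
  vapply(images, function(img) tesseract::ocr(img, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: split OCR output into trimmed, non-empty lines
ocr_lines <- function(text) {
  lines <- trimws(unlist(strsplit(text, "\n", fixed = TRUE)))
  lines[nzchar(lines)]
}

# Usage (needs network access):
# tmp <- tempfile(fileext = ".pdf")
# download.file("https://conteudo.cbf.com.br/sumulas/2014/1421b.pdf", tmp, mode = "wb")
# lines <- ocr_lines(ocr_pdf(tmp))
```

OCR output is plain text rather than a table, so the revenue and attendance values would still need to be located in the resulting lines, and the quality of the recognition depends on the scan.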

  • I will read it, thank you very much for your reply!
