How to read PDF data in R?

I have numerous PDF files containing water well reports from CPRM, like this:

http://siagasweb.cprm.gov.br/layout/pdf/exportar_pdf.php?ponto=4300000556

These files contain information about the lithology of the soil at each well, as shown in the image below.

[Image: lithology table from a well report]

The lithology table is on the second page of the file, and its contents vary according to the characteristics of the well. I need to gather this information from all the files in a single place, such as a data frame.

How can I read PDF files and then group this information using R?

  • 1

    PDF files are complicated. If you use pdftotext, for example, the information gets messy. Why not read the files directly from the web, without exporting? It seems easier to extract from the "Geological" tab than from the PDF, at: http://siagasweb.cprm.gov.br/layout/detailedphp?ponto=4300000556#tabs-3

1 answer

I’ll give you an incomplete answer because I’m running out of time, but I think it might help.

Someone can then edit it to add the last step.

You can use the extractr package. Read the installation instructions here: https://github.com/sckott/extractr.

This package uses a number of APIs available on the internet to convert a PDF to text.

For your PDF, I did the following:

1) I saved the file to my desktop and called the function:

library(extractr)
xpdf <- extract("Desktop/doc.pdf", "xpdf")

2) I isolated the part of the text that contains the data you need, using substrings (the functions come from the stringr package):

> library(stringr)
> lito <- str_locate(xpdf$data, "Litológicos") # find the end of "Litológicos"
> hidro <- str_locate(xpdf$data, "Hidrogeológicos") # find the start of "Hidrogeológicos"
> dados <- str_sub(xpdf$data, start = lito[2] + 4, end = hidro[1] - 5)
> dados
[1] "De (m):, , Até (m):, , Litologia:, , Descrição Litológica:, , 0, , 3, , Arenito fino, , SOLO E ARENITO FINO A MUITO FINO, QUARTZOSO, ESBRANQUICADO, MUITO POUCO ARGILOSO, , 3, , 13, , Arenito fino, , ARENITO FINO A MUITO FINO, AVERMELHADO, MODERADAMENTE ARGILOSO, , 13, , 21, , Arenito argiloso, , ARENITO FINO A MUITO FINO, ESBRANQUICADO, MUITO POUCO ARGILOSO, CONCENTRACOES LOCALIZADAS, , 21, , 42, , Arenito fino, , ARENITO FINO A MUITO FINO, ESBRANQUICADO A ROSADO, MODERADAMENTE ARGILOSO, , 42, , 55, , Arenito fino, , ARENITO FINO A MUITO FINO, ESBRANQUICADO A ROSADO, POUCO ARGILOSO, , 55, , 60, , Arenito fino, , ARENITO FINO A MUITO FINO, COM TONS AVERMELHADOS, FORTEMENTE ARGILOSO, , 60, , 63, , Arenito fino, , ARENITO FINO A MUITO FINO, TONALIDADE ROSEA, MODERADAMENTE ARGILOSO, , 63, , 70, , Arenito fino, , ARENITO FINO A MUITO FINO, TONALIDADE ROSEA, MODERADAMENTE ARGILOSO, , 70, , 76, , Arenito fino, , ARENITO FINO A MUITO FINO, TONALIDADE ROSEA, POUCO ARGILOSO, , 76, , 87, , Arenito fino, , ARENITO FINO A MUITO FINO, TONALIDADE ROSEA, MODERADAMENTE ARGILOSO, , 87, , 102, , Arenito fino, , ARENITO FINO A MUITO FINO, TONALIDADE ROSEA, POUCO ARGILOSO"
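If this step is repeated over many reports, one of the markers may be missing from some file. In that case `str_locate` returns `NA`, and the extracted substring silently becomes `NA` too. Below is a minimal sketch of a guarded version in base R. The helper name `extract_between`, its parameters, and the sample text with its `[START]`/`[END]` markers are made up for illustration. The offsets of 4 and 5 characters used above are specific to this PDF, so here they are left as optional parameters.

```r
# Hypothetical helper (not part of extractr or stringr): return the text
# between two literal markers, or stop with a clear error if one is missing.
extract_between <- function(text, start_marker, end_marker,
                            skip_after = 0, trim_before = 0) {
  # fixed = TRUE treats the markers as literal strings, not regular expressions
  start_match <- regexpr(start_marker, text, fixed = TRUE)
  end_match   <- regexpr(end_marker, text, fixed = TRUE)
  if (start_match == -1 || end_match == -1) {
    stop("marker not found in text")
  }
  # First character after the start marker, plus any extra characters to skip
  first <- as.integer(start_match) + attr(start_match, "match.length") + skip_after
  # Last character before the end marker, minus any extra characters to trim
  last <- as.integer(end_match) - 1 - trim_before
  substr(text, first, last)
}

# Made-up sample text standing in for xpdf$data
sample_text <- "header [START] 0, , 3, , Arenito fino [END] footer"
between <- extract_between(sample_text, "[START] ", " [END]")
print(between)
```

Wrapping the call in `tryCatch` inside a loop over files would let you log the problematic reports and carry on with the rest.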

Now, what you need to try is to convert this string into a data.frame.
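One possible way to do that last step is sketched below. It assumes every report uses the same ", , " separator and the same four columns as the output above, which may not hold for other PDFs. The function name `parse_lithology`, the column names, and the second sample well are made up for illustration; the first sample is a shortened copy of the string shown above.

```r
# Hypothetical helper: turn the ", , "-separated text into a data frame.
parse_lithology <- function(dados, well_id) {
  fields <- strsplit(dados, ", , ", fixed = TRUE)[[1]]
  # The first four fields are the column headers; the rest must fill whole rows.
  if (length(fields) < 8 || length(fields) %% 4 != 0) {
    stop("unexpected number of fields for well ", well_id)
  }
  rows <- matrix(fields[-(1:4)], ncol = 4, byrow = TRUE)
  # Non-numeric depths become NA here; they are rejected just below.
  from_m <- suppressWarnings(as.numeric(rows[, 1]))
  to_m   <- suppressWarnings(as.numeric(rows[, 2]))
  if (anyNA(from_m) || anyNA(to_m)) {
    stop("non-numeric depth found for well ", well_id)
  }
  data.frame(
    well_id     = well_id,
    from_m      = from_m,
    to_m        = to_m,
    lithology   = rows[, 3],
    description = rows[, 4],
    stringsAsFactors = FALSE
  )
}

header <- "De (m):, , Até (m):, , Litologia:, , Descrição Litológica:, , "

# Shortened copy of the text extracted above
well_a <- paste0(header,
  "0, , 3, , Arenito fino, , SOLO E ARENITO FINO A MUITO FINO, QUARTZOSO, , ",
  "3, , 13, , Arenito fino, , ARENITO FINO A MUITO FINO, AVERMELHADO")

# Made-up second well, only to show how several reports are stacked
well_b <- paste0(header,
  "0, , 5, , Arenito argiloso, , ARENITO FINO, POUCO ARGILOSO")

# One data frame with the rows of all wells, identified by well_id
all_wells <- do.call(rbind, list(
  parse_lithology(well_a, "4300000556"),
  parse_lithology(well_b, "example-2")
))
print(all_wells)
```

The field-count check is there so that a report with a different layout fails with an error instead of producing shifted columns.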

Anyway, that’s one way to do it. But as Molx said, PDFs are always complicated; I think the best approach would be to extract the data from the web page itself.

  • 2

    extractr seems to work relatively well. You can turn the result into a data frame with the following code: df <- as.data.frame(matrix(unlist(strsplit(x = dados, split = ", , ")), ncol = 4, byrow = TRUE), stringsAsFactors = FALSE); colnames(df) <- df[1,]; df <- df[-1,]; df[,1] <- as.numeric(df[,1]); df[,2] <- as.numeric(df[,2]) But I still find it a little "risky": the ", , " separator worked well here, but it may go wrong for another PDF and mess everything up.
