How to read tables from a pdf?

Asked

Viewed 685 times

0

I am trying to read tables from a certain pdf file, using the iTextSharp, I found many answers that indicate using the LocationTextExtractionStrategy, only that my table may vary position along PDF pages.

Does anyone have any idea how I can solve this problem?

  • You can read the PDF as a String

  • I know, but I was wondering if it’s possible to get the table in structured form from a PDF

  • If you have how to read the PDF and by some character between each column and at the end of the line. You could use a regex to capture a string pattern

  • I’m not the one who PDF :/

  • I don’t know much about it, but you can only get it by string if you find patterns before, during and after the table. But if you have a way for a function to return the table as a matrix would be very good.

1 answer

-1

A solution for reading a PDF table is to use a two-step approach:

  1. Extract PDF in plain text
  2. Interpret plain text with some tool;

Of course this solution is only valid if the PDF file is textual, ie it is not contained by a main image.

With the library pdfbox.jar it is possible to extract the text by passing the basic parameters "Extracttext". Ex.:

java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt

Each row in the PDF will be a line n text file. Each row in the table will also be.

With the output file, you can interpret using Pattern which is a very useful tool for this purpose. This way it is possible to read the PDF files.

Example PDF file: inserir a descrição da imagem aqui

Text output:
Supervisor:
Lanofu Silva
Date
06/09/2018
ID Product Date Quantity
3 Allen Screw 6" 04/09/2018 300
9 Axis 127 Revest. 04/09/2018 500
15 Profile 3 12x15 05/09/2018 400
72 Metal box 15x15x5 02/09/2018 100
70 Helical gear 1" Nylon 01/09/2048 100
45 Heline Drone 5H-12BR 02/09/2018 130

An example of Patter to recognize each production input of the table:

Pattern padraoLinha = Patter.compile(“\d+\s.+\s\d{2}/\d{2}/\d{4}\s\d+”);
int countEntradaProducao = 0;
For(String linha : Files.readAllBytes(arquivoTexto.toPath())
{   
    if(padraoLinha.matcher(linha).matchs())
    {
        countEntradaProducao++;
        //Faço alguma coisa
    }
}

Browser other questions tagged

You are not signed in. Login or sign up in order to post.