How to read tables from a pdf?

Question

How to read tables from a pdf?

Asked 8 years, 10 months ago

Viewed 685 times

0

I am trying to read tables from a certain pdf file, using the iTextSharp, I found many answers that indicate using the LocationTextExtractionStrategy, only that my table may vary position along PDF pages.

Does anyone have any idea how I can solve this problem?

You can read the PDF as a String

– Marco Souza

2016/08/28 at 19:37
I know, but I was wondering if it’s possible to get the table in structured form from a PDF

– Felipe Avelar

2016/08/28 at 20:01
If you have how to read the PDF and by some character between each column and at the end of the line. You could use a regex to capture a string pattern

– Marco Souza

2016/08/28 at 20:07
I’m not the one who PDF :/

– Felipe Avelar

2016/08/28 at 21:53
I don’t know much about it, but you can only get it by string if you find patterns before, during and after the table. But if you have a way for a function to return the table as a matrix would be very good.

– Ícaro Dantas

2016/09/02 at 08:02

1 answer

Browser other questions tagged c# itextsharp

You are not signed in. Login or sign up in order to post.

by Desenvolvedor 1RISJC • 7 points · Answer 1 · 2018-09-06T15:10:45+00:00

A solution for reading a PDF table is to use a two-step approach:

Extract PDF in plain text
Interpret plain text with some tool;

Of course this solution is only valid if the PDF file is textual, ie it is not contained by a main image.

With the library pdfbox.jar it is possible to extract the text by passing the basic parameters "Extracttext". Ex.:

java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt

Each row in the PDF will be a line n text file. Each row in the table will also be.

With the output file, you can interpret using Pattern which is a very useful tool for this purpose. This way it is possible to read the PDF files.

Example PDF file:

Text output:
Supervisor:
Lanofu Silva
Date
06/09/2018
ID Product Date Quantity
3 Allen Screw 6" 04/09/2018 300
9 Axis 127 Revest. 04/09/2018 500
15 Profile 3 12x15 05/09/2018 400
72 Metal box 15x15x5 02/09/2018 100
70 Helical gear 1" Nylon 01/09/2048 100
45 Heline Drone 5H-12BR 02/09/2018 130

An example of Patter to recognize each production input of the table:

Pattern padraoLinha = Patter.compile(“\d+\s.+\s\d{2}/\d{2}/\d{4}\s\d+”);
int countEntradaProducao = 0;
For(String linha : Files.readAllBytes(arquivoTexto.toPath())
{   
    if(padraoLinha.matcher(linha).matchs())
    {
        countEntradaProducao++;
        //Faço alguma coisa
    }
}