A solution for reading a PDF table is to use a two-step approach:
- Extract PDF in plain text
- Interpret plain text with some tool;
Of course this solution is only valid if the PDF file is textual, ie it is not contained by a main image.
With the library pdfbox.jar
it is possible to extract the text by passing the basic parameters "Extracttext". Ex.:
java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt
Each row in the PDF will be a line n text file. Each row in the table will also be.
With the output file, you can interpret using Pattern which is a very useful tool for this purpose. This way it is possible to read the PDF files.
Example PDF file:
Text output:
Supervisor:
Lanofu Silva
Date
06/09/2018
ID Product Date Quantity
3 Allen Screw 6" 04/09/2018 300
9 Axis 127 Revest. 04/09/2018 500
15 Profile 3 12x15 05/09/2018 400
72 Metal box 15x15x5 02/09/2018 100
70 Helical gear 1" Nylon 01/09/2048 100
45 Heline Drone 5H-12BR 02/09/2018 130
An example of Patter to recognize each production input of the table:
Pattern padraoLinha = Patter.compile(“\d+\s.+\s\d{2}/\d{2}/\d{4}\s\d+”);
int countEntradaProducao = 0;
For(String linha : Files.readAllBytes(arquivoTexto.toPath())
{
if(padraoLinha.matcher(linha).matchs())
{
countEntradaProducao++;
//Faço alguma coisa
}
}
You can read the PDF as a String
– Marco Souza
I know, but I was wondering if it’s possible to get the table in structured form from a PDF
– Felipe Avelar
If you have how to read the PDF and by some character between each column and at the end of the line. You could use a regex to capture a string pattern
– Marco Souza
I’m not the one who PDF :/
– Felipe Avelar
I don’t know much about it, but you can only get it by string if you find patterns before, during and after the table. But if you have a way for a function to return the table as a matrix would be very good.
– Ícaro Dantas