PDF reading in .NET

I have spent a long time looking for a way to read a PDF document that contains SINAPI input tables and save the data in my database, but I have no idea how to do it. Could someone give me a hint?

PDF link here

More complex PDF link here

1 answer

Reading the PDF directly is possible, but it is only feasible if the PDF keeps a "clean" layout (well-defined rows and columns, no multi-line cells, etc.). Even then, a change in layout can break all the code written to read it.

In most cases, a viable solution is to convert the PDF into another format: HTML, TXT, XLS, etc.
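
As an illustration of why a converted format is easier to handle, here is a minimal sketch for the TXT case. It assumes that the converter writes one table row per line and separates columns with two or more spaces, which has to be checked against the real output. The class name `TextTableParser` and the sample row in the usage note are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextTableParser
{
    // Assumption: columns are separated by two or more whitespace characters,
    // while words inside a cell are separated by a single space.
    private static readonly Regex ColumnGap = new Regex(@"\s{2,}");

    // Splits one line of converted text into its column values.
    public static string[] SplitColumns(string line)
    {
        if (string.IsNullOrWhiteSpace(line))
        {
            // Blank lines carry no table data.
            return new string[0];
        }

        // Trim first so leading or trailing padding does not create empty columns.
        return ColumnGap.Split(line.Trim());
    }
}
```

For a made-up line such as `00001234   CIMENTO PORTLAND   KG   0,55`, `TextTableParser.SplitColumns` would return four values, keeping "CIMENTO PORTLAND" together as one cell. Cells that themselves contain double spaces, or rows wrapped over several lines, would break this simple rule.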

Here is a good online tool for PDF-to-HTML conversion, which would make the document easy to read from various languages (including C#). See an example of how your document would look:

Document converted to HTML

Since the tables in the document do not follow a well-defined pattern, the converted HTML is difficult to parse, for example with HtmlAgilityPack.

One of the tools for converting a PDF into a format that is "readable" from a programming language is Able2Extract.

See the settings and how your document was converted to XLS:

It is the best option for conversion because it lets you align and select only the text you need.


Setup: select only the table and columns for conversion

A free tool to extract PDF data: PDF Multitool Utility

With the table converted, all that is left is to write the code that reads the XLS.

The code to read an XLS file is certainly much more practical than the code to read a PDF:

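using System;
using System.Data.OleDb;

// Jet 4.0 reads legacy .xls files; HDR=Yes treats the first row as column headers.
string con = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=D:\temp\test.xls;Extended Properties='Excel 8.0;HDR=Yes;'";
using (OleDbConnection connection = new OleDbConnection(con))
{
    connection.Open();
    // Select every row from the worksheet named "Sheet1".
    using (OleDbCommand command = new OleDbCommand("select * from [Sheet1$]", connection))
    using (OleDbDataReader dr = command.ExecuteReader())
    {
        while (dr.Read())
        {
            // Print the first column of the current row.
            var firstColumn = dr[0];
            Console.WriteLine(firstColumn);
        }
    }
}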

Some of the many examples available on the web: Here and Here
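
Since the goal is to save the rows in a database, each row usually has to be turned into typed values first. The following is only a sketch under stated assumptions: the column order (code, description, unit, price) and the names `InputItem` and `InputRowMapper` are hypothetical, the cells are assumed to arrive as text, and prices are assumed to use Brazilian formatting ("." for thousands, "," for decimals).

```csharp
using System;
using System.Globalization;

// Hypothetical shape of one table row; adjust to the real columns.
public sealed class InputItem
{
    public string Code { get; set; }
    public string Description { get; set; }
    public string Unit { get; set; }
    public decimal Price { get; set; }
}

public static class InputRowMapper
{
    // Assumption: numbers look like "1.234,56".
    private static readonly NumberFormatInfo BrazilianNumbers = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Returns false instead of throwing when a row does not fit the expected layout,
    // so header lines or broken rows can simply be skipped.
    public static bool TryMap(string[] cells, out InputItem item)
    {
        item = null;
        if (cells == null || cells.Length < 4)
        {
            return false;
        }

        decimal price;
        if (!decimal.TryParse(cells[3], NumberStyles.Number, BrazilianNumbers, out price))
        {
            return false;
        }

        item = new InputItem
        {
            Code = cells[0].Trim(),
            Description = cells[1].Trim(),
            Unit = cells[2].Trim(),
            Price = price
        };
        return true;
    }
}
```

Each row read from the sheet could be passed to `InputRowMapper.TryMap` as an array of strings, and only the rows that map successfully would then be sent to the database. Returning `false` rather than throwing is a deliberate choice here, because converted tables tend to contain title and subtotal rows that do not match the data layout.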

  • Very good, thanks for the tip!

  • You're welcome, @Ronaldoasevedo. Please mark the answer as accepted if it is satisfactory.

  • My main problem is that I need the user to load the PDF into the system, because it is updated every month, so it has to be a very simple procedure.

  • I understand. Reading PDFs is not simple because of the complexity of the format, as your case shows. There are SDKs, mostly paid, that can read PDFs, but as I explained, it is hard to read the data reliably. I have already tried to do something similar to what you are asking, and in my case the best solution was to convert the PDF. One question: will the PDF always follow the same "model", or will there be others? @Ronaldoasevedo

  • There are several tables, but their layout does not change; only the values change or more items are added. I added another table, called SICRO, to the main post; it is more complex.
