Optimize Camelot for large PDF files

Good afternoon! I use Camelot to extract data from PDF files (bank statements, to be precise). However, I have a PDF with more than 5,000 pages, and Camelot is a bit slow. I wrote a script where Camelot processes x pages at a time until all pages are extracted, accumulating every table it finds in a DataFrame. The problem is that I believe the DataFrame is getting rather large. I export the DataFrame to Excel every x pages, but I cannot reset it, because I need to keep adding to the Excel file.

I wonder if there is any way to save this data by appending to the file, so I can empty the DataFrame every x pages. to_csv writes plain text by default, and to_excel does not allow appending.

import os
import camelot, PyPDF2, pandas, tqdm
from tkinter import Tk, filedialog as dlg

Tk().withdraw()
file = dlg.askopenfilename()
np = int(input('pages per run: >> '))
final = pandas.DataFrame()

reader = PyPDF2.PdfFileReader(file)
pag = reader.getNumPages()
ini = 1
eend = np
# strip('.pdf') removes matching characters, not the suffix; splitext is safe
out = os.path.splitext(file)[0] + '.xlsx'

for k in tqdm.tqdm(range(1, pag, np)):
    pp = f'{ini}-{eend}'
    t = camelot.read_pdf(file, flavor='stream', pages=pp)
    for n in t:
        final = pandas.concat([final, n.df])
    final.to_excel(out)  # rewrites the whole accumulated DataFrame every batch
    if eend == 'end':
        break
    ini += np
    eend += np
    if eend > pag:
        eend = 'end'

3 answers


Maybe this will give you some optimization:

import os
import camelot, PyPDF2, tqdm
import pandas as pd
from tkinter import Tk, filedialog as dlg

Tk().withdraw()

file_path = dlg.askopenfilename()
last_page = PyPDF2.PdfFileReader(file_path).getNumPages()

number_pages = int(input('pages per run: >> '))

ini = 1
eend = 0

for k in tqdm.tqdm(range(1, last_page, number_pages)):
    a = pd.DataFrame()

    eend += number_pages

    pages = f'{ini}-{eend}'
    tables = camelot.read_pdf(file_path, flavor='stream', pages=pages)
    [a := pd.concat([a, item.df]) for item in tables]

    # append each batch; write the header only when the file does not exist yet
    a.to_csv('output.csv', encoding='latin-1', mode='a',
             header=not os.path.exists('output.csv'))
    ini += number_pages

I added a list comprehension and removed one of the for loops from your code. You can change the encoding to whatever suits you best when saving the CSV. I haven't added any validation for the page split size, so remember to check beforehand that the split is actually possible.
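The missing validation mentioned above can be sketched as follows: clamp each batch's end page so the last range never runs past the document (the function name and numbers are illustrative):

```python
def page_ranges(last_page, batch):
    """Yield 'start-end' page strings that never run past last_page."""
    for start in range(1, last_page + 1, batch):
        end = min(start + batch - 1, last_page)
        yield f'{start}-{end}'

print(list(page_ranges(10, 4)))  # ['1-4', '5-8', '9-10']
```

Each yielded string can be passed straight to the `pages` parameter of `camelot.read_pdf`.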

  • @Imonferrari I believe it has improved the efficiency. I tested it here at home (the computer is a bit slow); tomorrow I'll run the test at work. I'm still learning, and I still do a lot of things the crude way, but with each script I learn a new library or a simpler way of programming. Thanks for the help!

  • @Julianorodrigues, good evening! That's it, always learning! I hope the script improves the efficiency of your work. Anything you need, we're here. Hug!

  • @Imonferrari I tested it here and got a good optimization. I noticed that pd.concat was scrambling the data, so I removed the 'a' DataFrame from the script and wrote each item.df to CSV directly. I'm testing tabula now; it seems to be faster, but I'm noticing some exported rows are missing.

  • @Julianorodrigues, good morning! Possibly that works too; I had no way to test, so it gets complicated, haha.

  • @Imonferrari I tested with Camelot and the optimization wasn't that big, so I replaced it with tabula-py. Now what used to take me 5 to 6 hours with Camelot takes 15 to 20 minutes!!

  • @Julianorodrigues Nice, very good! Big hug!



After the help and the tests, using tabula the script ended up like this:

import os
import tabula, PyPDF2, tqdm
from tkinter import Tk, filedialog as dlg

Tk().withdraw()
file_path = dlg.askopenfilename()
last_page = PyPDF2.PdfFileReader(file_path).getNumPages()

number_pages = input('pages per run: >> ')
if number_pages == '':
    number_pages = last_page
else:
    number_pages = int(number_pages)

# strip('.pdf') removes matching characters, not the suffix; splitext is safe
out = os.path.splitext(file_path)[0] + '.csv'

ini = 1
eend = 0
for k in tqdm.tqdm(range(1, last_page, number_pages)):
    eend += number_pages
    if eend > last_page:
        eend = last_page

    pages = f'{ini}-{eend}'

    tables = tabula.read_pdf(file_path, guess=False, silent=True, pages=pages)
    for item in tables:
        # the try handles some encoding errors, so no data is lost
        try:
            item.to_csv(out, encoding='latin-1', mode='a')
        except UnicodeEncodeError:
            item.to_csv(out, mode='a')

    ini += number_pages

There was a very big improvement using tabula.


I have no way to test your program, so the answer will be more theoretical.

The idea:

  • Create a list of results (DataFrames) and then concatenate them

See the example below:

Creating the dataframes

import pandas as pd

def processa_pagina_do_pdf(n):
    return pd.DataFrame({"col": [n]})    


df_list = []

steps = 10

for x in range(0, 100, steps):
    for pagina in range(x, x+steps):
        df_tmp = processa_pagina_do_pdf(pagina)
        df_list.append(df_tmp)

Concatenating the Dataframe List

final_df = pd.concat(df_list)

Then just save the final_df to CSV or Excel.
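Why the list helps: repeated concatenation inside the loop copies the whole accumulated frame on every iteration (quadratic overall), while a single `pd.concat` over a list is linear. A side-by-side sketch reusing the hypothetical `processa_pagina_do_pdf` from above:

```python
import pandas as pd

def processa_pagina_do_pdf(n):
    return pd.DataFrame({"col": [n]})

# quadratic: every iteration copies everything accumulated so far
slow = pd.DataFrame()
for pagina in range(100):
    slow = pd.concat([slow, processa_pagina_do_pdf(pagina)])

# linear: accumulate in a list, concatenate once at the end
fast = pd.concat([processa_pagina_do_pdf(pagina) for pagina in range(100)],
                 ignore_index=True)
```

Both produce the same rows; the difference only shows up in runtime as the page count grows.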

Note: give PyMuPDF a chance. It is good and fast at extracting PDF data.

I hope this helps resolve the issue.

  • Thanks for the help, Paul. I will test PyMuPDF.
