Good afternoon! I use Camelot to extract data from PDF files (bank statements, to be precise). However, I have a PDF file with more than 5000 pages, and Camelot is a bit slow. I decided to write a script in which Camelot processes x pages at a time until all pages are extracted, storing every table it finds in a dataframe. The problem is that I believe the dataframe is becoming quite large. I export the dataframe to Excel every x pages, but I do not reset it, because I need to keep adding to the Excel file.
Is there any way to save this data incrementally, appending to the file and resetting the dataframe every x pages? to_csv garbles the text, and to_excel does not allow appending.
import os
import camelot, PyPDF2, pandas, tqdm
from tkinter import Tk, filedialog as dlg

Tk().withdraw()
file = dlg.askopenfilename()
step = int(input('Enter the number of pages per run: >> '))
# splitext drops only the extension; str.strip('.pdf') strips characters, not a suffix
out = os.path.splitext(file)[0] + '.xlsx'
final = pandas.DataFrame()
# On PyPDF2 >= 3.0 use PyPDF2.PdfReader(file) and len(reader.pages) instead
pag = PyPDF2.PdfFileReader(file).getNumPages()
# pag + 1 so that the last page is not skipped
for ini in tqdm.tqdm(range(1, pag + 1, step)):
    # Clamp the last chunk to the final page
    eend = min(ini + step - 1, pag)
    t = camelot.read_pdf(file, flavor='stream', pages=f'{ini}-{eend}')
    # DataFrame.append was removed in pandas 2.0; concat works in every version
    final = pandas.concat([final] + [n.df for n in t])
    # The whole workbook is rewritten on every pass, since to_excel cannot append
    final.to_excel(out)
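For reference, one way to avoid the growing dataframe is to append each chunk straight to a CSV file. This is only a minimal sketch, assuming CSV output is acceptable. The names `page_ranges`, `append_tables` and `extract_in_chunks` are hypothetical helpers, not part of Camelot or pandas.

```python
import os
import pandas as pd


def page_ranges(total_pages, step):
    """Yield page strings such as '1-50', '51-100', ... for camelot.read_pdf."""
    for start in range(1, total_pages + 1, step):
        yield f"{start}-{min(start + step - 1, total_pages)}"


def append_tables(frames, csv_path):
    """Append each DataFrame to csv_path so nothing accumulates in memory."""
    for df in frames:
        # mode="a" appends to the file (creating it if needed);
        # no header row is written, so repeated calls just add data rows
        df.to_csv(csv_path, mode="a", header=False, index=False)


def extract_in_chunks(pdf_path, csv_path, total_pages, step):
    """Read the PDF `step` pages at a time and flush every chunk to disk."""
    import camelot  # imported lazily so the helpers above work without it

    for pages in page_ranges(total_pages, step):
        tables = camelot.read_pdf(pdf_path, flavor="stream", pages=pages)
        append_tables((t.df for t in tables), csv_path)
```

Because the file is opened in append mode, running the script twice on the same output path would duplicate the rows, so the CSV should be deleted (for example with `os.remove`) before a fresh run.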
@Imonferrari I believe it improved the efficiency. I tested it here at home, where the computer is a bit slow; tomorrow I will test it at work. I'm still learning, and I still do a lot of things the crude way, but with each script I learn a new library or a simpler way of programming. Thanks for the help!
– Juliano Rodrigues
@Julianorodrigues, good evening! That's it, always learning! Hopefully the script improves efficiency at your work. If you need anything, we are here. Cheers!
– lmonferrari
@Imonferrari I tested it here and it is well optimized. I noticed that pd.concat was scrambling the data, so I removed the dataframe 'a' from the script and left item.df.to_csv writing directly. I'm testing tabula now; it seems to be faster, but I'm noticing that some lines are missing from the export.
– Juliano Rodrigues
@Julianorodrigues, hey, good morning! That may well work too; I had no way to test it, which makes things tricky, haha.
– lmonferrari
@Imonferrari I tested it with Camelot and got a good optimization! Then I replaced it with tabula-py, and what would take me 5 to 6 hours with Camelot now takes 15 to 20 minutes!!
– Juliano Rodrigues
@Julianorodrigues Great, very good! Big hug!
– lmonferrari