Hello, I need to extract a specific snippet from a PDF. The goal is to find the state registrations that were downloaded, as published in the official journal of the state where I live; this is part of the data collection for my TCC (undergraduate thesis). That's why I'm writing code that opens the PDF and pulls out the state registrations. The PDF always contains the keyword "downloaded state subscriptions" followed by the registration number, and it is that number I want to extract.
So far I have this code:
import PyPDF2 as p2

# Open the PDF in binary mode and append each page's text to a TXT file
with open('DOEAL-25_09_2020-COMPLETO.pdf', 'rb') as pdf, \
        open("teste_de_pdf.txt", 'a', encoding='utf-8') as arq:
    pdf_reader = p2.PdfFileReader(pdf)
    n = pdf_reader.numPages
    for i in range(n):
        print('Página {}'.format(i+1))
        page = pdf_reader.getPage(i)
        conteudo = page.extractText()
        if conteudo != "":
            arq.write(conteudo)
        else:
            # No extractable text, so the page is probably a scanned image
            print("image")
With it I can read the PDF and write its contents to a TXT file, but I want to filter that content so only the numbers after the keyword are kept. Could someone help me?
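To make the goal concrete, here is a minimal sketch of the kind of filtering I have in mind, using a regular expression. The function name `extract_numbers_after_keyword`, the sample text, and the number format (digits optionally grouped with ".", "-" or "/", at most 20 non-digit characters after the keyword) are all assumptions for illustration; the real layout in the journal may differ.

```python
import re


def extract_numbers_after_keyword(text, keyword="downloaded state subscriptions"):
    """Return every number that follows `keyword` in `text` (hypothetical helper)."""
    # Allow any run of whitespace between the words of the keyword, since
    # PDF text extraction often inserts line breaks in the middle of phrases.
    keyword_pattern = r"\s+".join(re.escape(word) for word in keyword.split())
    # Assumption: up to 20 non-digit characters (":", spaces, etc.) may sit
    # between the keyword and the number, and the number is made of digits
    # that may be grouped with ".", "-" or "/".
    pattern = re.compile(
        keyword_pattern + r"\D{0,20}?(\d[\d./-]*\d|\d)",
        re.IGNORECASE,
    )
    return pattern.findall(text)


# Made-up sample text, not taken from the real PDF
sample = (
    "... downloaded state\nsubscriptions: 24.123.456-7 ... "
    "downloaded state subscriptions 98765"
)
print(extract_numbers_after_keyword(sample))
```

The idea would be to call this function on each page's `conteudo` and collect the results, instead of writing the whole page text to the file.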