How to extract a specific piece of text from a PDF with Python

Hello, I need to extract a specific snippet from a PDF. The goal is to find the cancelled state registrations ("inscrições estaduais baixadas") published in the official gazette of the state where I live; I am collecting this data for my TCC (final course project). That's why I'm writing code that opens the PDF and pulls out the state registrations. The PDF always contains the keyword "inscrições estaduais baixadas" followed by the registration number, and the idea is to extract that number.

So far I've been able to write this code:

import PyPDF2 as p2

# Legacy PyPDF2 (< 3.0) names; newer releases use PdfReader, reader.pages and extract_text()
with open('DOEAL-25_09_2020-COMPLETO.pdf', 'rb') as pdf, \
        open("teste_de_pdf.txt", 'a', encoding='utf-8') as arq:
    pdf_reader = p2.PdfFileReader(pdf)
    n = pdf_reader.numPages

    for i in range(n):
        print('Page {}'.format(i + 1))
        conteudo = pdf_reader.getPage(i).extractText()
        if conteudo != "":
            # Write only pages that actually produced text
            arq.write(conteudo)
        else:
            # No extractable text: the page is probably a scanned image
            print("image")

With it I can read the PDF and export its contents to a TXT file, but I want to filter the content. Could someone help me?

1 answer

Use the re module:

import re

texto = """inscrições  estaduais baixadas
123456 pag 32.1
lore inscrições estaduais baixadas 123123 gj  inscrições estaduais
 baixadas 111111  quam efficitur dignissim. Nam non 222 tortor
nisl. inscrições estaduais baixadas 777777  Vivamus sit amet
number: 2  felis sit amet leo mattis inscrições estaduais baixadas
666666"""

print(re.findall(r'baixadas\s(\d+)', texto))
# > ['123456', '123123', '111111', '777777', '666666']
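
To connect this to the loop in the question, the same idea can be applied to each page's text as it is extracted. What follows is only a minimal sketch, not something tested against the actual gazette. The helper names find_registrations and registrations_in_pdf are hypothetical. It assumes a PyPDF2 release that provides PdfReader and extract_text() (2.x or later), and it assumes the extracted text keeps the keyword and the number next to each other, which depends on how the PDF was produced. The pattern also matches the whole keyword with flexible whitespace, so a line break in the middle of the phrase does not hide a result.

```python
import re

# Match the full keyword with flexible whitespace (spaces or line breaks),
# then capture the digits that follow. The phrase is taken from the sample
# text above; adjust it if the gazette uses a different wording.
PATTERN = re.compile(r'inscrições\s+estaduais\s+baixadas\s+(\d+)', re.IGNORECASE)


def find_registrations(text):
    """Return every number that directly follows the keyword in `text`."""
    return PATTERN.findall(text)


def registrations_in_pdf(path):
    """Hypothetical helper: collect the numbers from every page of a PDF."""
    # Imported here so the regex part can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    numbers = []
    reader = PdfReader(path)
    for page in reader.pages:
        # Scanned (image-only) pages may yield no text at all.
        numbers.extend(find_registrations(page.extract_text() or ""))
    return numbers
```

Calling registrations_in_pdf('DOEAL-25_09_2020-COMPLETO.pdf') would then return a list of strings. Because the search runs page by page, a keyword whose number falls on the next page would be missed; joining all page texts before searching is one way around that.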
