Scraping in Python - read pdf

6

I wrote a scraper in Python that takes the URL of any PDF, reads it, and returns the text, but for some PDFs the output contains characters like this:

".\nO xc3 xb3rg xc3 xa3o tamb xc3 xa9m divulga result nGH xc3 x80QLWLYR x03GRV x03FDQGLGDWRV x03TXH x03VH x03GHFODUDP x03FRP x03GH xc3 x80FLrQFLD x03H x03GRV x03SHGLGRV x03 nde special service granted. The competition is aimed at providing of 150 vacancies for the class (Class A) of delegate Pol xc3 xadcia Civil, whose vacancies are xc3 xa3o n nprovidas as a order of clasVL xc3 x80FDomR x03H x03D x03QHFHVLGDGH x03GR x03VHUYLoR X11 on"

From what I can see, this happens when the document contains an accented character, a column, or even a dash.

I also noticed that if the PDF contains an image, the output has strange characters too. Does anyone have a solution or an idea that could help?

  • 1

    Have you tried .unicode('utf-8') (utf8)? I don't really remember the exact call...

  • Guys, thanks for the help. Encoding and decoding really does fix these characters as UTF-8. But part of the text still does not come out right, for example excerpts like "GHFODUD".

2 answers

3

Another alternative is to encode the string as Latin-1 with str.encode and then decode the resulting bytes as UTF-8 with bytes.decode. For example:

print("\xc3\xb3".encode('latin1').decode('utf-8'))  # ó

In your case, do this:

print(texto.encode('latin1').decode('utf-8'))

Here texto is the variable holding the text you want to encode and decode.
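A slightly more defensive variant of the same round trip is sketched below. The helper name reparar_texto is hypothetical, and returning the input unchanged on failure is an assumption about what is useful here. The round trip raises an exception when the string contains characters outside Latin-1, or when the resulting bytes are not valid UTF-8, so the helper catches both cases.

```python
def reparar_texto(texto):
    """Undo UTF-8 bytes that were read as Latin-1; return the input if that fails."""
    try:
        return texto.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Hypothetical fallback: the text is not mojibake of this kind, so keep it as-is.
        return texto


print(reparar_texto("Pol\xc3\xadcia"))  # Polícia
print(reparar_texto("Polícia"))         # Polícia (already correct, returned unchanged)
```

This keeps a single bad string from aborting the whole scrape, at the cost of silently passing through text that could not be repaired.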

Result for the text from the question:

O órgão também divulga resultado

GHÀQLWLYRGRVFDQGLGDWRVTXHVHGHFODUDUDPFRPGHÀFLrQFLDHGRVSHGLGRV
de atendimento especial deferidos.
O concurso visa o provimento
efetivo de 150 vagas para a classe
inicial (Classe A) do cargo de delegado de Polícia Civil, cujas vagas serão

providas conforme a ordem de clasVLÀFDomRHDQHFHVVLGDGHGRVHUYLoR
A
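The runs that stay unreadable after the round trip, such as "GHFODUD", do not look like an encoding problem. In this sample, adding 29 to each character code turns them back into words ("GHFODUD" becomes "declara"), which suggests the extractor emitted raw glyph codes for a font without a usable character map. The sketch below only illustrates that observation. The function name and the offset of 29 are assumptions inferred from this one sample, not a general rule, and accented letters or ligatures would need their own lookup table.

```python
def desfazer_deslocamento(fragmento, deslocamento=29):
    """Shift each character code by a fixed offset (assumed to be 29 for this sample)."""
    resultado = []
    for caractere in fragmento:
        codigo = ord(caractere) + deslocamento
        if 32 <= codigo <= 126:
            # The shifted code is printable ASCII, so use it.
            resultado.append(chr(codigo))
        else:
            # Outside printable ASCII: leave the character untouched.
            resultado.append(caractere)
    return "".join(resultado)


print(desfazer_deslocamento("GHFODUD"))  # declara
print(desfazer_deslocamento("FRP"))      # com
```

Apply it only to the fragments that are still garbled. Running it over text that is already readable would corrupt it.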

2

Using Python and pdfminer (pdfminer3k for Python 3), I implemented PDF reading through the following class:

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


class LeitorPdf:
    def __init__(self):
        self.resource_manager = PDFResourceManager(caching=False)
        # Extracted text accumulates in this in-memory buffer.
        self.output_stream = io.StringIO()
        self.device = TextConverter(self.resource_manager, self.output_stream, laparams=None)

    def extrair_texto(self, file_name):
        # Open in binary mode and close the file once all pages are processed.
        with open(file_name, 'rb') as fp:
            process_pdf(self.resource_manager, self.device, fp, set(), maxpages=0, password='', caching=False, check_extractable=True)
        return self.output_stream.getvalue()

The PDF must already be saved to disk before it can be read.

Usage:

texto = LeitorPdf().extrair_texto(nome_arquivo)
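Since the question starts from a URL and the class reads from a file path, the download step could look roughly like the sketch below. It uses only the standard library. The function name baixar_pdf is hypothetical, and saving to a temporary file is an assumption about where the PDF should go.

```python
import shutil
import tempfile
import urllib.request


def baixar_pdf(url):
    """Download the PDF at `url` to a temporary file and return the file's path."""
    with urllib.request.urlopen(url) as resposta, \
            tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as destino:
        # Stream the response body to disk without loading it all into memory.
        shutil.copyfileobj(resposta, destino)
        return destino.name
```

The returned path can then be passed as nome_arquivo to extrair_texto. The temporary file is not deleted automatically, so remove it when it is no longer needed.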
