Convert PDF to text with Python

I have a PDF file hosted on a website, and I would like to know how to extract its text and store it in a variable.

I already know how to access the site with the PDF; my difficulty is converting the PDF to text, or simply copying its text.

I’m using Python 3.x

1 answer

If the PDF is a scanned image, extracting its text requires an OCR package (keep in mind that OCR is not 100% accurate); if the PDF has a text layer, the text can be read directly. There are several packages for this in Python. For what you want, a very interesting one that works with Python 2.7 and 3.4 is textract.

Here is an example, followed by part of its output:

import textract

# textract.process returns bytes, so decode them to get a str
text = textract.process("orcamento.pdf").decode("utf-8")
print(text)

Clicar para incluir o cabeçalho

EXEMPLO DE ORÇAMENTO: Exemplos de Itens Detalhados
OBSERVAÇÃO : Este é somente um exemplo. Nem todos os orçamentos terão todos os exemplos listados abaixo. Favor usar somente os itens que dizem
respeito ao seu projeto proposto.

I. SALÁRIOS
Diretor Executivo
Diretor de Projeto
Contador
Editor Sênior
Editor

Salário Anual
5000
4000
2000
750
500

Porcentagem
50%
100%
50%
20%
45%

I used this PDF as an example. I copied only part of the result, just for demonstration.

Notes:

  • In your case, you would have to download the PDF to a local directory and then run the example on it.
  • To install on Python 3, see this link.
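
The download-then-extract step from the notes above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `download_pdf` and `pdf_to_text` and the URL in the usage comment are hypothetical, and it assumes textract and its PDF backend are already installed.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pdf(url, dest):
    """Save the file at `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def pdf_to_text(path, encoding="utf-8"):
    """Extract the text of a local PDF with textract and return it as a str."""
    # Third-party package; imported here so download_pdf works without it.
    import textract

    raw = textract.process(str(path))  # textract returns bytes
    return raw.decode(encoding)


# Usage (hypothetical URL, replace it with the real address of the PDF):
# pdf_path = download_pdf("https://example.com/orcamento.pdf", "orcamento.pdf")
# text = pdf_to_text(pdf_path)
```

With this, `text` is an ordinary string held in a variable, which is what the question asks for.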
  • textract.exceptions.ShellError: The command `pdftotext oi.pdf -` failed with exit code 127 ------------- stdout ------------- ------------- stderr -------------

  • That is the error I get :(

  • @Luanpedro It would help to show the context in which this happens. How about posting a fragment of the code?

  • My code: https://pastebin.com/pdMbxhXa File used: https://www.sendspace.com/file/08khoa Error: http://prntscr.com/kwg16g My question about this: https://pt.stackoverflow.com/questions/329889/pegar-valores-emboletim-usando-python-ocr

  • @Luanpedro Ugh... chasing links to images is complicated; better to post the details here. But if I find the time, I will check. :-)
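
Regarding the exit code 127 reported in the comments: in a shell, that code conventionally means the command was not found. A likely cause (an assumption, since the stderr output shown is empty) is that the `pdftotext` binary, which is part of Poppler, is not installed or not on PATH. A minimal sketch to check this before calling textract, where the helper name `find_missing_tools` is hypothetical:

```python
import shutil


def find_missing_tools(tools=("pdftotext",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("pdftotext is available")
```

If `pdftotext` is reported as missing, installing Poppler (for example the `poppler-utils` package on Debian or Ubuntu) should make it available.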
