Convert PDF to text with Python

Question

Asked 8 years, 6 months ago

Viewed 4,205 times

5

I have a PDF file hosted on a website, and I would like to know how to extract its text and store it in a variable.

I already know how to access the site that has the PDF; my difficulty is converting the PDF into text, or simply copying its text.

I’m using Python 3.x.

1 answer


by Sidon • **6,563** points · Answer 1 · 2017-09-10T15:30:10+00:00

When a PDF's pages are stored as images, extracting the text requires an OCR package (keep in mind that these packages may not be 100% accurate). There are several of them in Python; for what you want, a very interesting one that works on Python 2.7 and 3.4 is textract.

Here is an example:

import textract
# textract.process returns bytes, so decode them to get a str
text = textract.process("orcamento.pdf").decode("utf-8")
print(text)

Clicar para incluir o cabeçalho

EXEMPLO DE ORÇAMENTO: Exemplos de Itens Detalhados
OBSERVAÇÃO : Este é somente um exemplo. Nem todos os orçamentos terão todos os exemplos listados abaixo. Favor usar somente os itens que dizem
respeito ao seu projeto proposto.

I. SALÁRIOS
Diretor Executivo
Diretor de Projeto
Contador
Editor Sênior
Editor

Salário Anual
5000
4000
2000
750
500

Porcentagem
50%
100%
50%
20%
45%

I used this PDF as an example. I copied only part of the result here, just for demonstration.

Notes:

In your case, you would first have to download the PDF to a local directory and then run the example above on the downloaded file.
To install textract on Python 3, see this link.
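For the download step mentioned in the notes, a minimal sketch could look like the following. It assumes the PDF is reachable by a plain URL without authentication and that textract is installed. The function names `download_pdf` and `pdf_to_text`, the file name `downloaded.pdf`, and the example URL are all hypothetical, chosen only for illustration.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_pdf(url: str, dest_dir: str) -> Path:
    """Save the file at `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir) / "downloaded.pdf"  # hypothetical file name
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def pdf_to_text(url: str) -> str:
    """Download the PDF at `url` and return its text as a str."""
    # Third-party package; imported here so download_pdf works without it.
    import textract

    with tempfile.TemporaryDirectory() as tmp:
        pdf_path = download_pdf(url, tmp)
        raw = textract.process(str(pdf_path))  # returns bytes
    return raw.decode("utf-8")


# Usage (placeholder URL, replace with the real address of the PDF):
# text = pdf_to_text("https://example.com/orcamento.pdf")
```

The temporary directory is removed once the text has been extracted, so only the resulting string is kept, which matches the goal of putting the text in a variable.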