How to read a PDF

Question

Asked 8 years, 9 months ago

I am writing a script that takes a PDF and rewrites it as text.

from StringIO import StringIO
from slate import PDF
from subprocess import Popen, PIPE, call
import uuid

# read the existing PDF
url = "/tmp/arquivo.pdf"
with open(url, "r") as arq:
    out = arq.read()

# new file to receive the parsed PDF
newfile = "/tmp/teste/" + str(uuid.uuid4()) + ".txt"

with open(newfile, "wb") as arq:
    arq.write(out)

But this is the output:

'%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<

The result was not what I expected. Someone pointed me to call (without explaining it) and to Java's PDFBox, and then gave me this code:

call(["java", "-jar", "/tmp/teste/pdfbox-app-2.0.3.jar", "ExtractText", out, newfile])

I tried to use it but could not: it already fails with an error at "java". Calling "python" instead works, but that is not what I need.

I searched but did not find an example of calling Java this way. Can it be used like this?

I want readable text, with the PDF content written out in the right order (respecting columns, lines, etc.). How do I convert a PDF into text?

1 answer

by jsbueno • **30,668** points · Answer 1 · 2016-10-27T21:55:33+00:00

Reading a PDF is a much more complicated process than it sounds. If you just want to extract the text, the slate library that you are importing is what does that; the problem is that your attempt never actually calls slate.

Another point is that a PDF file should be opened for reading in binary mode, by passing "rb" as the mode to open. Otherwise, by default, the file is opened as text, and the automatic translation destroys the structure of the data read.
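To see the binary-mode point in isolation, here is a minimal sketch that does not use slate at all. It writes a few bytes resembling the start of a PDF to a temporary file and reads them back with "rb". The sample bytes and the helper names looks_like_pdf and read_bytes are made up for this illustration.

```python
import os
import tempfile

# Hypothetical sample bytes standing in for the beginning of a real PDF file.
SAMPLE = b"%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n1 0 obj\r\n"


def read_bytes(path):
    """Read a file as raw bytes, with no newline or encoding translation."""
    with open(path, "rb") as pdf_file:
        return pdf_file.read()


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as pdf_file:
        return pdf_file.read(5) == b"%PDF-"


# Create a throwaway file so the sketch runs anywhere.
fd, sample_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as sample_file:
    sample_file.write(SAMPLE)

raw = read_bytes(sample_path)
is_pdf = looks_like_pdf(sample_path)
os.remove(sample_path)
```

Read this way, the data comes back byte for byte as it was written, which is what a PDF parser needs as input.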

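from slate import PDF
from tempfile import mktemp
...

# temporary name for the text output
output_name = mktemp() + ".txt"

# open the PDF in binary mode ('rb') and the output file in text mode
with open(url, 'rb') as pdf_file, open(output_name, 'wt') as output:
    doc = PDF(pdf_file)  # slate parses the PDF into a list of page strings
    for page in doc:
        output.write(page + '\n')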

(The example of how to use Slate is in: https://pypi.python.org/pypi/slate)
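On the PDFBox route mentioned in the question: the call shown there passes out, which holds the contents of the file, where the ExtractText tool expects the path of the input PDF. Below is a rough sketch of how the call could be built with paths instead. The helper names build_extract_command and extract_text and the output path /tmp/teste/saida.txt are made up for illustration, and shutil.which assumes Python 3.3 or later. The check for a java executable covers only one possible cause of the error at "java", which is a guess, since the question does not show the error message.

```python
import shutil
import subprocess


def build_extract_command(jar_path, pdf_path, txt_path, java="java"):
    """Build the argument list for PDFBox's ExtractText command-line tool.

    All three file arguments are paths, not file contents.
    """
    return [java, "-jar", jar_path, "ExtractText", pdf_path, txt_path]


def extract_text(jar_path, pdf_path, txt_path):
    """Run ExtractText, failing early if no 'java' executable is on PATH."""
    java = shutil.which("java")
    if java is None:
        raise RuntimeError("no 'java' executable found on PATH")
    # check_call raises CalledProcessError if java exits with a non-zero status.
    subprocess.check_call(build_extract_command(jar_path, pdf_path, txt_path, java))


# Only build the command here; running it needs Java and the PDFBox jar.
cmd = build_extract_command(
    "/tmp/teste/pdfbox-app-2.0.3.jar",  # jar path taken from the question
    "/tmp/arquivo.pdf",                 # input PDF path taken from the question
    "/tmp/teste/saida.txt",             # hypothetical output path
)
```

Calling extract_text with the same three paths would then run the tool and write the extracted text to the output file.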