I am writing a script that reads a PDF and rewrites its contents as text.
from StringIO import StringIO
from slate import PDF
from subprocess import Popen, PIPE, call
import uuid

# read the existing PDF (binary mode, since a PDF is not plain text)
url = "/tmp/arquivo.pdf"
with open(url, "rb") as arq:
    out = arq.read()

# new file to hold the parsed PDF
newfile = "/tmp/teste/" + str(uuid.uuid4()) + ".txt"
with open(newfile, "wb") as arq:
    arq.write(out)
But this is the output:
'%PDF-1.7\r\n%\xa1\xb3\xc5\xd7\r\n1 0 obj\r\n<>>>\r\nendobj\r\n2 0 obj\r\n<
The result was not what I expected. Someone told me about call (without explaining it) and about the Java library PDFBox, then gave me this code:
call(["java", "-jar", "/tmp/teste/pdfbox-app-2.0.3.jar", "ExtractText", out, newfile])
I tried to use it but could not; it fails right at "java". I tried calling "python" instead and that works, but it is not what I need.
I searched but did not find an example of calling Java this way. Can it be used like this?
I want readable text, with the PDF content written out in the right order (respecting columns, lines, etc.). How do I convert a PDF to text?
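For reference, here is a minimal sketch of how the call to PDFBox could be written. It assumes that the ExtractText tool takes the path of the input PDF and the path of the output text file (not the PDF's contents), that its -sort option is what affects the text order, and that a "java" executable is on the PATH. The function names build_extract_command and pdf_to_text are hypothetical, chosen only for this illustration.

```python
import subprocess


def build_extract_command(jar_path, pdf_path, txt_path, sort=True):
    """Build the argument list for PDFBox's ExtractText tool.

    Assumption: ExtractText is given file paths, so pdf_path is the
    location of the PDF on disk, not the bytes read from it.
    """
    command = ["java", "-jar", jar_path, "ExtractText"]
    if sort:
        # -sort asks PDFBox to order the text by position before writing it
        command.append("-sort")
    command.extend([pdf_path, txt_path])
    return command


def pdf_to_text(jar_path, pdf_path, txt_path):
    """Run PDFBox and return its exit code (0 means success)."""
    try:
        return subprocess.call(build_extract_command(jar_path, pdf_path, txt_path))
    except OSError:
        # subprocess raises OSError when the "java" executable cannot be started
        raise RuntimeError("could not start java; is a JRE installed and on PATH?")
```

With the variables from the script above, this would be called as pdf_to_text("/tmp/teste/pdfbox-app-2.0.3.jar", url, newfile). Whether -sort is enough to keep multi-column pages in reading order depends on the PDF itself.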
Oh, I managed to use that Java library. It is a pity that the output is still broken, with the first page almost entirely like this: "!" #$ % &". But anyway... I’ll try Slate!
– Vanessa Nunes