Reading a PDF file in Python


0

I am trying to write a script that reads a PDF, identifies certain key terms, marks the pages that contain them, and splits those pages out. It should then merge them into a new summary PDF.

import PyPDF2
pdf_file = open('nomefatura.pdf','rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
page = pdf_reader.getPage(1)
text = page.extractText()
print(text)

But this is the output:

5G=AK;"A?" BCP;<>;<. -/78.7:-1Y. ,7.77-ZABT;"A?" SJB? @>? 6Y-ZABT;"A?" L=>GK==J;K ! N==" *=J;K ! N==" *=J;K ! N===" *=J;K ! 5%4FHGRCGP$D6DGD54RHDG636VC3C7CC7C -1+>7C3CC -1+>C3CC5%4FHGRCG%4C$8D6DGD54RHG64%CGC36G64%CGC3V56GG%6F635D7R3FG -1+>7C3CC -1+>C3CC5%4FHGRCG%44$8D6DGD54RHFR%DHV3H4R%DVC3C6GH563GC7R3GCG -1+>C3CC -1+>C3CC -1+>C3CC%C3CC%CC%CHCGRCG%46$8D6DGDGDGD54RH54R%VG54R%VVG4R%VVGC6%%1+>7C3CC -1+>C3CC5%4FHGRCG%4R$8D6DGD54RHH6D%D653VC6D%DRC3GC6GF%4C53FR7R3R3FG -1+>7C3CC -1+>C3CCNO$ Pj#DV3C4N@ (Tpj#66G3D4(*#+10,I2*0,1-#f#T02#46%5D4E6C46&Y#. -,1)2_<2bc0Y#+-#NO$ #g4kh#0#NO@ (T#gC3Gkh#,=-#Y=-#)0Z+YY+I+Y#fY#1+)2\+Y%4? LBSB? @>? <"3GAB>BR;<"?" A="Y=J="JBTG? (")'))"]F9"$H')"RB=">? J? L;@?" LBO;"S;C"044%3>? @ABC? @>;"=;"SJB? @>?" 0./D"E9FF";G"9)H#%@+#/-)+#I0#. -*Z>01+)#Y<+Y#>2;+bc0Y#I0#>-,;+#I2Y1l,. 2+#'-. m#Z)0.2Y+#I2;21+)#-#. nI2;-#I0#<*+#-Z0)+I-)+#o<0#Z)0Y10#0Y10#Y0)'2b-#,+#Y<+#)0;2=-%#? -,/0b+#1-I-Y#0#\+b+#Y<+#0Y. ->/+B !" #$%&''#$%&#7# -I-#-#9)+Y2>#p4G#7# 0>0\q,2.+ #7# -I-#-#9)+Y2>#p64#7#? >+)-#7# -I-#-#9)+Y2>#pR4# 0>0+)#7# -I-#-#9)+Y2>#p4D#7#L2#7#$83#8P3#$? 3#P$3#"$3#9&3#$83#"M3#ML3#PA#p5G#7#W2Z^+r#7#? nI2;-#,+. 2-,+>#DRp#46#7#&>;+)#7#"M#gY01-)#Rh3#$8#gY01-)#RRh3#"$#gY01-)#66h3#ML#gY01-)#6Gh3#pV4#7#! 8#? LP8#7# -I-#-#9)+Y2>#pHG#7# 0>0.-#FG#7#? nI2;-#,+. 2-,+>#FG3#pdv#7#?+_)2I;0#7#$8#gY01-)#R4hp#6F#7#! Q #9)+Y2>#7#$83#PA3#"M3#8P3#P$#p#8P76D#7#$0).-10>9+,. -Y#. -,'0,2+I-YB#s#9&$&#7#9+,. -#I+#&+Jq,2+#s#9+,. -#I-#@-)I0Y10#s#9&@($ ($#t#9+,. -#I-#(Y1+I-#I-#(YZe)21-#$+,1-#s#$+,1+,I0)#s#9&@8&P&#t#9+,. -#I-#(Y1+I-#I-#8+)X#s#9&@($(#t#9+,. -#I-#(Y1+I-#I0#$0);2Z0#s#9P9#t#9+,. -#I0#9)+Ye>2+#s#9+,. -#! @ (P#s#O@! 8P!" (#s#? (? P(Q#s#9)+I0Y.-#s#! 1+a#s#9+,. -#"0).+ ,12>#s#9+,. -#$+)+#s# )2_+,. -#s#9&@$!? P(Q#s#9+,. --_#s#8+;NX.2>#s#9+,. -#I-#9)+Y2>#s#?+ 2]+#(. -,q*2.+ #N0I0)+>#s#9+,)2Y<>#s#&,+10>#4RR4N+1<)+N+1<)+#I0#8+;+0,1-B#D6DGFFR5HF !"#$%&%&'%()+,-#"+)./01123456#7#8+)10#9#:;<+#9)+,.+ #7#$=-#8+<>-7$8? @8AB#C6%D64%D64ECCF76F#7#! %(%B#44G%FCH%CFG%44C? >20.10B#5%4FHGRCG? 8NE? 
@8AB#4CDCDH6CCCC4GD(*2YY=-B#C6ECGE6C6C#8-Y1+;0*B#4DECGE6C6CP0 0)m,. 2+B#"&! E6C#80)e-i-B#C4ECDE6C#+#RCECDE6CQ`_21-#+<1-*X12.-B#CCCCCV4RRRG4HR5GC4G8X;2,+#6#I0#6

I did a lot of research and couldn’t find a solution yet.

As far as I understand, the PDF cannot be read as plain text because it is binary, hence the 'rb' mode; opening it in plain read mode does not work either.

(image: enter image description here)

Attempts to convert:

print(text.encode('latin-1'))
print(text.encode('utf-8'))
print(text.encode('ISO-8859-1'))
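
A note on why these attempts do not change anything: `str.encode` only decides how characters that were already extracted are serialized to bytes. It cannot recover characters that were extracted wrongly in the first place. The likely cause here (an assumption, not confirmed by the thread) is that the PDF's fonts lack a usable character mapping, so the extractor returns the wrong characters. A minimal sketch using a short sample of the garbled output shown above:

```python
# A short sample copied from the garbled output in the question
garbled = '5G=AK;"A?" BCP;<>;<.'

for codec in ("latin-1", "utf-8", "ISO-8859-1"):
    encoded = garbled.encode(codec)
    # Decoding with the same codec gives back exactly the same garbled text,
    # so the encoding step is not where the characters go wrong
    assert encoded.decode(codec) == garbled
    print(codec, encoded)
```

Because the sample is plain ASCII, all three codecs produce the same bytes; the problem has to be fixed at extraction time, for example by using a different extraction library.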

2 answers

0

This is the solution I found for reading the PDF without trouble:

import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

rsrcmgr = PDFResourceManager()
retstr = io.StringIO()  # collects the extracted text of every page
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

# The PDF has to be opened in binary mode
with open("C:\\repositorio\\teste_pdf\\pdf_split\\fatura.pdf", 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)

device.close()
# Read the buffer once, after the loop, so earlier pages are not printed again
data = retstr.getvalue()
print(data)
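
For the original goal (find the pages that contain certain key terms and collect them into a new summary PDF), the steps could be sketched roughly as follows. This is only a sketch under assumptions: the function names and the keywords are hypothetical, text is extracted with pdfminer.six's `extract_pages`, and the pages are copied with the legacy PyPDF2 API used in the question (PyPDF2 below 3.0).

```python
def find_key_pages(page_texts, keywords):
    """Return the 0-based indexes of pages whose text contains any keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        index
        for index, text in enumerate(page_texts)
        if any(k in text.lower() for k in lowered)
    ]


def extract_page_texts(path):
    """Extract the text of each page with pdfminer.six, one string per page."""
    # Imported here so find_key_pages can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    texts = []
    for layout in extract_pages(path):
        texts.append(
            "".join(el.get_text() for el in layout if isinstance(el, LTTextContainer))
        )
    return texts


def write_summary(src_path, dst_path, page_indexes):
    """Copy the selected pages of src_path into a new PDF at dst_path."""
    from PyPDF2 import PdfFileReader, PdfFileWriter  # legacy API, PyPDF2 < 3.0

    writer = PdfFileWriter()
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        reader = PdfFileReader(src)
        for index in page_indexes:
            writer.addPage(reader.getPage(index))
        # Write while the source is still open: pages are read lazily.
        writer.write(dst)
```

Usage would be along the lines of `write_summary(path, "resumo.pdf", find_key_pages(extract_page_texts(path), ["total"]))`, where the output name and the keyword are placeholders. Keeping the keyword matching separate from the PDF handling makes it easy to check the matching on plain strings first.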

-1

I looked at your code and found nothing wrong with it compared to what I wrote:

import PyPDF2

# The PDF has to be opened in binary mode
pdf_file = open('/home/Unix-power/Documents/Devs/PDF_/P1.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)

# Pages are numbered from 0
page = read_pdf.getPage(0)
page_content = page.extractText()

pdf = page_content.encode('utf-8')
print(pdf)

The image in your post suggests that the problem lies in the encoding of the PDF being read, not in the code you wrote. Have you tried a different PDF from this one?

  • Yes, I have; some PDFs are read correctly. I just forgot to come back and comment here: I found another solution for what I needed to do.
