I wrote this code in Python 3 to read the metadata of a PDF:
>>> from pdfminer.pdfparser import PDFParser
>>> from pdfminer.pdfdocument import PDFDocument
>>> fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
>>> parser = PDFParser(fp)
>>> doc = PDFDocument(parser)
>>> print(doc.info)
It produces the following output:
[{'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}]
Does anyone know how to store each of these values in its own variable? For example, in the case above I would like to get:
a = "Ivanete de Araujo Costa" (Author field)
b = "EMD ADI - Emenda Aditiva" (Subject field)
c = "D:20170314114321-07'00'" (CreationDate field)
d = "D:20170314114321-07'00'" (ModDate field)
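One possible way to do this, as a minimal sketch: `doc.info` is a list holding one dictionary, so each field can be read by key and its bytes decoded to text. The `info` literal below is copied from the output above so the example runs on its own; `metadata` is an illustrative name, and decoding with `latin-1` is an assumption that suits these plain fields but not the ones that start with `\xfe\xff`.

```python
# Sample values copied from the output above; in the real script this would be doc.info.
info = [{
    'Author': b'Ivanete de Araujo Costa',
    'Subject': b'EMD ADI - Emenda Aditiva',
    'CreationDate': b"D:20170314114321-07'00'",
    'ModDate': b"D:20170314114321-07'00'",
}]

# The list has a single element: the metadata dictionary.
metadata = info[0]

# .get() with a default avoids a KeyError when a PDF lacks a field.
# Assumption: these fields are plain single-byte text, so latin-1 is enough.
a = metadata.get('Author', b'').decode('latin-1')
b = metadata.get('Subject', b'').decode('latin-1')
c = metadata.get('CreationDate', b'').decode('latin-1')
d = metadata.get('ModDate', b'').decode('latin-1')

print(a, b, c, d, sep='\n')
```

Fields missing from a given PDF come back as empty strings here, which may be easier to handle in a loop over many files than an exception.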
Thank you! One of the files raised an encoding error. I put it here: https://docs.google.com/document/d/1byd6Z8hHU4CBMU9ozCSJxRNLlk55iy0_pVlFX3tp7kY/edit?usp=sharing Unfortunately, it seems my files use different encodings. Do you know if there is a solution?
– Reinaldo Chaves
Is your script in a public IPython notebook or on GitHub?
– Marlysson
It is here: https://github.com/reichaves/reftrab
– Reinaldo Chaves
It is now on GitHub, along with the error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
– Reinaldo Chaves
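The byte 0xfe in that error is most likely the first byte of the UTF-16 big-endian byte order mark (`\xfe\xff`), which the PDF format uses to flag Unicode text strings, so decoding those values as UTF-8 fails. Below is a minimal sketch of a decoder that checks for the mark first. `decode_pdf_string` is a hypothetical helper, not a pdfminer function, and the `latin-1` fallback only approximates PDFDocEncoding.

```python
import codecs


def decode_pdf_string(raw):
    """Decode one PDF metadata value to str (hypothetical helper)."""
    if isinstance(raw, str):
        # Already text; nothing to do.
        return raw
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Values that begin with b'\xfe\xff' are UTF-16 big-endian; drop the mark.
        body = raw[len(codecs.BOM_UTF16_BE):]
        return body.decode('utf-16-be', errors='replace')
    # Assumption: everything else is single-byte text close enough to Latin-1.
    return raw.decode('latin-1')


# Values copied from the output shown in the question.
title = decode_pdf_string(b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O')
author = decode_pdf_string(b'Ivanete de Araujo Costa')

print(title)
print(author)
```

Applying the same function to every value, for example `{key: decode_pdf_string(value) for key, value in doc.info[0].items()}`, should cope with files that mix both kinds of string, provided every value is bytes or str.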