I have a set of over 1,000 PDFs from which I need to extract the metadata. The problem is that the PDFs use different codecs.
The first example worked when I decoded with utf8. The second example raised an error. The code is Python 3:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
First example, it worked:
def decode_str(string):
    return string.decode("utf8")
fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))
print (dados_recuperados)
Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}
Second example, which raised an error:
def decode_str(string):
    return string.decode("utf8")
fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-21-4826dcab6968> in <module>()
6 doc = PDFDocument(parser)
7 dados_recuperados = doc.info[0]
----> 8 author = decode_str(dados_recuperados.get("Author"))
9 subject = decode_str(dados_recuperados.get("Subject"))
10 creation_date = decode_str(dados_recuperados.get("CreationDate"))
<ipython-input-21-4826dcab6968> in decode_str(string)
1 def decode_str(string):
----> 2 return string.decode("utf8")
3
4 fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
5 parser = PDFParser(fp)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
print (dados_recuperados)
Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114344-07'00'", 'ModDate': b"D:20170314114344-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}
Is there a way to convert dados_recuperados = doc.info[0] to a standard codec? Or to test each string before decoding it, to know which codec to use?
Convert to a default codec? Which default codec? Your application's default? Do you want to detect the codec and check whether decoding from UTF-8 is needed, so that if the PDF is UTF-8 it is decoded to Latin, and otherwise it is left as is? Is that it?
– Guilherme Nascimento
Hello, thank you. Yes, that's it. But note that the metadata in the second example shows a crazy character set, yet when I open the file in Acrobat the metadata displays correctly. So I think I also need to know which codec is used in the second case.
– Reinaldo Chaves
For example, the Author field: print(Author) b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r'
– Reinaldo Chaves
They are not crazy; they are different codecs mixed together. You have to state which default codec your Python program is using (generally set via the charset declaration in the script header) so that I know which standard you want, as I asked in the previous comment: do you want to detect the codec and check whether decoding from UTF-8 is needed, so that if the PDF is UTF-8 it is decoded to Latin, and otherwise it is left as is? Is that it?
– Guilherme Nascimento
That's not "crazy", those are escapes. Wait a minute and I'll tell you how to fix it; there's a link on the site about that.
– Guilherme Nascimento
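One way to test each value before decoding it, as the question asks, is to look at the first bytes. The failing values all start with b'\xfe\xff', which is the UTF-16 big-endian byte order mark, while the values that decoded fine are plain single-byte text. The following is only a minimal sketch under assumptions: decode_pdf_string is a hypothetical helper name, the UTF-8 BOM branch is an extra guess for newer files, and latin-1 is used as an approximation of the PDF single-byte encoding, so a few unusual characters may come out differently.

import codecs

def decode_pdf_string(raw, fallback="latin-1"):
    """Decode one PDF metadata value (bytes) to str by checking its prefix.

    Hypothetical helper: 'fallback' approximates the PDF single-byte
    encoding with latin-1, which is an assumption, not an exact match.
    """
    if raw is None:
        # dict.get() returns None when the field is missing
        return None
    if raw.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        # Drop the 2-byte mark, then decode the rest as UTF-16 big-endian
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    # No mark: treat the value as single-byte text
    return raw.decode(fallback, errors="replace")

# Sample values shortened from the output shown in the question
info = {
    "Author": b"\xfe\xff\x00D\x00a\x00v\x00i",
    "Subject": b"EMD ADI - Emenda Aditiva",
}
decoded = {key: decode_pdf_string(value) for key, value in info.items()}
print(decoded)

With the real files, the same function could replace decode_str, for example author = decode_pdf_string(dados_recuperados.get("Author")), so one code path would handle both kinds of value.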