How to work with multiple encodings in PDF metadata?

I have a set of over 1,000 PDFs from which I need to extract metadata. The problem is that the PDFs use different encodings. The first example worked when I decoded with UTF-8. The second example raised an error. The code is Python 3:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

First example, which worked:

def decode_str(string):
    return string.decode("utf8")

fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

print(dados_recuperados)

Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}

Second example, which raised an error:

def decode_str(string):
    return string.decode("utf8")

fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-4826dcab6968> in <module>()
      6 doc = PDFDocument(parser)
      7 dados_recuperados = doc.info[0]
----> 8 author = decode_str(dados_recuperados.get("Author"))
      9 subject = decode_str(dados_recuperados.get("Subject"))
     10 creation_date = decode_str(dados_recuperados.get("CreationDate"))

<ipython-input-21-4826dcab6968> in decode_str(string)
      1 def decode_str(string):
----> 2     return string.decode("utf8")
      3 
      4 fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
      5 parser = PDFParser(fp)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

print(dados_recuperados)

Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114344-07'00'", 'ModDate': b"D:20170314114344-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}

Is there a way to convert dados_recuperados = doc.info[0] to a standard encoding? Or to test each string before decoding it, to know which codec to use?

  • Convert to a default codec? What is the default codec? Your application's default? Do you want to detect each string's codec and check whether it needs to be decoded from UTF-8, so that if the PDF is UTF-8 it is converted to Latin-1, and otherwise it is left as it is? Is that it?

  • Hello, thank you. Yes, that's it. But note that the metadata in the second example shows a crazy character set, yet when I open the file in Acrobat the metadata displays correctly. So I think I also need to know which codec the second case uses.

  • For example, the Author field: print(Author) b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r'

  • They are not crazy; different codecs are being mixed. You have to define which default codec your Python program uses (usually by declaring it in the script's encoding header) so that I know which target you want. As I asked in my previous comment: do you want to detect each string's codec and check whether it needs to be decoded from UTF-8, so that if the PDF is UTF-8 it is converted to Latin-1, and otherwise it is left as it is? Is that it?

  • That's not "crazy", those are escape sequences. Give me a minute and I'll tell you how to fix it; there's a link on the site about that.
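For reference, the b'\xfe\xff' prefix on the values that fail is the UTF-16 big-endian byte order mark, which PDF uses to flag Unicode text strings. A minimal sketch, using the Title and Creator bytes copied from the output in the question (the variable names are only illustrative), shows that Python's built-in "utf-16" codec reads the mark and decodes them:

```python
# Bytes copied from the question's output. The leading b'\xfe\xff' is the
# UTF-16 big-endian byte order mark (BOM).
title_raw = b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O'
creator_raw = (b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t'
               b'\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000')

# The "utf-16" codec consumes the BOM and uses it to pick the byte order.
title = title_raw.decode("utf-16")
creator = creator_raw.decode("utf-16")

print(title)
print(creator)
```

Fields without the prefix, such as Subject in both examples, are plain single-byte strings, which is why decoding them as UTF-8 happened to work.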

1 answer

What you actually want is to read strings in various encodings and convert them all to the encoding your Python script uses, which is probably something compatible with Latin-1. I recommend declaring that encoding explicitly in your script before anything else, because if you run the same script on another machine the default of the terminal or cmd may be completely different.

You can choose the encoding you want. Say you only want to use UTF-8; then add this at the top of your .py file:

# -*- coding: utf-8 -*-

If you want to use Latin-1 instead, add this:

# -*- coding: latin1 -*-

Coming back to the question: as I said, you probably want to convert any kind of encoding to the current system encoding. The answer at https://stackoverflow.com/a/15918519/1518921 already helps with this. The script looks like this:

Add this at the top of your script:

import sys
import cchardet

If you don't have the cchardet module installed, download it from https://pypi.python.org/pypi/cchardet

Then create this function:

def str_decode(str):
    # Check the codec of the current system (the "default" codec)
    defaultcodec = sys.getdefaultencoding().lower()

    codec = cchardet.detect(str)['encoding']

    if (defaultcodec != codec.lower()):
        return str.decode(codec) # If the codec differs from the system's, decode
    else:
        return str # If the codec is the system's, keep the value as it is

Your code should then look like this:

dados_recuperados = doc.info[0]
author = str_decode(dados_recuperados.get("Author"))
subject = str_decode(dados_recuperados.get("Subject"))
creation_date = str_decode(dados_recuperados.get("CreationDate"))
mod_date = str_decode(dados_recuperados.get("ModDate"))

Note that in str.decode(codec) the value of codec is the one returned by cchardet.detect(str)['encoding']. This should work well, but there is no guarantee that a PDF document uses only one codec or that its strings are 100% correct. Some documents may have problems, and results will vary from case to case.

If you have set # -*- coding: xxxxxxx -*-, you can adjust the function to:
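Since every failing value in the question begins with b'\xfe\xff', an alternative that avoids detection altogether might be to check for that byte order mark directly. This is only a sketch under stated assumptions: decode_pdf_string is a hypothetical helper name, and the UTF-8-then-Latin-1 fallback for strings without the mark is an assumption (PDF's own single-byte encoding, PDFDocEncoding, differs from Latin-1 for some code points).

```python
import codecs


def decode_pdf_string(raw):
    """Decode one metadata value (bytes) to str. Hypothetical helper."""
    # Values flagged with the UTF-16 big-endian BOM are Unicode strings;
    # the "utf-16" codec consumes the BOM and picks the byte order from it.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")
    # Assumption: anything else is tried as UTF-8 first (plain ASCII always
    # succeeds here), then as Latin-1, which accepts every byte value.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# Values shaped like the ones in the question.
print(decode_pdf_string(b'Ivanete de Araujo Costa'))
print(decode_pdf_string(b'\xfe\xff\x00J\x00\xfa\x00n\x00i\x00o\x00r'))
```

Unlike a statistical detector, this never guesses on short strings, but it will mis-decode any file whose producer wrote some other single-byte encoding.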

# Set the codec you chose as the "default" codec
defaultcodec = 'xxxxxxx'

Here xxxxxxx is the codec you want as the default.

  • Thank you very much. It raised this error:

  • ----> 3 defaultcodec = sys.getdefaultencoding().lower() 4 5 codec = cchardet.detect(str)['encoding'] NameError: name 'sys' is not defined

  • @Reinaldochaves You need to import the sys module. I'll edit the answer.

  • @Reinaldochaves Done, edited. I also explained where you can download cchardet if it isn't on your system.

  • Thanks! I had already installed cchardet. It worked almost perfectly. Now I've found that some metadata fields are empty in some PDFs; I'll try adding an if for that.
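On the empty fields mentioned in the last comment: dict.get returns None for a missing key, and calling .decode on None raises AttributeError, so one possible guard is to check the value before decoding. A minimal sketch; decode_or_none and extract_fields are hypothetical names, sample_info stands in for doc.info[0], and the Latin-1 fallback for values without a byte order mark is an assumption.

```python
def decode_or_none(raw):
    """Return the decoded text, or None when the field is missing or empty."""
    if not raw:  # covers both None (missing key) and b'' (empty value)
        return None
    if raw.startswith(b'\xfe\xff'):  # UTF-16 big-endian byte order mark
        return raw.decode("utf-16")
    return raw.decode("latin-1")  # assumption: single-byte fallback


def extract_fields(info, keys=("Author", "Subject", "CreationDate", "ModDate")):
    """Build a dict of decoded fields, tolerating absent keys."""
    return {key: decode_or_none(info.get(key)) for key in keys}


# Stand-in for doc.info[0]: Author is empty and both dates are absent.
sample_info = {"Author": b"", "Subject": b"EMD ADI - Emenda Aditiva"}
fields = extract_fields(sample_info)
print(fields)
```

Keeping None for absent fields, rather than an empty string, makes it easy to count afterwards how many of the PDFs lack each field.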
