How to isolate metadata with pdfminer?

I wrote this code in Python 3 to read the metadata of a PDF:

>>> from pdfminer.pdfparser import PDFParser
>>> from pdfminer.pdfdocument import PDFDocument
>>> fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
>>> parser = PDFParser(fp)
>>> doc = PDFDocument(parser)
>>> print(doc.info)

It produces this output:

[{'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}]

Does anyone know how to isolate these results into variables? For example, in the case above I would like to get:

a = "Ivanete de Araujo Costa" (Author field)
b = "EMD ADI - Emenda Aditiva" (Subject field)
c = "D:20170314114321-07'00'" (CreationDate field)
d = "D:20170314114321-07'00'" (ModDate field)

1 answer

Since you can already retrieve the PDF metadata, the rest is easy. The structure you got back is a dictionary (wrapped in a list), where each key points to a value. You read a value with get, whose syntax is as follows:

dictionary.get("key", "default_value")
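
As a small illustration of get, here is a sketch using two of the values from the output in the question. The "Keywords" key is a hypothetical example of a field that is not present:

```python
# Sample values copied from the output shown in the question.
info = {
    "Author": b"Ivanete de Araujo Costa",
    "Subject": b"EMD ADI - Emenda Aditiva",
}

# An existing key returns its stored value.
author = info.get("Author", b"")

# A missing key returns the default instead of raising KeyError.
# "Keywords" is a hypothetical key used only for illustration.
keywords = info.get("Keywords", b"")

print(author)
print(keywords)
```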

With that in mind, let's move on to your case.

Since you already have the dictionary, let's split it into variables:

# Previous code (from the question)

# Retrieve the dictionary, since it is stored inside a list
dados_recuperados = doc.info[0]

# Helper that returns a str, since the values come back as bytes
def decode_str(string):
    return string.decode("utf8")

# Finally, fetch each key and pass its value to the conversion function
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

# Do whatever you need with the variables.

You can see the script running on Ideone.
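
One caveat: decoding as UTF-8 works for Author, Subject and the two dates above, but values that begin with b'\xfe\xff' (such as Title in the output from the question) are UTF-16 big-endian text with a byte order mark, so .decode("utf8") will likely fail on them. Below is a minimal sketch of a more tolerant decoder. The name decode_pdf_string and the Latin-1 fallback are my own assumptions, not part of pdfminer:

```python
import codecs

def decode_pdf_string(value, default=""):
    """Hypothetical helper: convert a metadata value to str."""
    if value is None:
        # The key was missing from the metadata dictionary.
        return default
    if isinstance(value, str):
        return value
    if value.startswith(codecs.BOM_UTF16_BE):
        # Strip the b'\xfe\xff' byte order mark, then decode as UTF-16BE.
        body = value[len(codecs.BOM_UTF16_BE):]
        return body.decode("utf-16-be", errors="replace")
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: fall back to Latin-1, which accepts any byte value.
        return value.decode("latin-1")

# Values copied from the output in the question.
title = decode_pdf_string(b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O')
author = decode_pdf_string(b'Ivanete de Araujo Costa')
print(title, author)
```

With this, a call such as decode_pdf_string(dados_recuperados.get("Title")) should also return the default for files where the field is absent, rather than raising an error.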

  • Thank you! One of the files gave an encoding error. I put it here: https://docs.google.com/document/d/1byd6Z8hHU4CBMU9ozCSJxRNLlk55iy0_pVlFX3tp7kY/edit?usp=sharing Unfortunately, it seems I have files with different encodings. Do you know if there is a solution?

  • Is your script in a public IPython notebook or on GitHub?

  • It is here: https://github.com/reichaves/reftrab

  • It is now on GitHub along with the error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
