How to isolate metadata with pdfminer?

Asked 7 years, 10 months ago

I wrote this code in Python 3 to read the metadata of a PDF:

>>> from pdfminer.pdfparser import PDFParser
>>> from pdfminer.pdfdocument import PDFDocument
>>> fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
>>> parser = PDFParser(fp)
>>> doc = PDFDocument(parser)
>>> print(doc.info)

It produces this result:

[{'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}]

Does anyone know how to isolate the results into variables? For example, in the case above I would like to get:

a = "Ivanete de Araujo Costa" (Author field)
b = "EMD ADI - Emenda Aditiva" (Subject field)
c = "D:20170314114321-07'00'" (CreationDate field)
d = "D:20170314114321-07'00'" (ModDate field)

1 answer

by Marlysson • **905** points · Answer 1 · 2017-09-15T11:43:38+00:00

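Since you can already read the PDF's metadata, the rest is easy. doc.info returns a list that holds a dictionary, where each key points to a value. You read a value with get, whose syntax is as follows (chave is the key, valor_default is what comes back when the key is missing):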

dicionario.get("chave","valor_default")
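As a quick illustration of get, here is a minimal sketch using a hand-written dictionary shaped like the one in the question. The Keywords key and the empty-bytes fallback are made up for the example.

```python
# Sample dictionary mimicking the shape of doc.info[0] from the question.
info = {
    "Author": b"Ivanete de Araujo Costa",
    "Subject": b"EMD ADI - Emenda Aditiva",
}

# An existing key returns its value; the default is ignored.
author = info.get("Author", b"")

# A missing key returns the default instead of raising KeyError.
# "Keywords" is a hypothetical key used only to show the fallback.
keywords = info.get("Keywords", b"")

print(author)
print(keywords)
```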

With that in mind, let's apply it to your case.

Since you already have the dictionary, let's split it into variables:

# Previous code (from the question)

# Retrieve the dictionary here, since it is inside a list
dados_recuperados = doc.info[0]

# Helper that returns a str, since the content comes back as bytes.
# UTF-8 works for the four fields below because they are plain ASCII here.
def decode_str(string):
    return string.decode("utf8")

# Finally, fetch each key and pass its value to the conversion function.
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

# Do whatever you need with the variables.
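Note that Title, Creator and Producer in the question's output start with \xfe\xff, the UTF-16 big-endian byte-order mark, so decoding them as UTF-8 would raise an error. Below is a rough sketch of a more tolerant helper, assuming each value is either UTF-16 with that mark or plain single-byte text. The name decode_pdf_str, the Latin-1 fallback and the Keywords key are my own choices for illustration, not part of pdfminer.

```python
import codecs


def decode_pdf_str(value, default=""):
    """Decode a PDF metadata value (bytes) into a str."""
    if value is None:
        # The key was missing from the dictionary.
        return default
    if value.startswith(codecs.BOM_UTF16_BE):
        # Skip the 2-byte mark and decode the rest as UTF-16 big-endian.
        return value[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: fall back to Latin-1, which accepts any byte sequence.
        return value.decode("latin-1")


# Values copied from the output shown in the question.
dados = {
    "Title": b"\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O",
    "Author": b"Ivanete de Araujo Costa",
}

title = decode_pdf_str(dados.get("Title"))
author = decode_pdf_str(dados.get("Author"))
keywords = decode_pdf_str(dados.get("Keywords"))  # hypothetical missing key
```

With this, a missing key gives an empty string instead of an AttributeError on None.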

You can see the script running on Ideone.
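If you later need CreationDate or ModDate as a real date rather than a string, something like the sketch below may help. It assumes the full form seen in the question (D: prefix, 14 digits, then a signed offset with apostrophes); shorter or differently formatted dates are not handled. The function name parse_pdf_date is hypothetical.

```python
from datetime import datetime


def parse_pdf_date(text):
    """Parse a date such as D:20170314114321-07'00' into a datetime."""
    # Drop the "D:" prefix if present.
    cleaned = text[2:] if text.startswith("D:") else text
    # Remove the apostrophes so the offset reads as -0700.
    cleaned = cleaned.replace("'", "")
    return datetime.strptime(cleaned, "%Y%m%d%H%M%S%z")


creation = parse_pdf_date("D:20170314114321-07'00'")
print(creation.isoformat())  # 2017-03-14T11:43:21-07:00
```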