Text mining in Python or R


I am trying to extract information from PDF files to populate a table without having to read each PDF myself. However, I can’t find any references that explain how to do this.

I need, for example, to discover the authors and date of publication of this article:

https://s3.amazonaws.com/academia.edu.documents/43803310/Completing_an_intercalated_research_degr20160316-636-na87j7.pdf?AWSAccessKeyId=AKIAIWOYYGZ2Y53UL3A&Expires=1539106804&Signature=rSC0Kyg4%2FXltsqiX3eYuRmssc%3D&sponse-content-disposition=inline%3B%20filename%3DCompleting_an_intercalated_research_degr.pdf

I would like suggestions for packages or functions in Python or R.

Note: I am already able to extract the text from the PDF. What I do not know is how to find the information I need within that text, given that I do not know in advance the exact text to search for.

1 answer


PDF files can have special metadata fields for storing data such as author and date, but I opened the PDF you linked and these fields are not filled in:

[screenshot: PDF properties]

So there is no magic: you will need to parse the text and extract the data directly, since the PDF does not provide it in a structured way.

If you don’t know the exact text to search for, you can write down a list of plausible patterns and have your program try each one in turn until one of them yields the data.
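A minimal sketch of that "try each possibility" idea in Python, using only the standard library. The two heuristics, their regular expressions, and the sample text are assumptions made up for illustration; real articles will need patterns tuned to the journals you are dealing with.

```python
import re
from typing import Callable, Optional, Sequence

# A heuristic takes the extracted text and returns a value, or None if it finds nothing.
Heuristic = Callable[[str], Optional[str]]


def date_from_label(text: str) -> Optional[str]:
    """Look for a labelled date such as 'Published: 12 March 2016'."""
    m = re.search(r"(?:Published|Accepted|Received)\s*:?\s*(\d{1,2} \w+ \d{4})", text)
    return m.group(1) if m else None


def date_from_copyright(text: str) -> Optional[str]:
    """Fall back to a year next to a copyright mark, e.g. '© 2016'."""
    m = re.search(r"(?:©|\(c\)|Copyright)\s*(\d{4})", text, flags=re.IGNORECASE)
    return m.group(1) if m else None


def first_match(text: str, heuristics: Sequence[Heuristic]) -> Optional[str]:
    """Run the heuristics in order and return the first non-empty result."""
    for heuristic in heuristics:
        result = heuristic(text)
        if result:
            return result
    return None


# Made-up sample text, not taken from the linked PDF.
sample = "Some Journal\n© 2016 Some Publisher\nBody text..."
print(first_match(sample, [date_from_label, date_from_copyright]))
```

Ordering the list from most specific to least specific means a precise pattern wins whenever it matches, and the vaguer ones only act as fallbacks.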

For example, for the PDF you linked, you can compare each line of the text to the PDF file name to find the full title, and then treat the next line as the author list.
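Here is a rough sketch of the file-name comparison, assuming the file name is a (possibly truncated) version of the title. The function names, the 0.6 threshold, and the sample lines are all hypothetical; `difflib.SequenceMatcher` is just one way to score similarity.

```python
import difflib
import re
from typing import List, Optional, Tuple


def normalize(s: str) -> str:
    """Lowercase and turn every run of non-alphanumeric characters into one space."""
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()


def guess_title_and_author(
    lines: List[str], file_stem: str, threshold: float = 0.6
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (title, next non-empty line) for the line most similar to the file name."""
    target = normalize(file_stem)
    candidates = [ln.strip() for ln in lines if ln.strip()]
    best_index, best_score = None, 0.0
    for i, line in enumerate(candidates):
        # File names are often cut short, so compare against a prefix of equal length.
        prefix = normalize(line)[: len(target)]
        score = difflib.SequenceMatcher(None, target, prefix).ratio()
        if score > best_score:
            best_index, best_score = i, score
    if best_index is None or best_score < threshold:
        return None
    has_next = best_index + 1 < len(candidates)
    author = candidates[best_index + 1] if has_next else None
    return candidates[best_index], author


# Made-up lines and file name, for illustration only.
lines = [
    "Journal of Examples",
    "",
    "Text mining for metadata extraction",
    "A. Author and B. Author",
    "Abstract",
]
print(guess_title_and_author(lines, "Text_mining_for_metadata_extr"))
```

Taking "the next line" as the author is only a guess about the layout, so it is worth checking the result by eye on a few files before trusting it for a whole batch.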

Another option is to search the text for the acronym ISSN. If you find it, you can take the number that follows, look it up on a site such as https://www.ncbi.nlm.nih.gov/nlmcatalog?linkbar=plain&db=journals&term=1175-8716, and extract the data you want from that site instead of from the PDF.
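A small sketch of the ISSN search, again standard library only. The regular expression is an assumption about how the number appears in the text (the word ISSN followed shortly by `NNNN-NNNN`). The check-digit test follows the standard ISSN rule (weights 8 down to 2, modulo 11), which helps discard false positives. The URL is built with the same query parameters as the link above; actually fetching and parsing the page is left out.

```python
import re
from typing import List
from urllib.parse import urlencode

# "ISSN", then up to 20 non-digit characters, then the NNNN-NNNN number itself.
ISSN_RE = re.compile(r"ISSN[^0-9]{0,20}(\d{4})-(\d{3}[\dXx])\b", re.IGNORECASE)


def is_valid_issn(issn: str) -> bool:
    """Validate the ISSN check digit (weights 8..2 over the first seven digits, mod 11)."""
    digits = issn.replace("-", "").upper()
    if not re.fullmatch(r"\d{7}[\dX]", digits):
        return False
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return digits[7] == expected


def find_issns(text: str) -> List[str]:
    """Return the distinct, valid ISSNs found in the text, in order of appearance."""
    found: List[str] = []
    for m in ISSN_RE.finditer(text):
        issn = f"{m.group(1)}-{m.group(2).upper()}"
        if is_valid_issn(issn) and issn not in found:
            found.append(issn)
    return found


def catalog_url(issn: str) -> str:
    """Build an NLM Catalog search URL in the same form as the link in this answer."""
    query = urlencode({"linkbar": "plain", "db": "journals", "term": issn})
    return "https://www.ncbi.nlm.nih.gov/nlmcatalog?" + query


# Made-up snippet of extracted text containing the ISSN from the link above.
text = "Some header text ISSN 1175-8716 page 3"
for issn in find_issns(text):
    print(issn, catalog_url(issn))
```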
