Extract images from a PDF file using a Python script

Good morning, I have this code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
try:
    import textract
except ModuleNotFoundError:
    # textract is missing: install its system dependencies and the package, then retry
    os.system('sudo apt-get install -y python3 python-dev python-pip build-essential swig git libpulse-dev && pip3 install pocketsphinx && pip3 install textract')
    os.system('pip3 install textract')
    import textract
# Ask for the PDF file
ficheiro = input('Enter the PDF file: ')
# Process the file (textract returns the extracted text as bytes)
data = textract.process(ficheiro)
# Decode the text and print it to the screen
print(data.decode('utf8'))

The purpose of this code is to open a PDF file and extract both its text and its images, but it only extracts the text.

Does anyone have any idea how to solve this problem?

1 answer

I do not know how to do it in Python, but here is how to do it on Ubuntu:

1) Create a Makefile. This example uses a PDF file named 'litle_invest.pdf' and an output directory 'LI' (recipe lines must be indented with a tab):

P1:
	mkdir -p LI
	pdfimages -all litle_invest.pdf LI/img

P2 litle_invest.txt: litle_invest.pdf
	pdftotext litle_invest.pdf
	pdftotext -layout litle_invest.pdf
	pdftohtml litle_invest.pdf
	pdftohtml -xml litle_invest.pdf

2) On the command line:

$ make P1

$ make P2 (in the Makefile, 'P2 litle_invest.txt: litle_invest.pdf' is written all on one line, unlike P1, which has no prerequisite)

3) As a result, all images are extracted from the PDF into the 'LI' directory, and txt, html and xml versions of the document are created.

4) I hope I have helped.
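
Since the question asked for Python, the P1 step above can also be run from a script. This is only a minimal sketch, assuming pdfimages is installed (on Ubuntu it comes with the poppler-utils package); the function names build_pdfimages_command and extract_images are made up for this example and are not part of textract or Poppler.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, out_dir, prefix="img"):
    """Return the argument list for `pdfimages -all <pdf> <out_dir>/<prefix>`."""
    return ["pdfimages", "-all", str(pdf_path), str(Path(out_dir) / prefix)]


def extract_images(pdf_path, out_dir="LI"):
    """Run pdfimages on pdf_path and return the files now present in out_dir."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found; on Ubuntu install poppler-utils")
    # Same effect as `mkdir -p LI` in the Makefile
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits with an error
    subprocess.run(build_pdfimages_command(pdf_path, out_dir), check=True)
    return sorted(Path(out_dir).iterdir())


if __name__ == "__main__":
    ficheiro = input("Enter the PDF file: ")
    for image in extract_images(ficheiro):
        print(image)
```

The text part of the original script can stay as it is; this only adds the image extraction that textract does not do.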
