Extract images from a PDF file using a Python script

Question

Asked 5 years, 4 months ago

Viewed 257 times

-3

Good morning, I have this code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
try:
    from textract import process
except ModuleNotFoundError:
    # First run on Debian/Ubuntu: install the system packages, then textract
    os.system('sudo apt-get install -y python3 python-dev python-pip build-essential swig git libpulse-dev && pip3 install pocketsphinx && pip3 install textract')
    # Retry the textract install in case the chained command above stopped early
    os.system('pip3 install textract')
    from textract import process
# Ask for the PDF file
ficheiro = input('Enter the PDF file: ')
# Process the file (this returns the extracted text as bytes)
data = process(ficheiro)
# Decode the text and print it to the screen
print(data.decode('utf8'))

The purpose of this code is to open a PDF file and extract both the text and the images from it, but it only extracts the text.

Does anyone have any idea how to solve this problem?

1 answer

by hedge-20 • 11 points · Answer 1 · 2020-04-13T09:56:11+00:00

I don't know how to do it in Python, but here is how to do it on Ubuntu (pdfimages, pdftotext and pdftohtml come from the poppler-utils package):

1) Create a Makefile. This example uses a PDF file named 'litle_invest.pdf' and an output directory named 'LI'. Each recipe line must start with a tab:

P1:
	mkdir -p LI
	pdfimages -all litle_invest.pdf LI/img

P2 litle_invest.txt: litle_invest.pdf
	pdftotext litle_invest.pdf
	pdftotext -layout litle_invest.pdf
	pdftohtml litle_invest.pdf
	pdftohtml -xml litle_invest.pdf

2) On the command line:

$ make P1

$ make P2

Note that for P2 the rule header 'P2 litle_invest.txt: litle_invest.pdf' is all on one line, unlike P1, which has no prerequisite.

3) The result: all images are extracted from the PDF into the 'LI' directory, and txt, html and xml versions of the document are created.

4) I hope I have helped.
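
Since the question asked for a Python script, the same two tools could also be driven from Python with subprocess. This is only a minimal sketch, not a tested solution: the function names build_commands and extract are made up for illustration, and it assumes pdfimages and pdftotext are installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, out_dir):
    """Return the pdfimages and pdftotext command lines as argument lists.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    pdf = Path(pdf_path)
    out = Path(out_dir)
    return [
        # -all writes each image in its native format; "img" is the file name prefix
        ["pdfimages", "-all", str(pdf), str(out / "img")],
        # -layout keeps the physical layout of the text; output goes to <name>.txt
        ["pdftotext", "-layout", str(pdf), str(out / (pdf.stem + ".txt"))],
    ]


def extract(pdf_path, out_dir):
    """Create out_dir and run both tools; raises CalledProcessError on failure."""
    for tool in ("pdfimages", "pdftotext"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found; on Ubuntu install poppler-utils")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(pdf_path, out_dir):
        subprocess.run(cmd, check=True)
```

Calling extract("litle_invest.pdf", "LI") should then leave the images and a litle_invest.txt file in the 'LI' directory, the same as the P1 target plus the 'pdftotext -layout' step of P2.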