How to extract a snippet of text with the PageObject object from the PyPDF2 library?

Question

Asked 6 years, 4 months ago

Viewed 610 times

I want to extract a section of text from a PDF with the PyPDF2 library in Python. I need to find the word "Lc" inside the PDF and extract it: the word "Lc", which stands for leasing, is followed by a number (the order number) that varies from document to document.

The solution I had in mind was to collect the term "Lc" plus the next 5 characters. I don't know how to do it, as you can see from my tests (see code below). I'm just starting out in programming, so maybe I'm missing something crucial.

The question is: how can I extract this snippet from the PageObject object of PyPDF2? Keep in mind that I also need the 5 characters that follow the search term, which are unknown in advance and vary from document to document.

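import PyPDF2

# PdfFileReader accepts a file path directly; no 'rb' mode argument is needed
reader = PyPDF2.PdfFileReader(r'C:\Users\1\Desktop\Escaneados\2.pdf')
p = reader.getPage(0)  # first page of the document
text = p.extractText()
search_word = "lc"

# Hard-coded slice: only works if the term sits at the same offset in every file
print(text[27:35])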

This PDF was apparently generated by scanning a physical document. A scan always produces a "photo" of the scanned document. An OCR program converts the parts of the image that contain text into plain text (characters), but some sections may come out flawed. Question: was the file run through OCR software to turn the "photo" of the document into plain text? If the document contains only the photo and no plain text, this will not work. Likewise, if the part of the photo showing the text of interest was not converted into plain text, it will not work.

– José

2019/03/23 at 09:42

Good morning José. The PDF was generated by a multifunction printer that is configured to produce searchable files through OCR. The correction you proposed is exactly the point where I am stuck. I need to extract the numbers that follow "Lc" so that I can save these PDFs under the order number and find them later. Thank you.

– Diego bronstein lopes

2019/03/23 at 12:25

1 answer

by nosklo • **5,801** points · Answer 1 · 2019-03-23T13:02:39+00:00

Get the position of lc in the text (find returns -1 if it is not there):

pos = text.find('lc')

Then use this variable when getting the slice you want:

print(text[pos:pos+10])
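
As a hedged sketch of the slicing approach: the question mentions both "Lc" and "lc", and str.find returns -1 when the term is absent, so the version below lowercases the text for the lookup and checks for a miss before slicing. The helper name snippet_after and the sample string are hypothetical, used only for illustration; in practice the text would come from extractText().

```python
def snippet_after(text, term="lc", length=5):
    """Return the `length` characters that follow `term`, or None if absent.

    Hypothetical helper: the name and defaults are assumptions for this sketch.
    """
    # Case-insensitive lookup; lower() keeps the indices aligned for ASCII text
    pos = text.lower().find(term.lower())
    if pos == -1:
        return None
    start = pos + len(term)
    return text[start:start + length]


# Made-up sample standing in for the output of extractText()
sample = "Contract Lc12345 signed"
print(snippet_after(sample))
```

Returning None on a miss avoids the silent surprise of slicing from index -1, which would otherwise return characters from the end of the text.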

Another option is to use regular expressions:

import re
m = re.search(r'lc\s+(\d+)', text)
if m:
    print(m.group(1))
else:
    print('not found')
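
Building on the regular expression idea, here is a minimal sketch that ignores letter case (the question writes "Lc" while the code searches for "lc") and allows the number to follow with or without a space. The pattern, the function names order_number and target_name, and the "Lc_<number>.pdf" naming scheme are all assumptions for illustration; OCR output may need a different pattern.

```python
import re

# Assumed pattern: "lc" in any letter case, optional whitespace, then digits
ORDER_RE = re.compile(r"lc\s*(\d+)", re.IGNORECASE)


def order_number(text):
    """Return the digits following "Lc" as a string, or None if not found."""
    m = ORDER_RE.search(text)
    return m.group(1) if m else None


def target_name(text, fallback="unknown"):
    """Build a hypothetical file name such as 'Lc_12345.pdf' from page text."""
    number = order_number(text)
    return f"Lc_{number or fallback}.pdf"


# Made-up sample standing in for the output of extractText()
sample = "Leasing Lc 12345 approved"
print(order_number(sample))
print(target_name(sample))
```

Capturing only the digits means the order number can have any length, so there is no need to guess at "the next 5 characters".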