I need to extract a piece of text from a PDF using the PyPDF2 library in Python. Specifically, I need to find the term "Lc" inside the PDF and extract it: "Lc", which stands for leasing, is followed by a number (the order number) that varies from document to document.
The solution I had in mind is to collect the term "Lc" plus the next 5 characters. I don't know how to do that, as you can see from my tests (see the code below). I'm just starting out in programming, so I may be missing something crucial.
The question is: how can I extract that piece of text from the PyPDF2 PageObject? Keep in mind that I also need to collect the 5 characters that follow the search term, since they vary from document to document and are not known in advance.
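import PyPDF2

# PdfFileReader takes a path or a binary file object; its second positional
# argument is `strict`, not a file mode, so 'rb' does not belong here.
reader = PyPDF2.PdfFileReader(r'C:\Users\1\Desktop\Escaneados\2.pdf')
p = reader.getPage(0)     # first page, a PageObject
text = p.extractText()    # text layer of the page as a str
search_word = "lc"        # not used yet: this is the part I don't know how to do
print(text[27:35])        # fixed slice, only right for this one file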
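The "term plus the next 5 characters" idea can be tried on a plain string, without touching the PDF at all. What follows is only a minimal sketch: the helper name term_with_tail and the sample text are made up for illustration, and it assumes the string returned by extractText() really contains "Lc" immediately followed by the order number.

```python
def term_with_tail(text, term, tail=5):
    """Return `term` as it appears in `text` plus the next `tail` characters, or None."""
    # Case-insensitive search: look in a lowercase copy, slice the original.
    # Assumes lowercasing does not change the string length (true for ASCII
    # and typical Portuguese text).
    start = text.lower().find(term.lower())
    if start == -1:
        return None
    return text[start:start + len(term) + tail]


# Made-up sample standing in for the output of extractText().
sample = "Contrato Lc12345 emitido em 2020"
print(term_with_tail(sample, "lc"))  # prints Lc12345
```

A plain substring search like this also matches "lc" in the middle of another word, so it is only reliable if the term does not occur earlier in the page text.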
This PDF was apparently generated by scanning a physical document. A scan always produces a "photo" of the document. OCR software turns the parts of the image that contain text into plain text (characters), but some sections may come out wrong. Question: was the file run through OCR software to turn the "photo" of the document into plain text? If the document holds only an image and no plain text, this will not work. Likewise, if the part of the image containing the text of interest was not converted to plain text, it will not work.
– José
Good morning, José. The PDF was generated by a multifunction printer that is configured to produce searchable files through OCR. The point you raised is exactly what is holding me up. I need to extract the numbers that follow "Lc" so that I can save these PDFs under the order number and find them later. Thank you.
– Diego bronstein lopes
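For the renaming goal mentioned in the last comment, a regular expression may cope better with OCR output than a fixed character count, since OCR sometimes inserts a space or punctuation between "Lc" and the number. This is a rough sketch under stated assumptions: that the order number consists of digits only, and that the new file name should be "Lc" plus that number. The names ORDER_RE, order_number and renamed_path, the sample text, and the path "scans/2.pdf" are all hypothetical. The code only computes the new path; it does not rename anything on disk.

```python
import re
from pathlib import Path

# Assumed pattern: "Lc" at the start of a word, an optional separator, then digits.
ORDER_RE = re.compile(r"\bLc\s*[-:.]?\s*(\d+)", re.IGNORECASE)


def order_number(text):
    """Return the digits that follow "Lc" in `text`, or None if there is no match."""
    match = ORDER_RE.search(text)
    return match.group(1) if match else None


def renamed_path(pdf_path, number):
    """Build a sibling path named after the order number, keeping the extension."""
    path = Path(pdf_path)
    return path.with_name(f"Lc{number}{path.suffix}")


# Made-up sample standing in for the text layer of a scanned page.
text = "Pedido LC 12345\nCliente: ..."
number = order_number(text)
if number is None:
    print("No order number found; the page may have no usable text layer.")
else:
    print(renamed_path("scans/2.pdf", number))
```

The None branch matters for scanned files: if OCR missed the region containing "Lc", the extracted text will simply not contain it.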