Python Textract returns an error when trying to extract text from a PDF

-2

I am using the Python library Textract to extract text from a PDF file, but I get an error when running the script. Below are the script and the error shown in the console.

import textract

text = textract.process("C:\Users\Willian Ambisis\Downloads\licenca-ambiental2.pdf")
print(text)

Error:

File "c:\Users\Willian Ambisis\Desktop\Textract\textractPython.py", line 3
    text = textract.process("C:\Users\Willian Ambisis\Downloads\licenca-ambiental2.pdf")
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2 answers

-1

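This error occurs because the file path is written as a normal string, where a backslash starts an escape sequence. Here the \U in \Users is read as the start of an eight-digit Unicode escape, and the characters that follow are not valid hex digits. There are three ways to fix it: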

1: Add r before the string to turn it from a normal string into a raw string:

text = textract.process(r"C:\Users\Willian Ambisis\Downloads\licenca-ambiental2.pdf")

2: Use forward slashes instead of backslashes:

text = textract.process("C:/Users/Willian Ambisis/Downloads/licenca-ambiental2.pdf")

3: Double each backslash to escape it:

text = textract.process("C:\\Users\\Willian Ambisis\\Downloads\\licenca-ambiental2.pdf")
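As a quick sanity check, the three spellings above can be compared without calling Textract at all. This is only a minimal sketch; the variable names raw, forward and doubled are made up for illustration.

```python
# The three ways of writing the path from the list above.
raw = r"C:\Users\Willian Ambisis\Downloads\licenca-ambiental2.pdf"
forward = "C:/Users/Willian Ambisis/Downloads/licenca-ambiental2.pdf"
doubled = "C:\\Users\\Willian Ambisis\\Downloads\\licenca-ambiental2.pdf"

# The raw string and the doubled-backslash string are the same string.
print(raw == doubled)  # True

# The forward-slash version differs only in the separator character.
print(forward.replace("/", "\\") == raw)  # True
```

Options 1 and 3 give Python exactly the same characters, while option 2 relies on Windows accepting forward slashes as path separators, which it generally does for ordinary file access.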

-2


The problem is that you wrote the backslashes in the file path as if they were ordinary characters, but in a normal string Python treats them as special characters. From the documentation:

The backslash character (\) is used to escape characters that otherwise have a special meaning, such as a new line, the backslash itself or the quote character.

Try adding the prefix r to your string, which makes Python treat the backslashes literally for you:

import textract

#                       v
text = textract.process(r"C:\Users\Willian Ambisis\Downloads\licenca-ambiental2.pdf")
print(text)
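To see the difference without needing Textract or a real PDF, the error can be reproduced by compiling a one-line program from a string. This is only an illustrative sketch: the path C:\Users\file.pdf and the names bad_source, good_source and namespace are made up for the example.

```python
# Source code of a one-line program that uses a normal (non-raw) string.
# The outer literal is raw so the backslashes reach compile() unchanged.
bad_source = r'path = "C:\Users\file.pdf"'

try:
    compile(bad_source, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    # Python rejects the \U escape before the program ever runs.
    error_message = str(exc)

print(error_message)

# With the r prefix, the same path compiles and keeps its backslashes.
good_source = r'path = r"C:\Users\file.pdf"'
namespace = {}
exec(compile(good_source, "<example>", "exec"), namespace)
print(namespace["path"])
```

The first print should show the same 'unicodeescape' message as in the question. That also explains why no try/except inside the original script could catch it: the error is raised while the file is being compiled, before any line of it runs.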
