Limit characters in Tesseract Portable

I am currently using Tesseract Portable, integrated with Java, to recognize some characters, but I'm running into problems such as the following.

Some fields contain only a date, for example: 01/02/2013

But the output looks something like this: 0Il0S/S013

The errors don't follow any pattern. Does anyone know whether it is possible to define a dictionary restricted to characters such as 0-9 and /?

Note: I know this exists for the C version, but I haven't found it for the Portable version yet.

1 answer

I've only used Tesseract on Linux, either from the command line or in scripts that call the command line to do the job...

1) create a configuration file mydata with the valid characters:

tessedit_char_whitelist 0123456789/-

2) then invoke Tesseract as:

tesseract f.png zzz mydata

which produces zzz.txt containing only digits, '/' and '-'.
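
Since the question is about Java, here is a minimal sketch of driving the same two steps from Java by calling the command line. I haven't tried it with the Portable build. It assumes the tesseract executable is on the PATH and that the config file is picked up from the working directory, as in the command above. The class and method names (WhitelistOcr, writeConfig, buildCommand) are just illustrative, not part of any Tesseract API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class WhitelistOcr {

    // Step 1: write a one-line config file restricting recognition to allowedChars.
    public static Path writeConfig(Path configFile, String allowedChars) throws IOException {
        String line = "tessedit_char_whitelist " + allowedChars + System.lineSeparator();
        return Files.write(configFile, line.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2: build the command line: tesseract <image> <outputBase> <config>
    public static List<String> buildCommand(String image, String outputBase, Path configFile) {
        return Arrays.asList("tesseract", image, outputBase, configFile.toString());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length < 2) {
            System.out.println("usage: java WhitelistOcr <image> <outputBase>");
            return;
        }
        Path config = writeConfig(Paths.get("mydata"), "0123456789/-");

        // Assumption: "tesseract" is installed and reachable through the PATH.
        Process process = new ProcessBuilder(buildCommand(args[0], args[1], config))
                .inheritIO()
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Tesseract appends ".txt" to the output base name.
        byte[] bytes = Files.readAllBytes(Paths.get(args[1] + ".txt"));
        System.out.println(new String(bytes, StandardCharsets.UTF_8).trim());
    }
}
```

Running it as java WhitelistOcr f.png zzz should be equivalent to the two manual steps above.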

For good results it is worth investing in the quality (resolution) of the initial image...

If you need to recognize more than these characters, it is probably useful to specify the language as well (the -l option).

I would expect the Java interface, the C interface, and so on to offer a way to define such "whitelists".

There is also the possibility of retraining Tesseract (though I doubt it is justified here).
