There are engines that do the hard work of image processing, which makes extracting the characters from an image a relatively simple task. The best known is Tesseract, but it was not developed in Java. For this reason, we will use a JNA wrapper called Tess4J, which lets us call the engine's native methods from Java.
Link to download Tess4J:
https://sourceforge.net/projects/tess4j/?source=typ_redirect
Downloading Tess4J
Go to the Tess4j project page and download the latest version.
Configuring the libraries
Extract the following files and folders into the lib folder of your project:
Win32-x86/
Win32-x86-64/
commons-io-2.4.jar
ghost4j-0.5.1.jar
jai_imageio.jar
jna-4.1.0.jar
junit-4.10.jar
log4j-1.2.17.jar
tess4j.jar
Also unzip the tessdata folder at the root of your project:
tessdata/
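A missing jar or a misplaced tessdata folder usually only shows up as an error at run time, so it can help to check the layout first. The following is a minimal sketch using only the standard library. The class ProjectLayoutCheck and its method are hypothetical helpers, not part of Tess4J, and the sketch assumes the program is started from the project root and that the entry names match the files you actually extracted.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tess4J): reports which expected
// files or folders are absent from the project.
class ProjectLayoutCheck {

    // Returns the entries of `expected` that do not exist under `projectRoot`.
    static List<String> missingEntries(Path projectRoot, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!Files.exists(projectRoot.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumption: the working directory is the project root.
        Path root = Paths.get("").toAbsolutePath();
        // A few entries from the list above; adjust to the version you downloaded.
        List<String> expected = Arrays.asList(
                "lib/tess4j.jar", "lib/jna-4.1.0.jar", "lib/jai_imageio.jar", "tessdata");
        List<String> missing = missingEntries(root, expected);
        System.out.println(missing.isEmpty() ? "Layout looks complete" : "Missing: " + missing);
    }
}
```

The check only confirms that the entries exist. It does not verify that the jars are on the classpath of your build or IDE.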
Writing the code to read the images
As an example, I will use a scanned page I found through Google Images.
    package br.com.danilotl.ocr;

    import java.io.File;
    import net.sourceforge.tess4j.*;

    public class ReadImage {
        public static void main(String[] args) {
            File imageFile = new File("page.jpg");
            Tesseract instance = Tesseract.getInstance();
            instance.setLanguage("eng");
            try {
                String result = instance.doOCR(imageFile);
                System.out.println(result);
            } catch (TesseractException e) {
                System.err.println(e.getMessage());
            }
        }
    }
Let’s look at the main points of the code above:
import java.io.File;
import net.sourceforge.tess4j.*;
Here we import the java.io.File class, which represents the image file, and the Tess4J classes, which we need in order to call the methods of its API.
File imageFile = new File("page.jpg");
Here we create an object of type File, passing the path to the image to its constructor. In this case, the page.jpg file is at the root of the project.
Tesseract instance = Tesseract.getInstance();
instance.setLanguage("eng");
Here we obtain an instance of the Tesseract class and then set the language in which the text of our image is written. In this case, the text is in English. If you need to read other languages (such as Portuguese, which has accented characters), download the corresponding language file from the Downloads section of the Tesseract page, unpack it into the tessdata folder, and set the corresponding language in your code.
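To see which language codes you can pass to setLanguage(), you can list what is installed in the tessdata folder. Tesseract language files are named after their code, for example eng.traineddata for English or por.traineddata for Portuguese. The following is a minimal sketch using only the standard library. The class InstalledLanguages is a hypothetical helper, not part of Tess4J, and it assumes tessdata sits in the working directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not part of Tess4J): lists the language codes
// for which a .traineddata file exists in a tessdata folder.
class InstalledLanguages {

    private static final String SUFFIX = ".traineddata";

    // Returns the sorted codes (e.g. "eng" for eng.traineddata);
    // the list is empty if the folder is missing or unreadable.
    static List<String> list(File tessdataDir) {
        List<String> codes = new ArrayList<>();
        File[] files = tessdataDir.listFiles();
        if (files == null) {
            return codes; // not a directory, or an I/O error occurred
        }
        for (File f : files) {
            String name = f.getName();
            if (f.isFile() && name.endsWith(SUFFIX)) {
                codes.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(codes);
        return codes;
    }

    public static void main(String[] args) {
        // Assumption: tessdata is in the working directory (the project root).
        System.out.println(list(new File("tessdata")));
    }
}
```

If the code you want is not in the printed list, the matching language file still has to be downloaded and unpacked into tessdata before setLanguage() can use it.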
    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }
Finally, we read the image with the doOCR() method, passing the image file as the argument, and print the output to the console. Comparing the output with the original image, the reading is very accurate and contains very few errors.
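If you want to put a number on "very few errors", one common approach is to type out the page by hand and compute the character error rate, that is, the edit distance between the expected text and the OCR output divided by the length of the expected text. The following is a minimal sketch using only the standard library. The class OcrAccuracy and the sample strings are hypothetical and not part of Tess4J; in practice the second argument would be the string returned by doOCR().

```java
// Hypothetical helper (not part of Tess4J): measures how far OCR output
// is from a hand-typed reference text.
class OcrAccuracy {

    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions that turn `a` into `b`.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Character error rate: edits needed per character of the expected text.
    static double errorRate(String expected, String actual) {
        if (expected.isEmpty()) {
            return actual.isEmpty() ? 0.0 : 1.0;
        }
        return (double) editDistance(expected, actual) / expected.length();
    }

    public static void main(String[] args) {
        // Made-up strings for illustration; "actual" stands in for the doOCR() result.
        String expected = "Tesseract reads text";
        String actual = "Tesseract rea0s text";
        System.out.println("Character error rate: " + errorRate(expected, actual));
    }
}
```

This implementation keeps only two rows of the distance table, so memory use grows with the length of one string rather than with the product of both lengths.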
The Tess4J walkthrough above (download, configuration, and the ReadImage class) comes from the post linked below:
http://danilotl.com.br/blog/reconhecendo-caracteres-em-imagens-com-java-e-tess4j/
What do you mean? What background? With only this information, we cannot help you. Edit the question and add more details, if possible including the code snippet that raised the question.
– user28595
@Lucas did you manage to capture CMC7 in the end?
– flaviussn