How to capture CMC7 code with the Tesseract API?

Viewed 629 times

2

To give some context: I am reading characters from images using the Tesseract API for Java, tess4j. More specifically, the images are of bank checks, from which I need to capture the CMC7 code. The problem is that the API cannot recognize the font used by the code. I did a lot of research and implemented the code that performs the reading, but without success. Here it is:

Image for reading: (image of the check, omitted)

Code:

public static void main(String[] args) {
    try {
        File imageFile = new File("D:/teste.png");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setLanguage("mcr");
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}

The trained data file used is mcr.traineddata.

After reading the image above, I get the following output: d8d0225255dd5582251558825 851581288882888888588111185801126911855888 212858801168185865810165125812086510.

What should I do?

  • What do you mean? Which font? With only this information we cannot help you. Please edit the question and add more details, if possible with the code snippet that raised the doubt.

  • @Lucas did you finally manage to capture the CMC7?

2 answers

1

There are some engines that do the heavy lifting of image manipulation, making the extraction of your characters a relatively simple task. The best known is Tesseract, but it was not written in Java. For that reason, we will use a JNA wrapper called Tess4j, which lets us call the engine's native methods from Java.

Link to download TESS4J:

https://sourceforge.net/projects/tess4j/?source=typ_redirect

  1. Downloading Tess4j: go to the Tess4j project page and download the latest version.

  2. Configuring the libraries: unzip the files below into the lib folder of your project:

    Win32-x86/
    Win32-x86-64/
    commons-io-2.4.jar
    ghost4j-0.5.1.jar
    jai_imageio.jar
    jna-4.1.0.jar
    junit-4.10.jar
    log4j-1.2.17.jar
    tess4j.jar

Also unzip the tessdata folder at the root of your project:

tessdata/

  3. Writing the code to read the images: as an example, I will use a scanned page I found through Google Images.

package br.com.danilotl.ocr;

import java.io.File;
import net.sourceforge.tess4j.*;

public class ReadImage {

    public static void main(String[] args){ 

        File imageFile = new File("page.jpg");
        Tesseract instance = Tesseract.getInstance();
        instance.setLanguage("eng");

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

Let’s look at the main points of the code above:

import java.io.File;
import net.sourceforge.tess4j.*;

Here we import the java.io.File class, responsible for creating a representation of the image file, and the Tess4j classes, needed to use the methods of its API.

File imageFile = new File("page.jpg");

Here we create a File object, passing to its constructor the path where the image is located. In this case, the page.jpg file is at the root of the project.

Tesseract instance = Tesseract.getInstance();
instance.setLanguage("eng");

Here we obtain an instance of the Tesseract class and then set the language in which the text of our image is written; in this case, English. If you need to read other languages (such as Portuguese, which has accented characters, for example), download the language file in question from the Downloads section of the Tesseract page, unpack it inside the tessdata folder, and set the corresponding language in your code.

try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

Finally, we read the image through the doOCR() method, passing the image as argument, and then display the output in the console. Comparing it with the original page, the reading is very accurate and contains very few errors.

This information is contained in the link below:

http://danilotl.com.br/blog/reconhecendo-caracteres-em-imagens-com-java-e-tess4j/

  • Do you have a notice somewhere on your blog that allows its content to be copied elsewhere? If not, could you write an answer of your own, even if based on that article?

0

I recommend that, before sending the image to Tesseract for reading, you do some image preprocessing. There are tools that even Tesseract's own wiki recommends using to increase the quality/accuracy of the recognition.

If your images follow this pattern (gray background and only numbers), applying a thresholding (binarization) step and telling Tesseract to identify only numbers should give you a very significant improvement.
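
A minimal thresholding sketch in Java, using only the JDK, could look like the following; the fixed threshold of 128 and the simple channel average are assumptions you would tune for real checks (an adaptive method such as Otsu's usually works better):

```java
import java.awt.image.BufferedImage;

public class Threshold {

    // Binarize: every pixel darker than the threshold becomes black, the rest white.
    public static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                // Simple average as the intensity; good enough for gray scans.
                int luminance = (r + g + b) / 3;
                int v = (luminance < threshold) ? 0x000000 : 0xFFFFFF;
                out.setRGB(x, y, 0xFF000000 | v);
            }
        }
        return out;
    }
}
```

Since tess4j's doOCR() also accepts a BufferedImage, the binarized image can be passed to it directly, e.g. instance.doOCR(Threshold.binarize(ImageIO.read(imageFile), 128)).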

This is based on a project made in C# that manipulates the image and then passes it to Tesseract for recognition; you can test the project to understand how it works a little better. It contains the following code:

private string OCR(Bitmap b)
{
    string res = "";
    using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
    {
        engine.SetVariable("tessedit_char_whitelist", "1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ");
        engine.SetVariable("tessedit_unrej_any_wd", true);

        using (var page = engine.Process(b, PageSegMode.SingleLine))
            res = page.GetText();
    }
    return res;
}

Through one of these variables you can assemble Tesseract's "dictionary" (the whitelist); in your case, since you only use numbers, you could set it as follows:

engine.SetVariable("tessedit_char_whitelist", "1234567890");
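
For readers working from the question's Java/tess4j setup rather than this C# wrapper, tess4j exposes the same Tesseract variables. A sketch of the equivalent configuration fragment (assuming a tess4j version that provides setTessVariable):

```java
ITesseract instance = new Tesseract();
// Restrict recognition to digits only; same variable as in the C# example.
instance.setTessVariable("tessedit_char_whitelist", "0123456789");
```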

I believe that simply feeding the raw image to Tesseract and expecting it to recognize the characters will not be very successful.

That article cites some essential points for the characters to be extracted, which are:

  • Only two color channels (black and white), whether grayscale (0 ≤ Vi ≤ 255) or a binary image (Vi == 0 || Vi == 255), where Vi is the intensity value.
  • Aligned/standardized text, free of noise (which is usually introduced during the binarization stage).
  • Box height (the space occupied by the characters) greater than the 10 px minimum.
  • Ideal density of 300 dpi, or proportional to the point above.
  • The extractable text in a single alphabet (or language).
  • No useless space, i.e., borders around the text.
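
To satisfy the box-height and density points above when the source scan is small, one option is a simple upscale before binarization. A sketch using only the JDK (the scale factor is an assumption you would tune per image):

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class Upscale {

    // Enlarge a small scan so that character boxes exceed the ~10 px minimum.
    public static BufferedImage scale(BufferedImage src, int factor) {
        BufferedImage out = new BufferedImage(src.getWidth() * factor,
                src.getHeight() * factor, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        // Bicubic interpolation keeps digit edges smoother than the default.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, out.getWidth(), out.getHeight(), null);
        g.dispose();
        return out;
    }
}
```

The enlarged image can then go through the thresholding step before being handed to Tesseract.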
