Problems reading a PDF image with TESS4J

I recently started developing a small executable JAR that converts PDFs to text files; it will run in a Windows environment.

Using TESS4J 3.3.1, I developed the following process:

A) The user can choose to supply a PDF or an image;

B) If it is a PDF, the system converts it to an image using GHOST4J;

C) The image is converted to text using TESS4J.
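
Step A needs a way to tell a PDF apart from an image. A minimal sketch, assuming the decision is made from the file's leading bytes (PDF files start with the signature `%PDF-`) rather than from its extension. The class and method names (`InputKind`, `looksLikePdf`) are hypothetical and not part of the original program.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decides whether step B (PDF -> image) is needed.
final class InputKind {

    private static final byte[] PDF_SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the given leading bytes start with the PDF signature.
    static boolean looksLikePdf(byte[] header) {
        if (header == null || header.length < PDF_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < PDF_SIGNATURE.length; i++) {
            if (header[i] != PDF_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first bytes of the file and applies the signature check.
    static boolean looksLikePdf(File file) throws IOException {
        byte[] header = new byte[PDF_SIGNATURE.length];
        try (InputStream in = new FileInputStream(file)) {
            int total = 0;
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                total += n;
            }
            return total == header.length && looksLikePdf(header);
        }
    }
}
```

With this in place, the caller could run `PDFToImage.convert(file)` only when `InputKind.looksLikePdf(file)` is true and pass the file straight to TESS4J otherwise.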

For most of the tested files the program worked correctly, but when I supplied an invoice (nota fiscal) PDF containing a logo, step C could not convert even 10% of the image to text.

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import javax.imageio.ImageIO;

import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;

public class PDFToImage {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    public static List<File> convert(File filePDF) throws Exception{
        PDFDocument document = new PDFDocument();
        try {
            document.load( new FileInputStream( filePDF ) );
        } catch (IOException e) {
            throw e;
        }

        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution( 300 );

        List<Image> renderedImageList = null;
        try {
            renderedImageList = renderer.render(document);
        } catch (Exception e) {
            throw e;
        }

        List<File> fileImageList = new ArrayList<File>();
        try {
            for( Image i : renderedImageList ){
                File f = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + filePDF.getName() + "_" + renderedImageList.indexOf( i ) + sdf.format( new Date() ) + ".png" ); 
                ImageIO.write((RenderedImage) i, "png", f);
                fileImageList.add( f );
            }
        } catch (Exception e) {
            throw e;
        }

        return fileImageList;
    }

}

Test file:

import java.io.File;
import java.util.List;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Basic  {

    // Test: A, B and C
//  public static void main(String[] args) throws Exception {
//      File pdfFile = new File("C:\\Users\\story\\Desktop\\ocr_test\\source_pdf.pdf");
//
//      List<File> imageList = PDFToImage.convert(pdfFile);
//
//      ITesseract instance = new Tesseract();
//      instance.setLanguage("eng");
//      instance.setDatapath("C:\\Users\\story\\Desktop\\ocr_test\\tessdata");
//
//      for( File i : imageList ){
//          try {
//              String result = instance.doOCR( i );
//              System.out.println(result);
//          } catch (TesseractException e) {
//              System.err.println(e.getMessage());
//          }
//      }
//  }

    // Test: B and C
    public static void main(String[] args) throws Exception {
        ITesseract instance = new Tesseract();
        instance.setLanguage("eng");
        instance.setDatapath("C:\\Users\\story\\Desktop\\ocr_test\\tessdata");
        try {
            String result = instance.doOCR( new File("C:\\Users\\story\\Desktop\\ocr_test\\source_png_split.png") );
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }

}

Problematic PDF image:

Header of the problematic PDF; if I remove this logo, it works perfectly

If I remove this logo (from the same page), the image is converted perfectly! So I have two questions:

1) In TESS4J: is there a way to prevent this error?

2) In GHOST4J: is there a way to leave this embedded image out when rendering the PDF to the final image?

1 answer


I’ve solved the problem! After a little more research on Google, I modified the PDFToImage class and was able to solve the problem in two different ways:

package core;

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.ResourceBundle;

import javax.imageio.ImageIO;

import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;

import util.Utils;

public class PDFToImage {

    private static final ResourceBundle properties = ResourceBundle.getBundle( "properties/configuration" );
    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    @SuppressWarnings("rawtypes")
    public static List<File> convert(File preFilePDF, Class clazz) throws Exception {
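        // Start of added section
        File filePDF = preFilePDF;

        // Option 2: cover the problematic image with a stamp (iText)
        if( Boolean.parseBoolean( properties.getString("PDF_STAMP_IMAGE") ) ){
            filePDF = PDFStamper.convert( preFilePDF, clazz);
        }

        // Option 1: strip the images from the PDF (PDFBox).
        // Note: this reads preFilePDF, so if both flags are true the stamped file is discarded.
        if( Boolean.parseBoolean( properties.getString("PDF_REMOVE_IMAGE") ) ){
            filePDF = PDFRemoveImage.convert( preFilePDF );
        }
        // End of added section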

        PDFDocument document = new PDFDocument();
        try {
            document.load( new FileInputStream( filePDF ) );
        } catch (IOException e) {
            throw e;
        }

        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution( 300 );

        List<Image> renderedImageList = null;
        try {
            renderedImageList = renderer.render(document);
        } catch (Exception e) {
            throw e;
        }

        // Fail only when the file is neither readable, writable nor executable.
        if( !filePDF.canRead() 
                && !filePDF.canWrite()
                && !filePDF.canExecute() ){
            throw new Exception("No permission on path "+filePDF.getAbsolutePath());
        }

        List<File> fileImageList = new ArrayList<File>();
        try {
            for( Image i : renderedImageList ){
                File f = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + filePDF.getName() + "_" + renderedImageList.indexOf( i ) + sdf.format( new Date() ) + ".png" ); 
                ImageIO.write((RenderedImage) i, "png", f);
                fileImageList.add( f );
            }
        } catch (Exception e) {
            throw e;
        }

        return fileImageList;
    }

}
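
The two switches above are read with `ResourceBundle.getString`, which throws `MissingResourceException` when a key is absent. A small sketch of a more forgiving lookup that treats a missing key as `false`. The helper class `ConfigFlags` and the sample values are assumptions; only the key names come from the answer's code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical helper for reading the boolean switches used by PDFToImage.
final class ConfigFlags {

    // Returns the flag value, or false when the key is absent from the bundle.
    static boolean isEnabled(ResourceBundle bundle, String key) {
        return bundle.containsKey(key) && Boolean.parseBoolean(bundle.getString(key).trim());
    }

    // Builds a bundle from in-memory text so the sketch runs without a file on the classpath.
    static ResourceBundle fromString(String content) {
        try {
            return new PropertyResourceBundle(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed content of properties/configuration.properties.
        ResourceBundle bundle = fromString(
                "PDF_STAMP_IMAGE=true\nPDF_REMOVE_IMAGE=false\nPDF_STAMP_METHOD=SIMPLE\n");
        System.out.println("stamp:  " + isEnabled(bundle, "PDF_STAMP_IMAGE"));
        System.out.println("remove: " + isEnabled(bundle, "PDF_REMOVE_IMAGE"));
    }
}
```

In the real program the bundle would still come from `ResourceBundle.getBundle(...)`; only the lookup changes.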

1) Removing all PDF images using PDFBox

At first this method seemed to be the final solution, because the output PDF was perfect. However, when the resulting PDF was converted to an image through GHOST4J, the page lost its formatting, dropping important fields such as CPF / CNPJ as well as all special characters.

package core;

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public class PDFRemoveImage {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    @SuppressWarnings("rawtypes")
    public static File convert(File in) throws Exception {
        String out = "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + in.getName() + "_" + sdf.format( new Date() ) + ".pdf";

        PDDocument doc = PDDocument.load(in);

        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ ) {
            PDPage page = (PDPage)pages.get( i );

            COSDictionary newDictionary = new COSDictionary(page.getCOSDictionary());

            PDFStreamParser parser = new PDFStreamParser(page.getContents());
            parser.parse();
            List tokens = parser.getTokens();
            List newTokens = new ArrayList();
            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );

                if( token instanceof PDFOperator ) {
                    PDFOperator op = (PDFOperator)token;
                    if( op.getOperation().equals( "Do") ) {
                        COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                        deleteObject(newDictionary, name);
                        System.out.println( name.getName() );
                        continue;
                    }
                }
                newTokens.add( token );
            }
            PDStream newContents = new PDStream( doc );
            ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
            writer.writeTokens( newTokens );
            newContents.addCompression();

            page.setContents( newContents );

            PDResources newResources = new PDResources(newDictionary);
            page.setResources(newResources);
        }

        doc.save(out);
        doc.close();

        return new File( out );
    }

    private static boolean deleteObject(COSDictionary d, COSName name) {
        for(COSName key : d.keySet()) {
            if( name.equals(key) ) {
                d.removeItem(key);
                return true;
            }
            COSBase object = d.getDictionaryObject(key); 
            if(object instanceof COSDictionary) {
                if( deleteObject((COSDictionary)object, name) ) {
                    return true;
                }
            }
        }
        return false;
    }
}

2) Placing an image on top of the PDF using iText

After some more time, I arrived at this solution, which places an image over the problematic one. I opted for a black square, and the rest of the program worked perfectly!

package core;

import java.io.File;
import java.io.FileOutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.ResourceBundle;

import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class PDFStamper {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");
    private static final ResourceBundle properties = ResourceBundle.getBundle("properties.configuration");

    @SuppressWarnings("rawtypes")
    public static File convert(File in, Class clazz) throws Exception {
        File out = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + in.getName() + "_" + sdf.format( new Date() ) + ".pdf" );
        try {
            PdfReader pdfReader = new PdfReader( in.getAbsolutePath() );

            PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileOutputStream(out));

            Image image = Image.getInstance( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + "replacer.png" );
            for(int i=1; i<= pdfReader.getNumberOfPages(); i++){
                PdfContentByte content = pdfStamper.getOverContent(i);
                if( properties.getString("PDF_STAMP_METHOD").equals("SIMPLE") ){
                    image.setAbsolutePosition(40f, 725f);
                } else if( properties.getString("PDF_STAMP_METHOD").equals("TEMPLATE") ){
                    image.setAbsolutePosition(0f, 0f);
                }
                content.addImage(image);
            }

            pdfStamper.close();

            return out;
        } catch (Exception e) {
            e.printStackTrace();
            throw e;
        }
    }
}
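
The same masking idea can also be applied one step later, on the PNG that GHOST4J renders, instead of on the PDF. This is not what the answer does; it is an alternative sketch using only `java.awt`. The class name `RegionMasker` and the logo size used in the usage note are assumptions. PDF coordinates are in points with the origin at the bottom-left, while image coordinates are in pixels with the origin at the top-left, so the rectangle has to be converted.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper: hides a rectangular area of a rendered page before OCR.
final class RegionMasker {

    // Returns a copy of the page with the given pixel rectangle painted in one solid colour.
    static BufferedImage mask(BufferedImage page, Rectangle region, Color fill) {
        BufferedImage copy = new BufferedImage(
                page.getWidth(), page.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = copy.createGraphics();
        try {
            g.drawImage(page, 0, 0, null);
            g.setColor(fill);
            g.fillRect(region.x, region.y, region.width, region.height);
        } finally {
            g.dispose();
        }
        return copy;
    }

    // Converts a rectangle in PDF points (origin bottom-left) to image pixels (origin top-left).
    static Rectangle pointsToPixels(double x, double y, double w, double h,
                                    double pageHeightPt, int dpi) {
        int px = (int) Math.floor(x * dpi / 72.0);
        int py = (int) Math.floor((pageHeightPt - y - h) * dpi / 72.0);
        int pw = (int) Math.ceil(w * dpi / 72.0);
        int ph = (int) Math.ceil(h * dpi / 72.0);
        return new Rectangle(px, py, pw, ph);
    }
}
```

For a page rendered at 300 dpi, a call such as `RegionMasker.mask(ImageIO.read(pngFile), RegionMasker.pointsToPixels(40, 725, 72, 72, 842, 300), Color.BLACK)` would cover a 72 x 72 point box at the position used by the `SIMPLE` stamp method. The box size and the A4 page height of 842 points are illustrative values.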

It is worth noting that I adopted the second option as the definitive one, but with a configuration that each user can adjust: in my case I had problems with only one image at a fixed position, but if you need to read several files with different layouts, you can use templates, creating a replacer.png the size of your PDF page.
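
For the template variant, the replacer image has to be as large as the page and opaque only where content should be hidden. A rough sketch of generating such a file with `java.awt`, assuming an A4 page of 595 x 842 points and a made-up logo rectangle; `TemplateBuilder` is a hypothetical name. Whether an unscaled image is placed by iText at one pixel per point should be checked against the iText version in use.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: creates a page-sized, mostly transparent template image.
final class TemplateBuilder {

    // Builds a transparent image with opaque black boxes over the given areas.
    // Each area is {x, y, width, height} in image coordinates (origin top-left).
    static BufferedImage build(int pageWidth, int pageHeight, int[][] areas) {
        BufferedImage img = new BufferedImage(pageWidth, pageHeight, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(Color.BLACK);
            for (int[] a : areas) {
                g.fillRect(a[0], a[1], a[2], a[3]);
            }
        } finally {
            g.dispose();
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        // Assumed A4 page size and an assumed logo box near the top-left corner.
        BufferedImage template = build(595, 842, new int[][] {{40, 45, 120, 72}});
        ImageIO.write(template, "png", new File("replacer.png"));
    }
}
```

Stamped at position (0, 0) as in the `TEMPLATE` branch, such an image would leave the rest of the page visible through its transparent background.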

Remarks:

  • As an improvement, I would like to implement a way to group the PDF files so that the right template is used for each;
  • You could also try to find every image in the PDF and replace it with the replacer, using exactly the original image's height, width and position on the page;
  • Or you could try to improve the PDFBox approach so that only the images are removed, without side effects on the PDF or on the process;
  • I plan to make the full code available on GitHub soon; when I do, I will leave the link here.
