Making an OCR with no dependencies in PHP

Viewed 7,963 times

6

I have a project where people schedule phrases to be posted on Tumblr. Today it works with Kindle files, and I'm already looking into Kobo files. Now I'd like help with the third part: I want to add an OCR reader, so that a person can upload a photo, the reader extracts the phrase, and the person can add it to the posting queue.

I am very much a beginner in PHP and struggle a bit, and my research has not been very successful. I found one plugin that only identifies text letter by letter, and another that needs to call exec, which is not an option because the person may not have the program installed, and for obvious security reasons.

Should I try this with another language? I'm betting on PHP first because I have some background in it, because the project is written in PHP, because my hosting is PHP, and because I know it has a library for handling images, so I assumed OCR would not be far off.

Is there a plugin or tutorial you can recommend?

3 answers

6


If you are able to install software on your server, Tesseract is an open-source OCR engine whose development has been sponsored by Google. There is a wrapper for PHP.

Otherwise, you can try web services such as Google Docs OCR.

  • Sorry, but wouldn't I then have to install Tesseract OCR on my server, or call it with exec? I don't have permission for either. Or can I just copy the Tesseract OCR "api" folder to my server?

  • 1

    @Marta This is a very limiting factor. You can try webservices like Google Docs OCR: https://developers.google.com/google-apps/documents-list/#uploading_documents_using_optical_character_recognition_ocr

  • Thanks @Onosendai. I like my hosting, but the difficulty of installing things has really limited me.

  • 3

    @Marta, a suggestion: why not rent a cheap VPS (Linode, DigitalOcean, AWS, etc.), implement a service on it that uses Tesseract (as a web service), and call it from your current hosting?

  • I looked into it and I think that will be the way to go. Thank you.

  • @Marta good luck, and feel free to ask if you have any questions.

3

Hmm, without an image processing module for PHP installed on the server you will struggle a bit. Check whether your host has the ImageMagick module: create a page that calls phpinfo() to see which modules are installed for your PHP. If ImageMagick is available, things get better, but not less complicated.

I can only guide you through the necessary steps; I don't know whether something ready-made exists. The challenge seems more interesting than just copying something from someone else. You will need some mathematical knowledge, mainly linear algebra, and if you want an algorithm that gets really close to perfection you will need neural networks.
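As an alternative to reading through the whole phpinfo() page, a short script can report the image extensions directly. This is only a minimal sketch: the helper name `available_image_extensions` is made up for illustration, and checking `gd` alongside `imagick` is my addition, since GD is the other image extension commonly found on shared hosting.

```php
<?php
// Hypothetical helper (not part of PHP): reports which image extensions are loaded.
function available_image_extensions(): array
{
    $report = [];
    foreach (['imagick', 'gd'] as $name) {
        // extension_loaded() is a core PHP function, so it works on any host.
        $report[$name] = extension_loaded($name);
    }
    return $report;
}

foreach (available_image_extensions() as $name => $loaded) {
    echo $name, ': ', $loaded ? 'available' : 'missing', PHP_EOL;
}
```

If `imagick` shows up as missing, cropping and pixel extraction would have to be done with whatever the host does provide.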

Let’s start with the most basic method possible:

  • Create vectors with the patterns of all letters and digits. You will need to crop each letter and digit and extract its pixels (use ImageMagick if available), then store them in whatever way you find convenient (text files, MySQL).
  • Now you have the basis for comparison. To compare sentences, texts or words against the stored patterns, again use ImageMagick, if available, to crop each letter of your text. Computationally, you scan the image horizontally, pixel by pixel, until you find the beginning and end of each letter. We are talking about something basic here, so assume that 99% of texts are black on a white background: walk until the white pixels end and mark the position, then walk until the black pixels end and mark the position. This tells you where to crop each letter or digit (start and end).
  • With the letter cut out of the text, extract its pixels, just as you did in the first step when building your database.
  • Now compare what was extracted from the text with your database. Linear algebra has a concept called a linear space (vector space): in this case each letter's pixels form a vector, and comparing vectors shows which pixels coincide most often, which is a simple way to measure which letter is the most similar.
  • Assemble each word based on this ranking (the larger the cosine between the two vectors, the better the match).
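The steps above can be sketched in plain PHP. This is a toy illustration under several assumptions, not a finished OCR: the bitmaps are hard-coded arrays (1 = black, 0 = white) standing in for pixels that ImageMagick would extract, the function names are invented for this example, letters are separated by at least one fully white column, and a cropped letter is only compared with templates of exactly the same size (a real version would rescale every crop to a fixed size first).

```php
<?php
// Find [start, end] column ranges that contain ink, scanning left to right.
function segment_columns(array $bitmap): array
{
    $width = count($bitmap[0]);
    $ranges = [];
    $start = null;
    for ($x = 0; $x < $width; $x++) {
        $hasInk = in_array(1, array_column($bitmap, $x), true);
        if ($hasInk && $start === null) {
            $start = $x;                   // white -> black: a letter begins
        } elseif (!$hasInk && $start !== null) {
            $ranges[] = [$start, $x - 1];  // black -> white: the letter ends
            $start = null;
        }
    }
    if ($start !== null) {
        $ranges[] = [$start, $width - 1];  // letter touching the right edge
    }
    return $ranges;
}

// Crop one column range and flatten it, row by row, into a pixel vector.
function crop_to_vector(array $bitmap, int $start, int $end): array
{
    $vector = [];
    foreach ($bitmap as $row) {
        $vector = array_merge($vector, array_slice($row, $start, $end - $start + 1));
    }
    return $vector;
}

// Cosine between two vectors; 0.0 when sizes differ or a vector is all zeros.
function cosine_similarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        return 0.0;
    }
    $dot = 0; $normA = 0; $normB = 0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    return ($normA == 0 || $normB == 0) ? 0.0 : $dot / sqrt($normA * $normB);
}

// Segment the image, then pick the template with the largest cosine per letter.
function recognize(array $bitmap, array $templates): string
{
    $text = '';
    foreach (segment_columns($bitmap) as [$start, $end]) {
        $glyph = crop_to_vector($bitmap, $start, $end);
        $best = '?';
        $bestScore = 0.0;
        foreach ($templates as $char => $template) {
            $score = cosine_similarity($glyph, $template);
            if ($score > $bestScore) {
                $best = $char;
                $bestScore = $score;
            }
        }
        $text .= $best;
    }
    return $text;
}

// The "database": flattened 3-row patterns for a made-up four-letter alphabet.
$templates = [
    'I' => [1, 1, 1],
    'L' => [1, 0, 1, 0, 1, 1],
    'J' => [0, 1, 0, 1, 1, 1],
    'T' => [1, 1, 1, 0, 1, 0, 0, 1, 0],
];

// An 8x3 "image" containing L, I and T, each separated by one white column.
$bitmap = [
    [1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 0],
];

echo recognize($bitmap, $templates), PHP_EOL; // prints "LIT"
```

With clean input every letter matches its template with a cosine of 1; with noisy scans the best score drops below 1, and that ranking is what picks the most similar letter.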

Well, that is a basic way to build an OCR with your own hands, without relying on third-party modules (except ImageMagick, used here to crop and extract pixels).

  • 1

    Whoa, what a beautiful answer! Thank you! I don't currently have the knowledge for this much, but one day I'll try to build something along these lines :)

2

You can use the PHP OCR class, which does learning and text recognition on images entirely in PHP, so you don't need to install anything on your server.

  • I tried to use this class, but it doesn't suit me. "It can recognize text in Monochrome Graphical images after a training Phase." That is, I need to "train" the algorithm to recognize letter by letter. It doesn't work with long text. But thanks for the tip :)

  • @Marta, there is a misunderstanding. Training is always necessary because the solution uses neural networks. After training, the class will recognize letters even in a typeface different from those used in training. It doesn't matter whether the text is long or short. The more you train, the better the class will work. Just don't expect any solution to work 100% in all cases, because none exists.

  • Yes, I have worked with OCR on the PC before and I know it is common to get small errors, such as confusing j, i and l. But I had not understood how I was going to train this class on all possible letters, in all possible fonts, to do what I need. As was said in another answer, I would have to compare the vectors of the letters with the characters themselves, so this class would help; but in my case, using a web service is the most appropriate. Thanks for your help :)

  • 1

    So that's the thing: you don't have to train with all the possible fonts, because that's not how neural network algorithms work. If you've already solved your problem with a web service, that's fine, but just so you understand, this web service has probably been trained in the same way. That is, it will be an equivalent solution, except that you depend on a web service that can sometimes be unavailable when you need it.

  • 1

    I agree with you. I think I need to study this better.
