I can't tell you exactly how Google's specific algorithm works, but I can help with the basic principles of searching images by their content.
This area is commonly called CBIR (Content-Based Image Retrieval), because it involves searching for images based on their content. It is an area of great interest but hard to solve, because images are difficult for computers to interpret (which is precisely why CAPTCHAs are so widely used to prevent automated access to websites).
There are some traditional approaches to solving this problem.
1. Annotation-Based Search
The simplest idea is to annotate images with descriptive labels of their content. You just create a database containing the images and the text (labels) that describes them. Imagine, for example, the image of a puppy. The database might contain for it the words "puppy", "dog", "labrador" (if applicable), and so on. The search is then performed on the annotations, and the returned images are those whose annotations best match the search term. I suppose this is the main algorithm Google uses when you search for images with text expressions.
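Just to make the idea concrete (this is a toy sketch I am adding, not anything Google actually does), an annotation-based search can be as simple as a table mapping each image to its set of labels, with results ranked by how many search terms match:

```python
# Toy sketch of annotation-based search (illustrative only).
# Each image is identified by a file name and associated with a set of descriptive labels.
annotations = {
    "img001.jpg": {"puppy", "dog", "labrador"},
    "img002.jpg": {"cat", "kitten"},
    "img003.jpg": {"dog", "beach"},
}

def search(query):
    """Return images ordered by how many query terms appear in their labels."""
    terms = set(query.lower().split())
    scored = []
    for image, labels in annotations.items():
        score = len(terms & labels)          # number of matching labels
        if score > 0:
            scored.append((score, image))
    # Most relevant first (more matching labels)
    return [image for score, image in sorted(scored, reverse=True)]

print(search("labrador puppy"))  # ['img001.jpg']
```

A real system would of course add relevance weights, synonyms and so on, but the principle is just this: match the query against the stored labels.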
The great difficulty with this method is that manually annotating images is laborious and tedious. That is why there are partially automated solutions. Google very likely uses human reinforcement from the searches people make. You type some text ("puppy", in the previous example, a word that does not yet appear as an annotation) and it displays more and less "relevant" images based on the links between them. If you click on a less relevant image (one that does not yet have this label), that can be taken as an indication that you, as a human, have validated that the label also applies to it. If more people do the same, Google can automatically add that label to the image and increase its relevance as these specific searches keep occurring.
Another idea is to use sources of labeling maintained by humans, such as Flickr and/or Pinterest. Although labeling is a boring task, people who enjoy photography take pleasure in doing it in that context. In addition, the labeling is community work, which makes it easier for new labels to appear.
There is also the idea of using games to automate labeling. The idea is very simple but very clever: in a guessing game, two people (who do not necessarily know each other) play over the Internet. The game shows a photo (of a labrador puppy, for example) to one of the players, and the other player (who does not see the photo) needs to guess it by asking questions that can only be answered with yes or no by the partner. The questions answered with yes (for example, "Is it an animal?") automatically become labels, and their relevance comes from the number of occurrences across different matches. I unfortunately lost the reference to this particular game, but there are numerous others such as the ESP Game (which is very similar) and the Name-It-Game (which annotates by region).
2. Similarity-Based Search
One of the problems with the previous approach is that, even when the images are very well labeled, there can still be important semantic difficulties. A classic example is a search for the term "pig". It can return images that are labeled correctly but are very different from each other and do not necessarily represent what the user had in mind at the moment of the search. Here is a classic example (at least here in Brazil):
The idea of this second approach is to improve the search by avoiding text altogether and instead looking for images that are similar to a reference image provided by the user (a feature that Google Images also offers).
The great difficulty of this approach is precisely defining similarity. There are three traditional forms (and I suppose Google uses a mixture of the first two): similarity by color, similarity by texture, and the location of reference points (keypoints).
For color similarity, the algorithm can compute a numerical measurement of the colors in the reference image and then search a database where that measurement is already pre-calculated (after all, doing this calculation at every search would be computationally impractical). There are numerous algorithms, but the simplest is the following (the original source of this suggestion is this answer on SOEN; a rough code sketch follows after the steps):
- Scale the images down to a small size (64x64, for example, so that details matter less in the comparison).
- Scale the colors to the same range (so that in both images the darkest color becomes black and the lightest becomes white).
- Rotate and flip the images so that the lightest and darkest corners are in standardized positions (lightest in the upper left and darkest in the lower right corner, for example), so that both end up in the same orientation.
- Then take the difference between each pair of pixels and compute the average difference. The closer to zero, the more similar the images.
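To illustrate, here is a rough sketch of those steps in Python using Pillow and NumPy (libraries not mentioned in the original suggestion; the rotate/flip normalization step is left out to keep the example short, and the file names are hypothetical):

```python
# Rough sketch of the pixel-difference comparison described above.
# Assumes Pillow and NumPy; the rotate/flip orientation step is omitted for brevity.
import numpy as np
from PIL import Image, ImageOps

def preprocess(path, size=(64, 64)):
    img = Image.open(path).convert("L")      # grayscale simplifies the comparison
    img = img.resize(size)                   # small size: details matter less
    img = ImageOps.autocontrast(img)         # stretch range so darkest=black, lightest=white
    return np.asarray(img, dtype=np.float64)

def mean_difference(path_a, path_b):
    a, b = preprocess(path_a), preprocess(path_b)
    return np.mean(np.abs(a - b))            # closer to zero => more similar

# Example usage (hypothetical file names):
# print(mean_difference("reference.jpg", "candidate.jpg"))
```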
It is also common to combine color comparison with the locations and/or shapes in which the colors occur. Since the average difference from the previous algorithm yields a single value for the whole image, the comparison ends up being something like what you described in the question: a comparison by overall color predominance. In that case, an alternative is, rather than having a single average difference value, to have a quadtree with mean values for different regions (and sub-regions) of the image. The comparison by similarity is still by color predominance, but now in specific regions, which tends to produce more accurate results.
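As a hedged illustration of this idea, the sketch below uses a fixed 4x4 grid of region means instead of a real quadtree (a simplification I am assuming just to keep the code short):

```python
# Simplified region-based comparison: a fixed 4x4 grid of mean values per region
# instead of a full quadtree (an illustrative simplification, not the exact structure).
import numpy as np
from PIL import Image

def region_means(path, size=64, blocks=4):
    pixels = np.asarray(Image.open(path).convert("L").resize((size, size)),
                        dtype=np.float64)
    step = size // blocks
    return np.array([pixels[i:i + step, j:j + step].mean()
                     for i in range(0, size, step)
                     for j in range(0, size, step)])

def region_difference(path_a, path_b):
    a, b = region_means(path_a), region_means(path_b)
    return np.mean(np.abs(a - b))   # still "closer to zero => more similar", but per region
```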
In the not too distant past there was a commercial product from IBM called QBIC, which was even used in museums to make it easier to search for paintings: you just drew more or less the picture you wanted (light blue sky, dark blue sea, yellow sand, for example) and it found similar images. Unfortunately I have not found museums that still use this technology, but there is this example of a search engine called Retrievr that does exactly the same thing, looking on Flickr for images similar to the one you draw. :)
There is also another algorithm widely used to define similarity: the comparison of color histograms. A histogram is nothing more than a frequency distribution, that is, a count of the number of occurrences of something (each pixel value, for example) in a context (the image, for example). The color histogram does not separate occurrences by individual pixel value, but rather by each color in a color sequence (which you define yourself; it does not need to be only RGB). Similar images have very similar distributions, and the advantage is that when the histogram is normalized (that is, converted to the percentage of each color, so that the total number of pixels does not matter) the comparison becomes independent of rotation and scale in the images. See this other SOEN answer for more details on the implementation (it includes a reference to slides that help with obtaining histograms in OpenCV, for example).
So you can implement this algorithm as follows (a code sketch follows after the steps):
- Calculate the color histogram for both images (in each image, count the pixels and accumulate that count into the color "categories" you defined; in the Wikipedia example the pixels are counted for each RGB combination, but with the pixel values scaled to the range [0, 3] for simplicity, that is, to reduce the table size). Of course, for the images in the database these values can (should!) be pre-calculated.
- Normalize the values in each category by dividing each value by the total number of pixels. This turns the values into percentages, which are easier to compare.
- Compare the histograms of the two images by measuring, for example, the average error. For each category, take the difference between the values in the two images; then take the average of the differences across all categories. The more similar the images, the closer to zero this value will be.
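A possible sketch of those three steps with OpenCV's cv2.calcHist and NumPy (the number of bins per channel is an arbitrary choice of mine, not something prescribed by the algorithm):

```python
# Sketch of normalized color-histogram comparison (bin count is an arbitrary choice).
import cv2
import numpy as np

def color_histogram(path, bins=8):
    img = cv2.imread(path)                       # BGR image
    hist = cv2.calcHist([img], [0, 1, 2], None,  # one 3D histogram over the B, G, R channels
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / hist.sum()                     # normalize: percentage of pixels per category

def histogram_difference(path_a, path_b):
    ha, hb = color_histogram(path_a), color_histogram(path_b)
    return np.mean(np.abs(ha - hb))              # average error: near zero => similar

# print(histogram_difference("reference.jpg", "candidate.jpg"))
```

OpenCV also ships cv2.compareHist with ready-made metrics (correlation, chi-square, etc.), which can replace the manual average-error comparison above.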
Similarity by texture seeks to define the (mathematical) pattern of the texture in the image. You can imagine something like the "light-dark-light-dark-..." sequence that occurs in an image of a tiled floor, for example. Something widely used in this case are Gabor filters. These filters are very sensitive to edge variations in the image (edges are important details) in different orientations (vertical, horizontal, diagonal, etc.). They are considered very close to how the human eye responds to the same visual stimuli, but regardless of that, the fact is that they are very useful for describing the texture of an image.
On Patrick Fuller's blog there is a fantastic 6-part tutorial (in English) that explains very accessibly (and with examples in C++) the principles of image processing all the way up to the Gabor filter. Worth the read!
Thus, it is possible to extract this pattern from an image and then use it in searches. It is also very common to use machine learning algorithms to learn the pattern from a set of sample images and use it to classify new images (as being something "already known"). This Python example demonstrates exactly that, using the Scikit-Learn library. It learns the pattern from images of bricks, grass and a wall, and uses it in the match function to compare the texture characteristics (features) of two images (using least squares to define the "error": the closer to zero, the more similar the images).
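That example is not reproduced here, but a minimal sketch of the same kind of texture comparison can be done with OpenCV's cv2.getGaborKernel; the filter parameters below are illustrative guesses, not tuned values:

```python
# Sketch of texture description with a small bank of Gabor filters
# (kernel size and filter parameters are illustrative, not tuned).
import cv2
import numpy as np

def gabor_features(path, orientations=4):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    features = []
    for k in range(orientations):
        theta = k * np.pi / orientations          # 0, 45, 90 and 135 degrees
        # getGaborKernel(ksize, sigma, theta, lambda, gamma, psi)
        kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0)
        response = cv2.filter2D(img, cv2.CV_32F, kernel)
        features.extend([response.mean(), response.var()])  # summarize each filter response
    return np.array(features)

def texture_difference(path_a, path_b):
    fa, fb = gabor_features(path_a), gabor_features(path_b)
    return np.sum((fa - fb) ** 2)                 # least squares: near zero => similar texture
```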
Finally, there is also the possibility of comparing images directly in terms of the details of their content, in order to find images that are similar in terms of those occurrences. The idea is to first find the most relevant edge characteristics (features) in the images, and then check whether the same characteristics occur in both images (the reference image and the one being compared in the search).
This approach is mostly used to align partial images (to automatically merge them and create a 360° panorama, for example), but it could also be used to compare content in a search. There are ready-made examples in OpenCV, both for identifying the most relevant characteristics and for locating them between images (matching):
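In the spirit of those OpenCV examples, here is a short sketch using ORB (a free detector/descriptor available in OpenCV) with brute-force matching; the "good match" distance threshold is an arbitrary assumption:

```python
# Sketch of keypoint detection and matching with ORB + brute-force matcher.
# The 'good match' threshold below is an arbitrary assumption.
import cv2

def count_matches(path_a, path_b, max_distance=50):
    img_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    img_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create()                                    # detector + descriptor
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)

    good = [m for m in matches if m.distance < max_distance]  # keep only close descriptors
    return len(good)                                          # more good matches => more shared content

# print(count_matches("reference.jpg", "candidate.jpg"))
```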