Crawling
Search engines use a system of crawling (Wikipedia has detailed information on it). In the background they browse the web as if they were a user, except that the "user" is a robot: software that keeps making HTTP requests and storing the content it finds somewhere.
So there are basically two ways a page can get in: either someone manually submits an address for the crawler to visit, or the crawler finds a link in an HTML page (or possibly some other type of resource) and uses that address as the starting point for another fetch.
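As a rough illustration of that loop, here is a minimal sketch of a crawler frontier: a queue of URLs to visit, a fetch step, a store, and naive link discovery. The seed URL, the regex-based link extractor, and the in-memory store are all assumptions made just for the example.

```python
import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(seed_url, max_pages=50):
    """Very small breadth-first crawler: fetch, store, follow links."""
    frontier = deque([seed_url])   # URLs waiting to be visited
    visited = set()                # URLs already fetched
    store = {}                     # url -> raw HTML ("stored somewhere")

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # unreachable or non-text content: skip it
        store[url] = html
        # Naive link discovery; a real crawler parses the HTML properly.
        for href in re.findall(r'href="([^"#]+)"', html):
            frontier.append(urljoin(url, href))
    return store

pages = crawl("https://example.com")
print(len(pages), "pages fetched")
```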
Of course it is possible to develop other forms of discovery, such as analyzing newly registered domains, for example. That would take a lot of trial and error to turn up anything new, and I doubt it is an interesting technique. Curiously, as an aggressive competitive strategy, it may be interesting to query another search engine and use it as a reference :)
How long it takes for content to appear depends on what you want from your search engine. In theory it is possible to do this practically in real time, but that would probably waste resources. The more sophisticated engines have a way to decide whether to revisit a page more or less frequently based on statistics.
In addition to the revisit policy, the robot usually has content selection policies (which should respect the content producer's wish not to have certain visible parts indexed) and a policy on how aggressively to make requests.
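In practice, respecting the producer usually means honoring robots.txt and throttling requests. A minimal sketch of that policy check using Python's standard urllib.robotparser; the user-agent name and the fallback delay are invented for the example.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "my-toy-crawler"   # hypothetical bot name

def allowed_and_delay(url):
    """Check robots.txt before fetching and find out how long to wait."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(root + "/robots.txt")
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 1.0  # fall back to 1s between hits
    return rp.can_fetch(USER_AGENT, url), delay

ok, delay = allowed_and_delay("https://example.com/some/page")
if ok:
    time.sleep(delay)   # politeness: do not hammer the server
    # ... fetch the page here ...
```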
Parsing
Obviously the system needs to know how to parse HTML, at least to capture links and validate them. This same step can be used for other things that are useful to the search engine.
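For HTML, the standard library's html.parser is enough to sketch this: capture outgoing links and the visible text that will later be indexed. Real engines use far more robust parsers; the class below is just an illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

parser = PageParser("https://example.com/")
parser.feed("<p>Hello <a href='/about'>about us</a></p><script>x=1</script>")
print(parser.links)                 # ['https://example.com/about']
print(" ".join(parser.text_parts))  # 'Hello about us'
```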
Today it is common to go beyond HTML: PDFs, TXT and other document types, and images are indexed, and JavaScript is even executed.
Very modern and complex analysis techniques are necessary for the way the web operates today. It is already necessary to do semantic analysis of the content to decide what to do with it.
Indexing
The collected data is stored in some form that can be queried. In general a textual index is built, an inverted index as some like to call it. This is similar to what many databases have, or to how tools such as Lucene work.
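An inverted index is essentially a map from each term to the documents where it appears, which is what makes text queries fast. A minimal in-memory sketch; the tokenizer, the sample documents and the AND-only search are invented for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase word tokenizer; real engines also stem, drop stop words, etc."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> set of doc_ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND query: documents containing every term."""
    sets = [index.get(t, set()) for t in tokenize(query)]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "How search engines crawl the web",
    2: "Crawling and indexing the web at scale",
    3: "Cooking recipes for the weekend",
}
index = build_index(docs)
print(search(index, "crawl web"))   # {1}  ('crawl' vs 'crawling' shows why stemming matters)
```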
It will certainly be placed in a database, but probably not a SQL one. It will be some kind of NoSQL store, most likely built for that particular need. It is not that SQL cannot be used, but it is usually not the most appropriate.
Another reason it takes time for content to appear is that the database does not have to be updated all the time. New pages come in constantly and reindexing may be too expensive to run continuously, so there is usually a policy of delaying index updates and doing them in batches.
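A sketch of that batching policy: newly crawled documents are buffered and only merged into the searchable index when the buffer is large enough or old enough. The thresholds and the simplistic merge step are assumptions for the example.

```python
import time
from collections import defaultdict

class BatchedIndexer:
    """Buffers new documents and merges them into the index in batches."""
    def __init__(self, batch_size=1000, max_age_seconds=300):
        self.index = defaultdict(set)     # term -> doc_ids (the "live" index)
        self.pending = []                 # (doc_id, text) waiting to be indexed
        self.batch_size = batch_size
        self.max_age = max_age_seconds
        self.oldest_pending = None

    def add(self, doc_id, text):
        if not self.pending:
            self.oldest_pending = time.time()
        self.pending.append((doc_id, text))
        if (len(self.pending) >= self.batch_size
                or time.time() - self.oldest_pending >= self.max_age):
            self.flush()

    def flush(self):
        """Merge all pending documents into the index at once."""
        for doc_id, text in self.pending:
            for term in text.lower().split():
                self.index[term].add(doc_id)
        self.pending.clear()

indexer = BatchedIndexer(batch_size=2)
indexer.add(1, "new page about crawling")
indexer.add(2, "another new page")       # hits batch_size, triggers flush
print(sorted(indexer.index["new"]))      # [1, 2]
```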
Searching
Here is the part that gives access to the data in the form that the user needs.
How it finds results and which relevance criteria it applies also depend on how you want it to work. It is known that Google puts a lot of weight on the number of links pointing to a page.
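The classic formulation of that idea is PageRank: a page is important if important pages link to it. Below is a tiny power-iteration sketch over a made-up link graph; Google's real ranking combines many more signals, so this only illustrates the link part.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling pages are simply ignored in this sketch
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: most pages point to 'home'.
graph = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "post": ["home"],
}
for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```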
In general the search is distributed. It is customary to use the MapReduce technique (not to be confused with the map() and reduce() functions that some libraries have).
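The idea of MapReduce: a map step emits key/value pairs from each piece of input, the pairs are grouped by key (the shuffle), and a reduce step combines each group; a framework runs these steps across many machines. A single-process sketch of the pattern, here counting terms; the distribution and fault tolerance are exactly what frameworks such as Hadoop add on top.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (term, 1) for every word in the document."""
    for term in text.lower().split():
        yield term, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework over the network)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key; here, total occurrences of a term."""
    return key, sum(values)

docs = {1: "crawl the web", 2: "index the web pages"}
emitted = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
results = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
print(results)   # {'crawl': 1, 'the': 2, 'web': 2, 'index': 1, 'pages': 1}
```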
Searches may be cached, which can also delay the appearance of new content, but only slightly and in very specific cases.
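A sketch of such a result cache with a time-to-live, so a repeated query is served from memory and only picks up new content after the entry expires; the TTL value and the stand-in backend function are assumptions for the example.

```python
import time

CACHE_TTL_SECONDS = 60          # assumption: how long a cached result stays fresh
_cache = {}                     # query -> (timestamp, results)

def run_search_backend(query):
    """Stand-in for the real (expensive, distributed) search."""
    return [f"result for '{query}'"]

def cached_search(query):
    now = time.time()
    hit = _cache.get(query)
    if hit and now - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                      # served from cache: no new content visible
    results = run_search_backend(query)
    _cache[query] = (now, results)
    return results

print(cached_search("web crawler"))   # computed
print(cached_search("web crawler"))   # cached for the next 60 seconds
```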
Content can be exposed in various ways, and the ways of filtering and normalizing it can be more or less sophisticated. For example, it is often easier to find content on the Stack Exchange network through Google than through SE's own search. But there are cases where SE is more specialized and does a better job.
Conclusion
It is possible to use ready-made tools for these tasks, but at large volume it is not only more interesting to build your own, better-suited ones; you also need infrastructure for distributing the crawl and the search, for indexing, and for accessing the database for consumption. If the indexing is much more restricted, then something ready-made may be worth it.
Since I know the OP works with .NET, there are ready-made tools for this platform that do the bulk of the work (not the crawling, although there is probably something less well known available for that; it is not complicated anyway). The best known build on Solr: SolrNet and SolrSharp. There is also something for Elasticsearch (see).
Part of the challenge is the engineering work of the search itself (how to be relevant). But most of the challenge is scaling all of this.
Google teaches how it works.
Techniques that Google uses.
A spider (web crawler) is a robot that scans the internet, reading, collecting and filtering sites! Developed in Perl, or even in C and C++, these are systems that navigate the Internet in a methodical and automated way.
– Ikaro Sales