How does a search engine work?

To be clear: I am not asking how to make my site rank better in a search index.

Today there are several search engines: some very well known, such as Google and Bing, and others not so much.

As far as we all know, these sites "scan" the internet and list sites according to the content searched for.

My question is: how do you develop a search engine?

Let's say I want to develop my own search engine to compete with Google. How should I proceed? Where does the data about the sites come from?

Using Google as an example: when we create a website, it does not appear in searches right away (I don't mean DNS propagation, but the searches made on Google); we have to wait a certain time. Why this delay? Is the data saved in some database?

  • A web spider is a robot that scans the internet, reading, collecting and filtering sites. They are developed in Perl, or even C and C++: systems that navigate the internet in a methodical, automated way.

4 answers



Crawling

Search engines use a crawling system (there is detailed information on Wikipedia). In essence they browse just as a user would, except that it is a robot: software that keeps making HTTP requests and storing the content it finds somewhere.

So there are basically two ways a page can get in: either someone manually submits an address to be crawled by that engine, or the crawler finds a link in an HTML page (or possibly in some other type of resource) and uses that address as the starting point for another fetch.
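A minimal sketch of that loop in Python, using only the standard library (the link extraction here is a naive regular expression; the Parsing section below does it properly, and a real crawler also needs politeness, deduplication, error handling and parallelism):

```python
# Minimal crawl loop: a frontier seeded by hand, then fed by links found in pages.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

HREF_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)  # naive; a real parser is better

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)          # way 1: addresses submitted manually
    seen, store = set(seeds), {}
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # skip pages that fail to download
        store[url] = html            # "storing the content found somewhere"
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)            # way 2: links discovered in the page
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return store

# pages = crawl(["https://example.com/"])
```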

Of course it is possible to develop other forms of discovery, such as analyzing newly registered domains, for example. That would take a lot of trial and error to turn up anything new, and I doubt it is an interesting technique. Curiously, as an aggressive competitive strategy, it may be worth querying another search engine as a reference :)

The time it takes for content to appear depends on what you want from your search engine. In theory it is possible to do this practically in real time, but that would probably waste resources. The more sophisticated engines have a way of deciding, based on statistics, whether to revisit a page more or less frequently.
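As a rough illustration of such a revisit policy (my own toy heuristic, not how any real engine schedules visits): shorten the interval when the page changed since the last visit, lengthen it when it did not.

```python
# Toy revisit policy: adjust the per-page interval based on whether the page changed.
# The factors and bounds are arbitrary assumptions for illustration.
def next_interval(current_hours, changed, lo=1, hi=24 * 30):
    if changed:
        return max(lo, current_hours / 2)   # changed: come back sooner
    return min(hi, current_hours * 2)       # stable: come back later
```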

In addition to the revisit policy, the robot usually has content selection policies worth adopting (respecting the content producer's wish not to have certain visible parts indexed) and a policy on how aggressive its requests may be.
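The standard way a site expresses that wish is robots.txt. A sketch of honoring it, plus a crude delay between requests, using Python's standard library (the user-agent name is made up):

```python
# Respect the producer's crawl rules (robots.txt) and throttle requests.
import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                  # downloads and parses the rules

if rp.can_fetch("MyCrawler/0.1", "https://example.com/some/page"):
    time.sleep(1)                          # crude politeness delay between requests
    # ... fetch the page here ...
```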

Parsing

Obviously the system needs to know how to parse HTML, at least to capture the links and validate them. The same step can also extract other things that are useful to the search engine.
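A sketch of that link capture and validation with Python's built-in HTML parser (assuming plain HTML; real engines must also cope with broken markup and other formats):

```python
# Extract and validate links from an HTML document using the standard parser.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base_url, href)
                if urlparse(absolute).scheme in ("http", "https"):  # validate
                    self.links.append(absolute)

collector = LinkCollector("https://example.com/page")
collector.feed('<p>See <a href="/other">this</a> and <a href="mailto:x@y">not this</a>.</p>')
print(collector.links)   # ['https://example.com/other']
```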

Today it is common to index more than HTML: PDF, TXT and other document types, images, and even to run JavaScript.

Very modern and complex analysis techniques are needed for the way the web operates today; it is already necessary to do semantic analysis of the content to decide what to do with it.

Indexing

The collected data is then stored in some form that can be queried. In general a textual index is built, an inverted index as some like to call it. This is similar to what many databases offer and to how tools like Lucene work.
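A minimal sketch of such an inverted index in Python, the same idea Lucene implements with far more sophistication (the tokenization is deliberately naive):

```python
# Inverted (textual) index: for each term, the set of documents containing it.
from collections import defaultdict

def tokenize(text):
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(docs):                     # docs: {doc_id: text}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

docs = {1: "How search engines work", 2: "Search is indexing plus ranking"}
index = build_index(docs)
print(index["search"])                     # {1, 2}
```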

It will certainly go into a database, but probably not a SQL one. It will be some kind of NoSQL store, most likely built for that particular need. It is not that SQL cannot be used; it is just usually not the most appropriate.

Another reason it takes time for content to appear is that this database does not have to be updated all the time. New pages come in constantly and reindexing can be too expensive to run continuously, so there is usually a policy of delaying index updates and applying them in batches.
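One way to picture that batching, as a sketch only (the thresholds are arbitrary assumptions): accumulate newly crawled documents and merge them into the index only when the batch is large enough or old enough.

```python
# Delay index updates and apply them in batches.
import time

class BatchedIndexer:
    def __init__(self, index, batch_size=1000, max_delay_s=3600):
        self.index, self.pending = index, {}
        self.batch_size, self.max_delay_s = batch_size, max_delay_s
        self.last_flush = time.time()

    def add(self, doc_id, text):
        self.pending[doc_id] = text
        too_many = len(self.pending) >= self.batch_size
        too_old = time.time() - self.last_flush > self.max_delay_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        for doc_id, text in self.pending.items():
            for term in text.lower().split():      # naive tokenization
                self.index.setdefault(term, set()).add(doc_id)
        self.pending.clear()
        self.last_flush = time.time()

# indexer = BatchedIndexer(index={}, batch_size=2)
```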

Searching

This is the part that gives access to the data in the form the user needs.

How results are found, and which relevance criteria are applied, also depend on how you want the engine to work. It is known that Google puts a lot of weight on the number of links pointing to a page.
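A toy ranking to make the idea concrete (the weighting is an arbitrary assumption, nothing like Google's real formula): combine how often the query terms occur in the document with how many pages link to it.

```python
# Toy relevance score: term matches weighted by inbound link count.
import math

def score(query_terms, doc_text, inbound_links):
    words = doc_text.lower().split()
    tf = sum(words.count(t.lower()) for t in query_terms)   # term frequency
    return tf * (1 + math.log(1 + inbound_links))            # more links => more weight

print(score(["search", "engine"], "a search engine indexes the web", inbound_links=42))
```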

In general the search is distributed. It is customary to use the MapReduce technique (not to be confused with the map() and reduce() functions some libraries have).
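A sketch of the MapReduce idea applied to building the index: the map step emits (term, doc_id) pairs for each document, possibly on many machines, and the reduce step groups them by term; here everything is simulated in a single process.

```python
# MapReduce-style construction of an inverted index, simulated in-process.
from collections import defaultdict

def map_phase(doc_id, text):
    return [(term, doc_id) for term in set(text.lower().split())]   # emit (key, value)

def reduce_phase(pairs):
    grouped = defaultdict(set)
    for term, doc_id in pairs:                                      # group by key
        grouped[term].add(doc_id)
    return grouped

docs = {1: "web search engine", 2: "search index"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
print(dict(reduce_phase(pairs))["search"])                          # {1, 2}
```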

Search results may be cached, which can also delay the appearance of new content, but only slightly and in very specific cases.
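A sketch of such a result cache, assuming a simple per-query time-to-live (the numbers are illustrative):

```python
# Cache search results for a short time; stale entries are recomputed.
import time

class QueryCache:
    def __init__(self, ttl_seconds=300):
        self.ttl, self.entries = ttl_seconds, {}

    def get(self, query, compute):
        hit = self.entries.get(query)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]                        # fresh enough: reuse
        result = compute(query)                  # otherwise run the real search
        self.entries[query] = (time.time(), result)
        return result

cache = QueryCache()
results = cache.get("how search engines work", compute=lambda q: ["doc1", "doc7"])
```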

Content can be exposed in various ways, and the filtering and normalization of content can be more or less sophisticated. For example, it is often easier to find content on the Stack Exchange network through Google than through SE's own search, although there are cases where SE's more specialized search does better.

Conclusion

It is possible to use ready-made tools for these tasks, but at large volume it is not only more interesting to build your own, better-suited ones; you also need infrastructure to distribute the crawling, the indexing and the access to the database for consumption. If the indexing needs are much more limited, then something ready-made may be worth it.

Since I know the OP works with .NET, there are ready-made tools for that platform that do the bulk of the work (not the crawling, although there is probably something less well known available; it is not complicated). The best known ones use Solr: SolrNet and SolrSharp. There is also support for Elasticsearch.

Part of the challenge is the engineering of the search itself (how to be relevant), but most of the challenge is scaling all of this.

Google teaches how it works.

Techniques that Google uses.


I understand there are three parties involved:

  • a crawler, which navigates pages, retrieving content and identifying references to other pages;
  • an indexer, which evaluates a page's content to determine what it contains and how relevant it is;
  • a search engine, which retrieves the pages that match the search and prioritizes the most relevant.

The crawler has to revisit pages regularly to update their content and discover new pages to retrieve. The delay between your page being published and it appearing in a search engine depends on that refresh interval and on your page being linked from one of the sites the crawler already knows about.
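Tying the three parties together, a deliberately naive sketch in Python (the function names and the fake pages are invented for illustration):

```python
# The three parties wired together: crawl -> index -> search (all deliberately naive).
from collections import defaultdict

def crawl(seed_pages):                 # crawler: here the "web" is a dict of fake pages
    return dict(seed_pages)            # a real crawler would fetch and follow links

def index(pages):                      # indexer: term -> set of page URLs
    idx = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            idx[term].add(url)
    return idx

def search(idx, query):                # search engine: pages containing every query term
    sets = [idx.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

pages = crawl({"http://a.example": "search engines crawl pages",
               "http://b.example": "engines rank pages"})
print(search(index(pages), "engines pages"))   # both URLs
```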

  • A simple way to understand it, thanks for the reply.


Randrade, to accomplish this you will need two tools: a crawler (a robot that scans (crawls) the internet for information) and an indexer (which organizes and catalogs the information collected by the crawler).

Possibly the most famous tool that plays these two roles is Apache Lucene/Solr, but you can use it only for the crawler role and another tool for indexing, for example SQL Server or PostgreSQL (using full-text search).
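To make the full-text-search-in-a-database idea concrete without installing a server, here is a sketch using SQLite's FTS5 module from Python (assuming your SQLite build includes FTS5; PostgreSQL's to_tsvector/to_tsquery or SQL Server's CONTAINS play the same role):

```python
# Full-text indexing inside a database, using SQLite FTS5 as a small stand-in.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(url, body)")
conn.executemany("INSERT INTO docs (url, body) VALUES (?, ?)", [
    ("http://a.example", "how search engines crawl and index the web"),
    ("http://b.example", "a page about cooking"),
])
rows = conn.execute("SELECT url FROM docs WHERE docs MATCH ?", ("search engines",)).fetchall()
print(rows)   # [('http://a.example',)]
```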

You can also use other options for the crawler; I believe a quick search on the internet will turn up the main competitors of Lucene.

Finally, you can look at the following demo project by Microsoft.

  • I didn't know these tools, and they interest me a lot. Thanks for the contribution.


The web crawler is an essential part of a search engine, in the case of web applications of course. https://nutch.apache.org/ is an example of a scalable web crawler; incidentally, it is a project derived from Lucene that gave rise to Hadoop many years ago, written by the same group of people. Nutch integrates with Solr, but there is no mystery in integrating it with other tools.
