How to detect if my site has been visited by a search engine?

0

I am using PHP and saw something about the variable $_SERVER['HTTP_USER_AGENT'], but I do not know how to detect visits from all search engines.
I would like to detect any search bot and send it the information it needs via an HTTP header. That is, my site will not have a physical robots.txt file.

  • Search engines take care of looking at the robots.txt themselves, if that is their goal.

  • That’s right. Using stristr(), for example, you can search for the bot's name: googlebot for Google, msnbot for MSN, and Slurp for... Slurp, that’s Yahoo’s bot!

  • It turns out I won’t have a real robots.txt on my site... I will generate one via an HTTP header.

2 answers

2


The most complete and valid option I’ve found so far is this:

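function isBot() {
    // A request without a User-Agent header is treated as "not a bot".
    // The pattern matches common crawler keywords, case-insensitively.
    return isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider/i', $_SERVER['HTTP_USER_AGENT']) === 1;
}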
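
As a minimal sketch of how this check could feed the header the question asks about: the variant below takes the user agent as an argument, so it can be tried without a real request. The name looksLikeBot() and the header value are assumptions for illustration. X-Robots-Tag is a response header that major search engines document, but the right directives depend on what the site wants indexed.

```php
<?php
// Hypothetical helper: same pattern as isBot(), but the user agent is passed in
// so the check can be exercised outside of a real HTTP request.
function looksLikeBot(string $userAgent): bool
{
    return preg_match('/bot|crawl|slurp|spider/i', $userAgent) === 1;
}

// Usage sketch: a missing User-Agent header is treated as an empty string.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (looksLikeBot($userAgent) && !headers_sent()) {
    // Assumed policy, for illustration only: ask crawlers not to index this page.
    header('X-Robots-Tag: noindex, nofollow');
}
```

Keep in mind that the user agent is supplied by the client, so it can be faked, and a keyword pattern like this one can also match clients that are not search engines.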

1

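If you use $_SERVER['HTTP_USER_AGENT'], it means you want to put a test on each page. Something like:

 // User agent sent by the client (the header may be missing)
 $a = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
 // Placeholder: the exact user agent string of one search engine
 $searchEngine = 'msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)';
 if ($a === $searchEngine)
 {
    // It is the search engine: let's get out of here
    exit;
 }
 // If we get here, it is not that search engine, so we can continue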

The difficulty is the test itself. There are 2 cases:

  1. You want to authorize only one type of browser. For example, you want to be the only one with access. In this case, the test is: if HTTP_USER_AGENT equals my browser's, all right; if not, bye bye! This is easy because you know your own browser's HTTP_USER_AGENT.

  2. You want to deny access to search engines. But in this case, you need to know the HTTP_USER_AGENT of every engine... I find that impossible, because there are a lot of them and there is no standard for the format.

For example, here are the HTTP_USER_AGENT values of 4 "bots" (search engines):

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)

msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

Turnitinbot/3.0 (http://www.turnitin.com/robot/crawlerinfo.html)

They are quite different from each other, and checking in PHP that each one is a search engine looks quite complicated to me. You may need to find another option.
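
As a rough illustration of how different the strings are, and of what they still have in common, the sketch below pulls out the token containing "bot", "crawl", "slurp" or "spider" from each of the four samples above. The helper name botToken() is hypothetical, and the keyword list is only an assumption that happens to cover these samples, not a standard.

```php
<?php
// Hypothetical helper: returns the first word-like token that contains one of
// the crawler keywords, or null when nothing in the string matches.
function botToken(string $userAgent): ?string
{
    $pattern = '/[a-z0-9_-]*(?:bot|crawl|slurp|spider)[a-z0-9_-]*/i';
    return preg_match($pattern, $userAgent, $m) === 1 ? $m[0] : null;
}

// The four sample user agents quoted in this answer.
$samples = [
    'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)',
    'Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)',
    'msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)',
    'Turnitinbot/3.0 (http://www.turnitin.com/robot/crawlerinfo.html)',
];

foreach ($samples as $ua) {
    echo botToken($ua) ?? '(no match)', PHP_EOL;
}
// Prints:
// Baiduspider
// Exabot
// msnbot-media
// Turnitinbot
```

A crawler whose user agent contains none of these keywords would still slip through, so this only narrows the problem rather than solving it.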

One question: what is the real goal? Security? Privacy?
