One way to do this is to create rules in the .htaccess file that block user agents known to be robots. For that you need a complete list of these agents, or you have to search for one:
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
RewriteCond %{REQUEST_URI} !\/sem_crawler.htm
RewriteRule .* http://seusite.com.br/sem_crawler.htm [L]
Another way is to use PHP:
<?php
class CrawlerDetect
{
    // list of robots
    private $agentsInvalids = array(
        'Google'=>'Google',
        'MSN' => 'msnbot',
        'Rambler'=>'Rambler',
        'Yahoo'=> 'Yahoo',
        'AbachoBOT'=> 'AbachoBOT',
        'accoona'=> 'Accoona',
        'AcoiRobot'=> 'AcoiRobot',
        'ASPSeek'=> 'ASPSeek',
        'CrocCrawler'=> 'CrocCrawler',
        'Dumbot'=> 'Dumbot',
        'FAST-WebCrawler'=> 'FAST-WebCrawler',
        'GeonaBot'=> 'GeonaBot',
        'Gigabot'=> 'Gigabot',
        'Lycos spider'=> 'Lycos',
        'MSRBOT'=> 'MSRBOT',
        'Altavista robot'=> 'Scooter',
        'AltaVista robot'=> 'Altavista',
        'ID-Search Bot'=> 'IDBot',
        'eStyle Bot'=> 'eStyle',
        'Scrubby robot'=> 'Scrubby',
        // ...
    );
    // list of valid browsers
    private $agentsValids = array(
        'Mozilla' => 'Mozilla',
        'Chrome' => 'Chrome',
        'Safari' => 'Safari',
        'Opera' => 'Opera',
        // ...
    );
    // result of the check: true when the user agent looks like a robot
    private $isCrawler = false;

    public function __construct($USER_AGENT)
    {
        // build one case-insensitive pattern per list, e.g. /Google|msnbot/i
        $quote = function ($agent) { return preg_quote($agent, '/'); };
        $invalids = '/' . implode('|', array_map($quote, $this->agentsInvalids)) . '/i';
        $valids = '/' . implode('|', array_map($quote, $this->agentsValids)) . '/i';
        /* choose whichever check you prefer here;
           I believe testing a single list is enough */
        $this->isCrawler = preg_match($invalids, $USER_AGENT) === 1 ||
            preg_match($valids, $USER_AGENT) !== 1;
    }

    // a constructor cannot return a value, so the result is exposed here
    public function isCrawler()
    {
        return $this->isCrawler;
    }
}
// check the browser (the header may be missing, so fall back to an empty string)
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$crawler = new CrawlerDetect($userAgent);
// if it is a robot, report it
if ($crawler->isCrawler()) {
    echo "Invalid access!";
} else {
    echo "Valid access!";
}
This website has a complete or nearly complete list of browsers and crawlers.
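If you copy such a list into a text file, the check could be driven by that file instead of a hard-coded array. The following is only a minimal sketch under assumptions: the function name `matchesAnyAgent` and the file name `crawlers.txt` (one token per line) are hypothetical and not part of the answer above, and the fallback tokens are just samples.

```php
<?php
// Hypothetical helper: returns true when the user agent contains any token
// from the given list (case-insensitive substring match).
function matchesAnyAgent(string $userAgent, array $tokens): bool
{
    foreach ($tokens as $token) {
        $token = trim($token);
        // Skip blank lines and lines starting with '#', treated here as comments.
        if ($token === '' || $token[0] === '#') {
            continue;
        }
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// Hypothetical file name: one crawler token per line, copied from the list you chose.
$listFile = __DIR__ . '/crawlers.txt';
$tokens = is_readable($listFile)
    ? file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
    : array('Googlebot', 'bingbot', 'Baiduspider'); // sample fallback only

$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
echo matchesAnyAgent($userAgent, $tokens) ? "crawler\n" : "not in the list\n";
```

Keeping the tokens in a file means the list can be updated without touching the code, but it is still only as good as the list itself.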
Like I said, Daniel, you can check whether it's a robot using reCAPTCHA.
– Ivan Ferrer
That is unpleasant for the user, because the purpose is not to interact with the page, just to view it.
– Helmesvs
See whether that helps you.
– Ivan Ferrer
How do you know I’m not a bot commenting here for you at SOPT?
– Luiz Vieira
@Ivanferrer that wasn't it...
– Helmesvs
And @Luizvieira I think a bot wouldn’t ask me that.
– Helmesvs
A good enough bot could ask that. There are actually other indications that I'm not a bot that are better than my previous message, or even this one. You can look, for example, at my history of participation on the site. My point with this joke (sorry about it, by the way, it was just a joke) is that without analyzing some interaction history it will be difficult to detect anything. Unless you do as the existing answers suggest and ignore known bot origins. They are very good solutions, but not necessarily infallible. :)
– Luiz Vieira