How to detect a bot


I’m helping a friend develop a paid-visit system, like Rede Grana (e.g. Social Money). We will pay for real visits to a page, and we know there are malicious people who will try to game the system and inflate their views, for example by using fake users or a bot (HitLeap).

I need to know how to tell a real view from a bot view. I already looked for a solution using HTTP_USER_AGENT but got nothing; I also compared bot requests against real views and found nothing I could use.

What would be the best way to protect against this kind of abuse? Something like what YouTube can already do: distinguish real accesses from fake ones.

Thanks in advance...

P.S.: I know how to detect common indexers, so don’t point me to articles about Googlebot.

  • Like I said, Daniel, you can check whether it’s a robot with reCAPTCHA.

  • That would be unpleasant for the user, because the goal is not to interact with the page, just to view it.

  • See if this helps you.

  • How do you know I’m not a bot commenting here to you on SOPT?

  • @Ivanferrer, that was not it...

  • And @Luizvieira, I think a bot wouldn’t ask me that.

  • 1

    A good enough bot could ask that. There are actually other indications that I’m not a bot that are better than my previous message or even that. You can look, for example, my history of participation on the site. My point with this joke (sorry for her, by the way, it was just a joke) is that without analyzing some interaction history will be difficult to detect something. Unless you do as the answers you have already suggest, and ignore known bots origins. They are very good solutions, but not necessarily infallible. :)


2 answers



I think the only effective way is to use a CAPTCHA; other approaches are easy to circumvent.

There are good ways to estimate the number of real visitors; one example is SO’s own view counter. But even that method can be circumvented with distributed bots or proxies.
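For reference, here is a minimal sketch of the server-side CAPTCHA check, assuming reCAPTCHA v2 and PHP. The siteverify endpoint and the g-recaptcha-response field are part of Google’s documented flow; YOUR_SECRET_KEY is a placeholder, and the rest is illustrative, not the asker’s code:

<?php
// Hypothetical server-side validation of a reCAPTCHA v2 token.
// 'g-recaptcha-response' is the field the reCAPTCHA widget posts;
// replace YOUR_SECRET_KEY with the secret key from your reCAPTCHA account.
$secret = 'YOUR_SECRET_KEY';
$token  = isset($_POST['g-recaptcha-response']) ? $_POST['g-recaptcha-response'] : '';

// Google's verification endpoint expects a POST
$context = stream_context_create(array(
    'http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => http_build_query(array(
            'secret'   => $secret,
            'response' => $token,
            'remoteip' => $_SERVER['REMOTE_ADDR'],
        )),
    ),
));
$result = file_get_contents('https://www.google.com/recaptcha/api/siteverify', false, $context);
$data   = json_decode($result, true);

if (!empty($data['success'])) {
    echo "valid access!";   // a human solved the challenge
} else {
    echo "invalid access!"; // failed or missing CAPTCHA
}

Note that this still requires the visitor to interact with the widget, which is exactly the friction the asker wants to avoid; it fits sign-ups and form posts better than passive page views.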

  • This answer started as a comment, but if there is nothing better, I think it is good enough to point out a direction for further research.


One way to do this is to create rules in the .htaccess that block some known robot user agents; for that you would need a complete list, or to hunt down a comprehensive list of these agents:

RewriteEngine on
# match any of the known crawler user agents (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Twitterbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MetaURI [NC,OR]
RewriteCond %{HTTP_USER_AGENT} mediawords [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlipboardProxy [NC]
# do not redirect the target page itself (avoids a loop)
RewriteCond %{REQUEST_URI} !\/sem_crawler.htm
RewriteRule .* http://seusite.com.br/sem_crawler.htm [L]

Another way is by using PHP:

<?php
class CrawlerDetect
{
    // list of known robot user-agent substrings
    private $agentsInvalids = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        // ...
    );

    // list of valid browser substrings
    private $agentsValids = array(
        'Mozilla' => 'Mozilla',
        'Chrome'  => 'Chrome',
        'Safari'  => 'Safari',
        'Opera'   => 'Opera',
        // ...
    );

    // set to true when the user agent looks like a robot
    public $isCrawler = false;

    public function __construct($userAgent)
    {
        // build one regex alternation per list
        $invalids = implode('|', array_map('preg_quote', $this->agentsInvalids));
        $valids   = implode('|', array_map('preg_quote', $this->agentsValids));
        /* here you choose whichever you prefer;
           I believe testing a single list is enough */
        if (preg_match('#' . $invalids . '#i', $userAgent) ||
            !preg_match('#' . $valids . '#i', $userAgent)) {
            $this->isCrawler = true;
        }
    }
}

// check the browser
$crawler = new CrawlerDetect($_SERVER['HTTP_USER_AGENT']);

// if it is a robot
if ($crawler->isCrawler) {
    echo "invalid access!";
} else {
    echo "valid access!";
}

This website has a near-complete list of browsers and crawlers.

  • My problem is with other types of bot. Google, Yahoo and Facebook each have a unique HTTP_USER_AGENT, but a bot running through HitLeap sends an HTTP_USER_AGENT just like a common user’s. Google => Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). A common user, or a user via HitLeap => Mozilla/5.0 (Linux; Android 4.4.4; SM-G530BT Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/33.0.0 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/77.0.0.20.66;]. It is very easy to tell an indexer bot apart; the ones created by users themselves I no longer know how to detect.

  • So why don’t you just allow the known browsers and block all the others? Note that I posted the address of a site that lists the common browsers.

  • One way to avoid bots is to ask for personal information; there is no better way than that. Another is to demand some logical reasoning from the user, that is, to create something that only a real user could get past (see the sketch below).
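Building on that last idea, one approach is to only credit a view after the page shows some sign of a real user: a one-time token plus a minimum time on page. The sketch below is illustrative only, assuming a hypothetical count_view.php endpoint that a small script on the page calls after an interaction such as a scroll or mouse movement; none of these names come from the question:

<?php
// count_view.php (hypothetical) - called by a beacon the page fires
// after real interaction (mousemove/scroll); simple bots never fire it.
session_start();

// the page that rendered the view stored a one-time token and a timestamp
$token = isset($_POST['token']) ? $_POST['token'] : '';
if (!isset($_SESSION['view_token']) ||
    !hash_equals($_SESSION['view_token'], $token)) {
    http_response_code(403);
    exit('invalid access!');
}

// require a minimum time on page; many bots "view" and leave instantly
$started = isset($_SESSION['view_started']) ? $_SESSION['view_started'] : 0;
if (time() - $started < 5) {
    http_response_code(403);
    exit('invalid access!');
}

unset($_SESSION['view_token']); // one-time use, so replays don't count twice
// ... credit the view here (e.g. increment a counter in the database)
echo 'valid access!';

A determined bot can still script the interaction, so this only raises the cost; combined with per-IP rate limits and the agent lists above, it filters out the lazy cases.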
