Protect web pages from automated access


How can I protect my web pages so that they are not accessed in an automated way?

  • By search engine bots like Googlebot (I think the basic approach is the meta tag with noindex and nofollow).
  • By headless browsers (browsers without a graphical interface that respond to commands via the command line and/or scripts, and can access thousands of pages in batch).
  • By homemade scripts (usually in PHP, which I know a little of) that can access thousands of pages in batch using common functions such as file_get_html or file_get_contents.

Note: For the last two cases it is possible to set the HTTP User-Agent field so that the script/headless browser passes itself off as a common browser such as Firefox.
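For illustration, a minimal sketch of such a homemade script, spoofing the User-Agent via a stream context (the URL and the User-Agent string below are just placeholders):

    <?php
    // Hypothetical target URL, for illustration only.
    $url = 'https://www.example.com/pagina.html';

    // Pretend to be an ordinary Firefox by overriding the User-Agent header.
    $contexto = stream_context_create([
        'http' => [
            'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0\r\n",
        ],
    ]);

    // Download the raw HTML exactly as a simple crawler would.
    $html = file_get_contents($url, false, $contexto);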

Note 2: Related question: What does this anti-robot JavaScript code do?

  • Complementing the first note: checking the User-Agent is naive, because user-agent spoofing is a very common practice. The same applies to other browser-sniffing measures that try to identify the browser with JavaScript, since the bot may be using a tool that automates regular browsers (such as Selenium).

  • Wouldn’t it be an option to encode part of the content and decode it with eval, among other tricks, so that it is not completely visible? You could keep changing the encoded content to make decoding it in PHP harder, using randomized code for example. Maybe that would help against file_get_contents.

1 answer



Blocking access by search engine bots is very different from the other cases. The former respect the rules you create, while the others try to circumvent any rule... it is a game of cat and mouse.

Since restricting access by official search engines is trivial and extensively documented, I will focus on the methods that hinder access by these other, unregulated web crawlers.
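For reference, the first case boils down to the standard robots directives, which well-behaved crawlers honor. A minimal sketch in PHP, equivalent to the noindex/nofollow meta tag mentioned in the question:

    <?php
    // Well-behaved crawlers such as Googlebot honor this standard directive.
    // It is equivalent to <meta name="robots" content="noindex, nofollow"> in the HTML.
    header('X-Robots-Tag: noindex, nofollow');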

Do not use sequential URLs

Be careful with pages served at URLs like www.site.com/dados.php?id=100. Writing a script that downloads a batch of data from such a site is as easy as this single command in a UNIX terminal: curl -O www.site.com/dados.php?id=[100-1000].
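One way to avoid enumerable URLs, as a rough sketch, is to expose a random token instead of the sequential id (the table and column names here are hypothetical, and $pdo is assumed to be an already configured PDO connection):

    <?php
    // When the record is created, generate an unpredictable token and store it
    // alongside the numeric id; expose only the token in URLs, e.g.
    // www.site.com/dados.php?token=3f9a... instead of ?id=100.
    $token = bin2hex(random_bytes(16));

    // When serving the page, look the record up by token, never by sequential id.
    $stmt = $pdo->prepare('SELECT * FROM dados WHERE token = :token');
    $stmt->execute([':token' => $_GET['token'] ?? '']);
    $registro = $stmt->fetch();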

Load content by AJAX

This prevents simple scripts (whether in Bash, PHP, Python, etc.) from accessing the content, since they have no JavaScript interpreter (some do not even have an HTML parser); they just download the page over HTTP. In fact, avoiding pages that rely heavily on AJAX is part of standard SEO advice, because it is hard even for Google to index them correctly.

But be careful not to make the crawler's job easier when implementing an AJAX solution that returns JSON ready to be parsed. Use CSRF tokens to restrict access to the JSON/XML to clients that have already loaded the main page; otherwise you will be helping rather than hindering them.
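A minimal sketch of that idea, assuming PHP sessions (the file names are hypothetical):

    <?php
    // pagina.php -- the main page: create the token and embed it for the AJAX call.
    session_start();
    if (empty($_SESSION['csrf_token'])) {
        $_SESSION['csrf_token'] = bin2hex(random_bytes(32));
    }
    echo '<div id="conteudo" data-token="' . $_SESSION['csrf_token'] . '"></div>';

And the endpoint only responds when the token matches:

    <?php
    // dados_ajax.php -- the AJAX endpoint: only answer requests carrying the token.
    session_start();
    $token = $_GET['token'] ?? '';
    if (empty($_SESSION['csrf_token']) || !hash_equals($_SESSION['csrf_token'], $token)) {
        http_response_code(403); // the main page was never loaded: refuse the JSON
        exit;
    }
    header('Content-Type: application/json');
    echo json_encode(['conteudo' => 'data loaded via AJAX']);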

However, nothing prevents someone minimally committed from using a headless browser such as PhantomJS, which is able to interpret JavaScript and load the entire page.

Captcha

An image with distorted letters on a psychedelic background will stop many crawlers. Still, this most popular method of telling humans from robots is not foolproof.
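A minimal hand-rolled sketch of the idea, assuming the GD extension is available (real generators apply far more distortion and noise than this):

    <?php
    // captcha.php -- generate a 5-character code, remember it in the session
    // and render it as a deliberately simple PNG.
    session_start();
    $codigo = substr(str_shuffle('ABCDEFGHJKLMNPQRSTUVWXYZ23456789'), 0, 5);
    $_SESSION['captcha'] = $codigo;

    $img   = imagecreatetruecolor(120, 40);
    $fundo = imagecolorallocate($img, 230, 230, 230);
    $tinta = imagecolorallocate($img, 30, 30, 30);
    imagefilledrectangle($img, 0, 0, 119, 39, $fundo);
    imagestring($img, 5, 35, 12, $codigo, $tinta);

    header('Content-Type: image/png');
    imagepng($img);
    imagedestroy($img);

On submission, the form is only accepted when the typed value matches the stored code:

    <?php
    // verifica.php -- compare the user's answer with what was stored in the session.
    session_start();
    if (strcasecmp($_POST['captcha'] ?? '', $_SESSION['captcha'] ?? '') !== 0) {
        http_response_code(403);
        exit('Incorrect captcha.');
    }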

There are OCRs capable of reading captchas, but they are laborious to program and specific to each captcha-generating mechanism. Slight changes to the captcha algorithm can take a lot of work to reflect in the OCR, which, depending on how often you make such changes, can render the whole procedure unviable.

There are also services specialized in solving captchas, such as DBC and Behead. They charge a few bucks to solve a thousand captchas. The advantage is that they can break any captcha, even the old Google reCAPTCHA model, considered unbreakable for some time. That is because there is no robot trying to impersonate a human: these services employ human workers in countries with cheap labor who keep typing the letters 24/7.

IP blocking

This mechanism is fundamental. It is a science of its own, aimed at separating the wheat from the chaff, that is, the robot from the human, based on "behavior" patterns. I recommend the excellent Coding Horror article that covers the theoretical side of the subject with great analogies.

The implementation on your site can be done through a middleware, if you are using a framework, or directly on the server using fail2ban.
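As a sketch of the application-side variant, a naive fixed-window counter per IP kept in APCu (the extension is assumed to be available; the threshold of 60 requests per 60 seconds is arbitrary):

    <?php
    // Naive fixed-window rate limiter per IP, using the APCu extension.
    $ip    = $_SERVER['REMOTE_ADDR'];
    $chave = 'hits_' . $ip;

    // Create the counter with a 60-second lifetime on the first hit, then increment it.
    apcu_add($chave, 0, 60);
    $hits = apcu_inc($chave);

    if ($hits > 60) {
        http_response_code(429); // Too Many Requests -- likely an abusive client
        exit;
    }

In a real setup you would pair this with longer-term blocking; you could, for example, have fail2ban watch the access log for these 429 responses.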

This method can be circumvented if the bot uses proxies, but in that case the cost is higher for the attacker, since his mechanism will burn through the IPs he paid to use.


By combining these methods it is possible to keep out many crawlers.

But as I have shown, it is impossible to know for certain whether a request was made by a robot or a human. Even if you implement all of these measures, in the end it comes down to the cost-benefit calculation made by the bot's author (cost in terms of both money and effort). So anyone who does not want to be scraped by bots needs someone to monitor access, check for abuse, and reinvent blocking techniques as bots learn to circumvent the old ones. A game of cat and mouse.
