In the context of a webcrawler, the crawler is a program that sends requests to domains in order to check, index, and organize the available resources for optimized future access.
One of the reasons to have a robots.txt is to avoid unnecessary requests that consume your application's resources. Whenever a URL is indexed, it will be revisited with some frequency, consuming resources (on both the client side, in this case the webcrawler, and the server side, in this case your application's environment) every time it is requested. So, yes, it is good practice to tell webcrawlers that there are resources that should not be indexed.
The client identifies itself with a User-agent which, when it reads the instruction Disallow: / in a robots.txt, understands that it should not index anything under that directory; however, if there is also an instruction like Allow: /esse-sim, it will index that path.
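As a minimal sketch, a robots.txt combining those two instructions could look like the block below (/esse-sim is just the illustrative path used above):

    User-agent: *      # rules apply to any crawler
    Disallow: /        # do not index anything under the root
    Allow: /esse-sim   # except this path, which may be indexed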
As stated in the comments, robots read .txt files in order to find URLs to add to their index; the CSS is used by your application, and the JS is used by the browser, which is also a client like the webcrawler but is prepared to fetch JS, unlike the webcrawler, which only wants the URLs.
Note that webcrawlers are created for various purposes. When you put an instruction in a robots.txt, it does not mean you are "blocking access to the resource"; you are saying that there is nothing there that is relevant to the context your application proposes.
That means that if some chinaspider wants to check whether your application has a vulnerability in /wp-admin, even with a disallow instruction in a robots.txt it can still request the resource and add it to its own index.
In the case of Google, it has its own User-agents for various platforms and purposes, and because of their specific purposes they benefit from being informed about resources that are irrelevant to them.
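As a hedged sketch of how per-agent rules can look (Googlebot-Image is one of Google's real User-agents; the /thumbnails/ path is purely hypothetical, while /wp-admin is the example mentioned above):

    User-agent: Googlebot-Image   # rules only for Google's image crawler
    Disallow: /thumbnails/        # hypothetical path with no relevant images

    User-agent: *                 # rules for every other crawler
    Disallow: /wp-admin           # nothing relevant to index here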
What do you mean by "played" by the robots?
– MagicHat
@Magichat Interpreted by the crawlers that access the robots.txt file
– Marcus Oliveira
In this case, it will depend on the URL location of the "content". It works like a directory schema: if you block "/root", there is nothing left to index inside "/root".
– MagicHat
You block everything you don't want crawled; if /index generates the content and it isn't blocked, it will be indexed. The block tells the crawler that, from a certain level of the directory tree down, it doesn't even need to spend resources checking.
– MagicHat
It would be interesting to add the robots.txt you are developing.
– MagicHat
@Magichat So do you think it would be feasible to block all folders/files and put in the robots.txt only the permission for index, images, css and js?
– Marcus Oliveira
But what sense does it make for the crawler to index css, js... Do they have relevant content?
– MagicHat
Stick to the goal of the webcrawler: to index URLs. From the moment the URL is indexed, any other program with access to this index will have an optimized path to reach it; css and js will be used normally by the resource that requested them. – MagicHat
@Magichat I read some articles that suggest allowing these resources so that the crawlers can tell whether the site is mobile-friendly. I'm just starting to study SEO, so I'm trying to filter the information I read on some sites and blogs: https://yoast.com/dont-block-css-and-js-files/
– Marcus Oliveira
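A rough sketch of the configuration discussed in these comments, assuming the assets live under /images/, /css/ and /js/ (the directory names are assumptions) and keeping CSS/JS crawlable as the linked Yoast article recommends:

    User-agent: *
    Disallow: /          # block everything by default
    Allow: /index        # the page that generates the content, per the comments above
    Allow: /images/      # assumed asset directories; adjust to your structure
    Allow: /css/
    Allow: /js/

Crawlers that follow the standard precedence rules (such as Googlebot) apply the most specific matching path, so the longer Allow rules override the general Disallow: /.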