In the context of a webcrawler, the crawler is a program that sends requests to domains in order to check, index, and organize the available resources for optimized future access.
One of the reasons to have a robots.txt is to avoid unnecessary requests that consume your application's resources. Whenever a URL is indexed, it will be revisited with some frequency, consuming resources (on both the client side, in this case the webcrawler, and the server side, in this case your application's environment) every time it is requested. So, yes, it is good practice to tell webcrawlers that there are resources that should not be indexed.
The client identifies itself with a User-agent which, when it reads the instruction Disallow: / in a robots.txt, understands that it should not index anything under that directory; however, if there is also an instruction like Allow: /esse-sim, it will index that path.
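As a minimal sketch, a robots.txt combining those two instructions could look like the block below (/esse-sim is just the illustrative path used above):

    User-agent: *      # rules apply to any crawler
    Disallow: /        # do not index anything under the root
    Allow: /esse-sim   # except this path, which may be indexed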
As stated in the comments, robots read .txt files in order to find URLs to add to their index; the CSS is used by your application, and the JS is used by the browser, which is also a client like the webcrawler but is prepared to fetch JS, unlike the webcrawler, which only wants the URLs.
Note that webcrawlers are created for various purposes. When you put an instruction in a robots.txt, it does not mean you are "blocking access to the resource"; you are saying that there is nothing there that is relevant to the context your application proposes.
That means that if some chinaspider wants to check whether your application has a vulnerability in /wp-admin, even with a disallow instruction in a robots.txt it can still request the resource and add it to its own index.
In the case of Google, it has its own User-agents for various platforms and purposes, and because of their specific purposes they benefit from being informed about resources that are irrelevant to them.
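As a hedged sketch of how per-agent rules can look (Googlebot-Image is one of Google's real User-agents; the /thumbnails/ path is purely hypothetical, while /wp-admin is the example mentioned above):

    User-agent: Googlebot-Image   # rules only for Google's image crawler
    Disallow: /thumbnails/        # hypothetical path with no relevant images

    User-agent: *                 # rules for every other crawler
    Disallow: /wp-admin           # nothing relevant to index here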
What do you mean by "played" by the robots?
– MagicHat
@Magichat Interpreted by the crawlers that access the robots.txt file
– Marcus Oliveira
In this case, it will depend on the URL location of the "content". It works like a directory schema: if you block "/root", there is nothing left to index inside "/root".
– MagicHat
You block everything you don't want crawled; if /index generates the content and it isn't blocked, it will be indexed. The block tells the crawler that, from a certain level of the directory tree down, it doesn't even need to spend resources checking.
– MagicHat
It would be interesting to add the robots.txt you are developing.
– MagicHat
@Magichat So do you think it would be feasible to block all folders/files and put in the robots.txt only the permission for index, images, css and js?
– Marcus Oliveira
But what sense does it make for the crawler to index css, js... Do they have relevant content?
– MagicHat
Stick to the goal of the webcrawler: to index URLs. From the moment the URL is indexed, any other program with access to this index will have an optimized path to reach it; css and js will be used normally by the resource that requested them. – MagicHat
@Magichat I read some articles that suggest allowing these resources so that the crawlers can tell whether the site is mobile-friendly. I'm just starting to study SEO, so I'm trying to filter the information I read on some sites and blogs: https://yoast.com/dont-block-css-and-js-files/
– Marcus Oliveira
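A rough sketch of the configuration discussed in these comments, assuming the assets live under /images/, /css/ and /js/ (the directory names are assumptions) and keeping CSS/JS crawlable as the linked Yoast article recommends:

    User-agent: *
    Disallow: /          # block everything by default
    Allow: /index        # the page that generates the content, per the comments above
    Allow: /images/      # assumed asset directories; adjust to your structure
    Allow: /css/
    Allow: /js/

Crawlers that follow the standard precedence rules (such as Googlebot) apply the most specific matching path, so the longer Allow rules override the general Disallow: /.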