Concept of blocking directories in robots.txt


I have a question about blocking files and directories through robots.txt. My site's structure is composed of directories of backend files (models, controllers and classes), which are responsible for generating the content.

The structure is as follows:

[screenshot: the site's directory structure]

User navigation always goes through index.php, which invokes the backend files in the other folders and generates the content.

robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /.git/
Disallow: /project/
Disallow: /model/
Disallow: /objects/
Disallow: /controller/

Sitemap: https://www.detetiveparticularemsp.com.br/sitemap.xml

Is it necessary (or advisable) to block these backend content-processing files/folders in robots.txt, leaving only the images, CSS and index.php available?

  • What do you mean by "read" by the robots?

  • @Magichat Interpreted by crawlers when they access the robots.txt file.

  • In that case it depends on the URL where the "content" lives. It works like a directory scheme: if you disallow "/root", nothing inside "/root" gets indexed.

  • You block everything you don't want crawled; if /index generates the content and is not blocked, it will be indexed. The block tells the crawler that, from a certain directory level down, it doesn't even need to spend resources checking.

  • It would help to add the robots.txt you are writing to the question.

  • @Magichat So you think it would make sense to block all folders/files and allow only index, images, CSS and JS in robots.txt?

  • But what would be the point of the crawler indexing CSS, JS...? Do they have relevant content?

  • Keep the web crawler's goal in mind: indexing URLs. Once a URL is indexed, any other program with access to that index has an optimized path to reach it; CSS and JS are consumed normally by whatever resource requested them.

  • @Magichat I've read articles suggesting these resources be left unblocked so crawlers can tell whether the site is mobile-friendly. I'm just starting to study SEO and am trying to filter what I read on various sites and blogs (see the sketch after these comments): https://yoast.com/dont-block-css-and-js-files/

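One way to put the thread's conclusion into practice is the hedged sketch below: keep the backend folders from the question disallowed and simply don't mention the public resources, since anything not disallowed is crawlable by default. The /images/, /css/ and /js/ paths are assumptions about this site's layout, not taken from the question:

User-agent: *
Disallow: /admin/
Disallow: /.git/
Disallow: /project/
Disallow: /model/
Disallow: /objects/
Disallow: /controller/
# Nothing above matches /index.php, /images/, /css/ or /js/,
# so compliant crawlers remain free to fetch them; no explicit
# Allow lines are needed.

Sitemap: https://www.detetiveparticularemsp.com.br/sitemap.xml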

2 answers


The robots.txt only keeps the search engine (Google) away from content, not a normal browser. And even then, only if the search engine chooses to respect it. It is not universal protection; it's a "gentlemen's agreement." If you're thinking of it as protection, forget it.

The search engine reads this file and follows the instructions inside it, ignoring whatever the file tells it to ignore. In other words, you only use it for content you don't want the search engine to index: content you chose to make public but don't want showing up in search results. That is all it does; it is good for nothing else.

Understand that what sits inside your website's structure is not necessarily publicly accessible, whether to the search engine, a browser or anything else. But it can be accessed if the content is public. Access happens through certain triggers, for example a link pointing to the page, or a sitemap, as you have. The main reason to have a sitemap is wanting the content properly indexed, so also disallowing that content in robots.txt makes no sense: you have one or the other; they contradict each other.
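To make the contradiction concrete, a minimal sketch with a hypothetical /artigos/ path (not from the question):

User-agent: *
Disallow: /artigos/
# ...tells crawlers not to fetch anything under /artigos/,

Sitemap: https://www.example.com/sitemap.xml
# ...while this sitemap lists URLs such as
# https://www.example.com/artigos/algum-post, asking for exactly
# those URLs to be crawled and indexed. One of the two has to go.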

If what you want is for none of these files to be accessible, then you must configure the HTTP server appropriately so that they are not made public, or at least so that they cannot be read directly, only executed. It is common for people to use ready-made configurations that already solve this. If you don't know how to do it properly, I suggest hiring a professional to do it for you.
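A minimal sketch of that kind of server-side blocking, assuming Apache 2.4, a .htaccess file dropped into each backend folder from the question (model/, controller/, objects/, project/), and a server configured to honor .htaccess overrides; everything beyond the folder names is an assumption:

# .htaccess inside /model/, /controller/, /objects/, /project/
# Denies all direct HTTP access. PHP can still include/require
# these files, because includes go through the filesystem,
# not through HTTP.
Require all denied

With that in place, a request for /model/anything gets a 403 while index.php keeps working normally, and the folders no longer need to appear in robots.txt at all.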

If everything is set up correctly, your website will run the .php files internally on the server and generate output that is sent to whoever requested it. The file itself will not be read directly; no one will get hold of your code (again, if everything is configured correctly).

Since you put up a robots.txt, nothing in those folders will be read (not even index.php will be called); your site will never be indexed. And since, as said in the comments, only index.php is accessed, that content seems to be public, and you even have a *sitemap*, the suggestion is simply to remove the robots.txt: it is preventing Google and other search engines from indexing your content.

The impression is that you are trying to protect some folders from being accessed at all with the robots.txt, but that is not what happens. Protection comes from the HTTP server configuration (Apache, IIS, etc.) and from the permissions given to files and folders in the operating system's filesystem.

  • So, in this case it is a website whose code fetches the content from the database, to make it easier for admins to insert new articles or updates.

  • @Marcusoliveira So the content is generated in the backend, and the folders you mentioned exist only in the backend, right? Then it falls under what I said: robots.txt has nothing to do with it, and those files should not be publicly accessible at all. If they can be accessed, there is some other problem.

  • @Marcusoliveira, no, your comment is totally wrong.

  • @Maniero I understood your answer and it cleared up my question almost completely; the question really may not be very clear. I will edit it to make it easier. I don't know how to put it in other words, but my project (site) has only one index.php file, which is responsible for working out which path the user accessed and returning the appropriate content from the database, using the files and classes contained in the model, controller etc. folders... So I wonder whether I should leave only the index and style files available in robots.txt

  • @Maniero ...or whether I also have to leave the content-processing files unblocked

  • @Maniero Your last answer already made it clear that I should keep these folders blocked in robots.txt, since the content is generated by the files inside them, right?

  • If you really want to understand how this works, don't skip steps in your learning; understand what actually happens and don't rely on third-party assertions. Edit your question to include the necessary elements, including examples within the scope of your doubts, so that one question connects to the next and helps improve the community's collaboration experience. Good luck ;)

  • @Magichat Sorry, it really isn't complete. I will edit it to specify how the project is structured; I believe that will make it easier to answer.

  • @Marcusoliveira No, that's not what I said, quite the contrary: what I said is that using robots.txt is irrelevant here, if I understand what you're doing correctly.




In the web-crawler context, the crawler is a program that sends requests to domains in order to inspect, index and organize the available resources for optimized future access.

One of the reasons to have a robots.txt is to avoid unnecessary requests to your application consuming resources.

Once a URL is indexed, it will be revisited with a certain frequency, consuming resources on both sides (the client, here the web crawler, and the server, here your application's environment) every time it is requested.

So yes, it is good practice to tell web crawlers that there are resources that should not be indexed.

The client identifies itself with a User-agent which, upon reading a Disallow: / instruction in a robots.txt, understands that it should not index anything under that directory; however, if there is also an instruction like Allow: /esse-sim, it will index that path.
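A minimal sketch of that rule pair, keeping the hypothetical /esse-sim path from the answer:

User-agent: *
Disallow: /           # nothing on the site should be crawled...
Allow: /esse-sim      # ...except this path: the more specific rule
                      # wins for Google's crawler (longest path match);
                      # other bots may resolve the conflict differently.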

As said in the comments, web crawlers read .txt files in order to find URLs to include in their index. Whoever uses the CSS is your application; whoever uses the JS is the browser, which is also a client like the web crawler but is built to fetch JS, unlike the web crawler, which only wants the URLs.

Note that web crawlers are created for all sorts of purposes.

When you declare an instruction in a robots.txt, it is not so much that you are "blocking access to the resource"; you are saying that there is nothing there relevant to the context your application proposes.

That means that if some chinaspider wants to check whether your application has a vulnerability in /wp-admin, then even with a disallow instruction in your robots.txt it can still request the resource and record it in its own index.

In Google's case, it has its own User-agents for various platforms and purposes which, given their specific goals, are glad to be told about the presence of irrelevant resources.
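For illustration, rules can target those agents individually. Googlebot and Googlebot-Image are real Google agent tokens; the paths here are placeholders:

# Read only by Google's image crawler
User-agent: Googlebot-Image
Disallow: /fotos-privadas/

# Read by Google's main web crawler
User-agent: Googlebot
Disallow: /admin/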
