The question talks about two things: partial indexing of the site and partial indexing of the page.
Yes, you can use robots.txt or the meta name="robots" tag to indicate what you do not want indexed.
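For illustration, something like this (the /private/ path is just an example; robots.txt asks crawlers not to crawl a path, while the meta tag asks them not to index that page):

```
# robots.txt at the site root: asks well-behaved crawlers not to crawl /private/
User-agent: *
Disallow: /private/
```

```
<!-- inside the <head> of a page you do not want indexed -->
<meta name="robots" content="noindex, nofollow">
```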
In fact, if the content is protected from access in some other way, you don't even need these. If the page can only be accessed with a password, the content will not be indexed.
In fact, this is the only effective way to prevent indexing, since saying you don't want something indexed is just a convention. An indexer may not respect it. Today Google respects it, but it could stop respecting it whenever it wants, and there are malicious indexers.
Obviously, only access control done on the server is actually effective.
All of this is well known. I think the biggest question is about partial indexing of the page.
This is usually done by identifying whether the client requesting the page is an indexer. The site generates a different page, with partial content, when the client is not a known indexer. Obviously, it is possible to fool the site by claiming to be the indexer. So the indexer gets the whole page and can index all the content, while a normal client gets a capped page. This can bring penalties in the search ranking if the search engine detects the maneuver. And of course, it will always be possible to access the content through the indexer's cache or through a site that does the same thing as the indexer (e.g., Outline.com).
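A rough sketch of that detection in PHP, just to make the idea concrete (the bot names and the article variables are hypothetical, and, as noted above, the User-Agent check is trivial to spoof):

```php
<?php
// Sketch only: user agents can be spoofed, and search engines may
// penalize serving different content to crawlers and to users.
// The article strings below are hypothetical placeholders.

$fullArticle = '<p>First half of the article...</p><p>Second half...</p>';
$firstHalf   = '<p>First half of the article...</p>';

$userAgent      = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isKnownIndexer = (bool) preg_match('/Googlebot|Bingbot/i', $userAgent);

if ($isKnownIndexer) {
    echo $fullArticle;                        // the indexer sees and indexes everything
} else {
    echo $firstHalf;                          // a normal client gets the capped page
    echo '<p>Log in to read the rest.</p>';
}
```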
Obviously, you can send all the content to the client and hide part of it via JavaScript. This protects nothing; it only deceives, since the content is already there. It may be hard for a layperson to get at, but there is no protection.
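Just to illustrate what "the content is there" means, a page like this hides the second half from view, but anyone who opens the page source sees it all (purely schematic markup):

```
<p>First half of the article.</p>
<div id="rest" hidden>
  <!-- "hidden" only affects display; this text is still in the HTML
       the client downloaded and shows up in view-source -->
  <p>Second half of the article.</p>
</div>
```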
I'll take this opportunity to address the myth that indexers run JavaScript. Some do, but not all, and they can't simulate user actions the way a real user does. So don't count on indexing if the content depends on user interaction or on anything else the indexer can't do, and new things the indexer can't simulate come up all the time. Programmatic code exists on the page precisely to define flows that are not standardized, and that, by definition, makes it impossible in practice to simulate everything that can occur.
If you want to protect the content, control it on the server. And obviously that only controls the display; it does not prevent someone from copying it and posting it elsewhere, even automatically. It's good to be clear about this, because some people think Santa Claus exists.
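A minimal sketch of that server-side control in PHP (the 'logged_in' session flag is a hypothetical name set by your own login code; the point is only that the protected markup never leaves the server for anonymous visitors):

```php
<?php
// Minimal sketch, assuming your login code sets $_SESSION['logged_in'].

session_start();

echo '<p>Public part of the article, visible and indexable by anyone.</p>';

if (!empty($_SESSION['logged_in'])) {
    // Only sent to authenticated users, so there is nothing here
    // for an indexer (or anyone else without access) to index.
    echo '<p>Protected part of the article.</p>';
} else {
    echo '<p>Log in to see the rest.</p>';
}
```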
So what really happens is complete indexing by the search engines, with programming logic on the server controlling what can and cannot be seen by users. In that case the control can be done via JavaScript (hiding the content) or via PHP (some logic running on the server side), is that it?
– Jonathas B. C.
Yeah, but like I said, it doesn't really protect anything. Either you protect or you index; you can't do both. That's why I put that last sentence in.
– Maniero
I get it, but a hypothesis has just occurred to me. Imagine an article post split in half: the first half is posted and published and the other half is not (in this case there is no way to index it, since it does not exist yet). Then the user logs in and the server quickly sends the missing chunk (the second part of the post). In this hypothetical case, didn't protection and indexing happen together?
– Jonathas B. C.
No, just the protection; the other part isn't visible to anyone, so it hasn't been indexed. If it had been indexed, it could be seen in the indexer's cache, and then it would not be protected.
– Maniero
Fantastic, thank you very much.
– Jonathas B. C.