Is it possible for sites to be indexed "in half"?

While browsing the internet and reading article after article, I came across some that are only displayed in half, or only show the beginning (here’s an example). In such cases, only subscribers have full access.

My questions are:

Is the site fully indexed by the search engines, or is the page only indexed in half? Is the robots.txt file related to this?

How is this kind of delimitation of a site’s content possible, controlling what the reader may or may not read?

2 answers

The question talks about two different things: partial indexing of the site and partial indexing of a page.

Yes, you can use robots.txt or the meta name="robots" tag to indicate what you do not want indexed.
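
For illustration, a minimal sketch of emitting these signals from the server, assuming a Node.js application with Express (the /restrito route is hypothetical); the X-Robots-Tag response header is the HTTP equivalent of the robots meta tag:

import express from "express";

const app = express();

// Hypothetical route whose content should stay out of the index.
app.get("/restrito", (_req, res) => {
  // HTTP equivalent of <meta name="robots" content="noindex">.
  res.set("X-Robots-Tag", "noindex");
  res.send('<html><head><meta name="robots" content="noindex"></head><body>Restricted article...</body></html>');
});

app.listen(3000);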

In fact, if access to the content is protected in some other way, you don’t even need that. If the page can only be accessed with a password, its content will not be indexed.

Actually, this is the only effective way to prevent indexing, since declaring that you don’t want something indexed is just a convention. An indexer may simply not respect it. Today Google respects it, but it could stop doing so whenever it wants, and there are malicious indexers out there.

Obviously, only access control done on the server is truly effective.

All of this is well known. I think the biggest question is about partial indexing of the page.

This is usually done by detecting that the client requesting the page is an indexer: the server generates a different page, with only partial content, when the request does not come from a known indexer. The indexer therefore gets the whole page and can index all of the content, while a normal client gets the capped page. Obviously it is possible to fool the site by claiming to be the indexer. Serving different content this way can bring ranking penalties if the search engine identifies the maneuver. And of course it will always be possible to reach the content through the indexer’s cache, or through a website that does the same thing the indexer does (e.g., Outline.com).
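
A minimal sketch of that maneuver, assuming an Express server (the User-Agent test and the article strings are illustrative; real crawler verification typically also checks the requester’s IP or reverse DNS, precisely because this header is trivial to forge):

import express from "express";

const app = express();

const fullArticle = "First half of the article... second half of the article.";
const cappedArticle = "First half of the article... (subscribe to keep reading)";

app.get("/artigo", (req, res) => {
  const userAgent = req.get("User-Agent") ?? "";
  // Naive check: treat requests that claim to be Googlebot as the indexer.
  const looksLikeIndexer = /Googlebot/i.test(userAgent);

  // The known indexer receives the whole text, so everything gets indexed;
  // an ordinary visitor receives only the capped version.
  res.send(looksLikeIndexer ? fullArticle : cappedArticle);
});

app.listen(3000);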

Obviously, you can also send all of the content to the client and limit it via JavaScript. This protects nothing; it only disguises, since the content is all there. It may be enough to stop a layman, but it offers no real protection.
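
A sketch of that client-side trick in TypeScript (the .premium class and the cookie name are hypothetical): the full text has already been downloaded, so anyone who looks at the page source, or disables JavaScript, sees everything.

// Runs in the browser after the complete article has already been delivered.
const isSubscriber = document.cookie.includes("assinante=1"); // hypothetical cookie

if (!isSubscriber) {
  // Only hides the paragraphs visually; the text is still in the page source.
  document.querySelectorAll<HTMLElement>(".premium").forEach((el) => {
    el.style.display = "none";
  });
}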

I take this opportunity to address the myth that indexers run JavaScript. Yes, some do, but not all, and they cannot simulate user actions the way a real user does. So don’t count on indexing if there is user interaction, or anything else that depends on things the indexer cannot do; new things that indexers cannot simulate come up all the time. Programmatic code exists on the page precisely to define flows that are not standardized, and that, by definition, makes it impossible in practice to simulate everything that can occur.

If you want to protect the content, control it on the server. Obviously, that only controls what is displayed; it does not prevent someone from copying the content and posting it elsewhere, even automatically. It’s good to be clear about this, because some people think Santa Claus exists.
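
A minimal sketch of that server-side control, again assuming Express; the path prefix and the token check are hypothetical stand-ins for a real session or authentication mechanism:

import express from "express";

const app = express();

// Everything under /restrito requires authorization before anything is sent.
app.use("/restrito", (req, res, next) => {
  const authorized = req.get("Authorization") === "Bearer subscriber-token"; // hypothetical check
  if (!authorized) {
    res.status(401).send("Subscribers only.");
    return;
  }
  next();
});

app.get("/restrito/artigo", (_req, res) => {
  res.send("Full text of the article, delivered only to authenticated clients.");
});

app.listen(3000);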

  • So what really happens is complete indexing by the search engines, plus programming logic on the server that controls what users can and cannot see, and that control can be done via JavaScript (hiding the content) or via PHP (some logic running on the server side), is that it?

  • Yeah, but like I said, it doesn’t really protect anything. Either you protect the content or you let it be indexed; you can’t do both. That’s why I added the last sentence.

  • I get it, but a hypothesis has just occurred to me: imagine that an article is split in half, the first half is posted and published and the other half is not (in this case there is no way to index it, since it does not exist yet); then the user logs in and the server quickly sends the missing chunk (the second part of the post). In this hypothetical case, didn’t protection and indexing happen together?

  • No, just the protection: the other part isn’t visible to anyone, so it hasn’t been indexed. If it had been indexed, it could be seen through the indexer’s cache, and then it would not be protected. (See the sketch right after these comments.)

  • Fantastic, thank you very much.
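
A sketch of the hypothesis discussed in these comments, with hypothetical routes: the public page only ever contains the first half, and the second half lives behind an authenticated endpoint, so there is nothing extra for a crawler to index.

import express from "express";

const app = express();

const firstHalf = "First half of the article (public and indexable).";
const secondHalf = "Second half of the article (never sent to anonymous clients).";

// Public page: crawlers and visitors alike only ever receive the first half.
app.get("/artigo", (_req, res) => {
  res.send(firstHalf);
});

// Fetched by the page after login to fill in the missing chunk.
app.get("/artigo/segunda-parte", (req, res) => {
  const loggedIn = req.get("Authorization") === "Bearer subscriber-token"; // hypothetical check
  if (!loggedIn) {
    res.status(401).send("Login required.");
    return;
  }
  res.send(secondHalf);
});

app.listen(3000);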

Taking Googlebot as an example...

According to this article on Kissmetrics, Google’s crawler indexes the entire page, including the title, the description, the alt attributes of images, and all of the content.

According to this other article, on Search Engine Land, Googlebot is even able to process JavaScript in order to index content added to the DOM dynamically. This other article shows comprehensive indexing tests of pages built with the most popular JavaScript frameworks and libraries (spoiler: it seems it still does not handle AngularJS v2 very well; Google’s own framework, go figure...).

Since indexing depends on the crawler being able to reach the page, a link to it needs to exist somewhere on the internet, or its indexing must be explicitly requested through Google’s webmaster tools.

So if a web crawler is able to reach a restricted content area and index it, so is a human being, and the content is no longer restricted. For a restricted content area to be effective, there should be no links to it that are not blocked by some kind of authentication.

The robots.txt file represents a map of what the webmaster does or does not want to be indexed by crawlers, for example:

User-agent: *
Disallow: /restrito/

Robots from the big search engines tend to obey these guidelines, but remember that not everyone respects the rules and good-neighbour policies. If the content is restricted, display it only under authentication.

Therefore, Googlebot seems to be able to index all of the content of a page, including what is "revealed" dynamically via JavaScript, but not what sits behind an authentication check, such as a subscriber area. If the webmaster of that article wants it to be indexed under terms that only appear in the restricted content, he must somehow copy them to the public part of the page, for instance as keywords in the meta description tag.
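
For example, a sketch of such a public page, assuming an Express route (the route, title and description text are all illustrative): the key terms of the restricted body are repeated in the title, the meta description and a short teaser, while the rest stays behind the login.

import express from "express";

const app = express();

app.get("/artigo-restrito", (_req, res) => {
  // Only this public part is visible to crawlers; the keywords of the
  // restricted body are copied into the title, description and teaser.
  res.send(`<!DOCTYPE html>
<html>
  <head>
    <title>Article title carrying the main keywords</title>
    <meta name="description" content="Summary repeating the key terms of the restricted text.">
  </head>
  <body>
    <p>Public teaser: first paragraph of the article...</p>
    <p>Sign in to read the rest.</p>
  </body>
</html>`);
});

app.listen(3000);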

So, finally answering your question: yes, it is possible for websites to be indexed "in half", because crawlers can only index what they can access. Restricted areas are not indexed.

  • Okay, maybe I’ve confused JS in general with this note about Ajax, but here’s the announcement anyway: https://webmasters.googleblog.com/2015/10/deprecating-our-ajax-crawling-scheme.html

  • Apparently Googlebot does process JS; at least this other article demonstrates even more complete tests with popular frameworks and libraries. I’ll include a robots.txt example, thanks!

  • Indeed, I mistook it for the announcement about Ajax...

  • @nunks.lol if the robots.txt file serves only as a notice and does not actually prevent indexing (since some crawlers may not obey the good-neighbour policy), then is restricting content only effective if there is programming logic on the server side?

  • @Jonathasb.Cavalcante Exactly. To ensure that no one indexes the restricted area of your site, you have to include this in your own logic. It is not advisable to trust that the crawler’s programmer will respect robots.txt. Google and Microsoft say they respect it, but there is a multitude of other crawlers and individual web scrapers that may not. Think of crawlers as potentially malicious human visitors: if you give access to restricted content without checking whether they have authorization, they will access that content.

  • I understand, I appreciate the answer. =)
