What HTTP methods can a Crawler not track?


A conceptual question (or maybe not):

Of the HTTP methods, which ones cannot be "tracked" (i.e., crawled or interpreted) by a crawler?

  • POST
  • GET
  • PUT
  • PATCH
  • DELETE

Can someone with knowledge on the subject answer this?

2 answers



In theory, crawlers usually perform only safe and idempotent methods - OPTIONS, GET, HEAD.

From the book "Cloud Standards: Agreements That Hold Together Clouds": "Web crawlers, for example, use only safe methods to avoid disturbing data on the sites they crawl."

Which makes perfect sense for the purpose of a crawler, if we think about it logically.
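To summarize the classification, here is a minimal sketch in Python (the dictionary itself is my own illustration; the safe/idempotent properties come from RFC 2616 section 9.1, plus PATCH from RFC 5789):

 # Which of the methods in the question a spec-following crawler would send.
 # Format: method -> (safe?, idempotent?) per RFC 2616 / RFC 5789.
 HTTP_METHODS = {
     "GET":     (True,  True),
     "HEAD":    (True,  True),
     "OPTIONS": (True,  True),
     "PUT":     (False, True),
     "DELETE":  (False, True),
     "POST":    (False, False),
     "PATCH":   (False, False),
 }

 crawlable = [m for m, (safe, idempotent) in HTTP_METHODS.items() if safe]
 print(crawlable)  # ['GET', 'HEAD', 'OPTIONS']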

A great reference on the subject is https://www.whitehatsec.com/blog/http-methods/

Idempotency and safety are important attributes of HTTP methods. An idempotent request can be called repeatedly with the same results as if it had only been executed once. If a user clicks a thumbnail of a cat picture and every click of the picture returns the same big cat picture, that HTTP request is idempotent. Non-idempotent requests can change each time they are called. So if a user clicks to post a comment, and each click produces a new comment, that is a non-idempotent request.

Safe requests are requests that don't alter a resource; non-safe requests have the ability to change a resource. For example, a user posting a comment is using a non-safe request, because the user is changing a resource on the server; however, the user clicking the cat thumbnail is making a safe request, because clicking the cat picture does not change the resource on the server.

Production-safe crawlers consider certain methods as always safe and idempotent, e.g. GET requests. Consequently, crawlers will send GET requests arbitrarily without worrying about the effect of repeated requests or that the request might change the resource. However, safe crawlers will recognize other methods, e.g. POST requests, as non-idempotent and unsafe. So, good web crawlers won't send POST requests.
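As an illustration of that last point, a minimal sketch in Python using the requests library (the whitelist and the httpbin URL are assumptions for the example, not part of any real crawler):

 import requests

 SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}  # safe per RFC 2616 section 9.1.1

 def crawl(url, method="GET"):
     # A polite crawler refuses any method that could modify server state.
     if method.upper() not in SAFE_METHODS:
         raise ValueError(f"refusing unsafe method {method} for {url}")
     return requests.request(method, url, timeout=10)

 crawl("http://httpbin.org/get")                    # allowed
 # crawl("http://httpbin.org/post", method="POST")  # raises ValueError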

RFC 2616 on Safe and Idempotent Methods: https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

9.1.1 Safe Methods

Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.

In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered "safe". This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

9.1.2 Idempotent Methods

Methods can also have the property of "idempotence" in that (aside from error or expiration issues) the side-effects of N > 0 identical requests is the same as for a single request. The methods GET, HEAD, PUT and DELETE share this property. Also, the methods OPTIONS and TRACE SHOULD NOT have side effects, and so are inherently idempotent.

However, it is possible that a sequence of several requests is non-idempotent, even if all of the methods executed in that sequence are idempotent. (A sequence is idempotent if a single execution of the entire sequence always yields a result that is not changed by a reexecution of all, or part, of that sequence.) For example, a sequence is non-idempotent if its result depends on a value that is later modified in the same sequence.

A sequence that never has side effects is idempotent, by definition (provided that no concurrent operations are being executed on the same set of resources).
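To make the RFC's point about sequences concrete, a hypothetical sketch (the counter endpoint is made up for the example): each request below is idempotent on its own, but the sequence as a whole is not, because the PUT body depends on a value read earlier in the same sequence.

 import requests

 BASE = "http://example-api.local/counter"  # hypothetical resource

 # Idempotent on their own: repeating the GET returns the same value,
 # and repeating the PUT of this fixed body has the same effect.
 value = int(requests.get(BASE).text)      # e.g. reads 5
 requests.put(BASE, data=str(value + 1))   # stores 6

 # But re-running the WHOLE sequence reads 6 and stores 7 - the result
 # depends on a value that the sequence itself modified earlier.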

  • I think it would help if you put one more block underneath translating the highlighted term. The way it is, it doesn't make much sense on SOPT.

  • Unfortunately, although this Stack is in Portuguese, the world we live in is full of English content. This means, even if we don't like it, learning to live with the language.

  • Friend, the question is which HTTP method a crawler cannot track...

  • In programming we will always come across English; a translation would be nice, but I don't think it's mandatory. It would be interesting to bring this question to our community's Meta.

  • @Marllonnasser, a crawler can track any HTTP verb.

  • @Marllonnasser the concepts are directly linked.

  • Is there any "scientific proof" of this? Any reliable source that shows this concept?

  • The site is in Portuguese, so it is natural that we expect an answer in Portuguese. You cannot assume that every user who participates here is required to understand English. I myself do not know English, and the answer would not help me if I had a similar problem.

  • RFC on Safe and Idempotent Methods: https://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

  • @Diegof This discussion is not quite resolved yet. http://meta.pt.stackoverflow.com/questions/883/citando-conteudo-ingles/884#884

  • Feel free to translate the content, except for technical expressions such as Crawler, PUT, GET, OPTIONS, HEAD, and also RFC, where applicable.

  • @Marllonnasser, scientific proof? I don't think you understand the least bit of what you're asking. You can test the code and read the references that were given to you. There is no way to have scientific proof if you do not have a criterion of falsifiability.

  • Soon you'll be bringing up Karl Popper, haha.

  • "Web crawlers, for example, use only safe methods to avoid disturbing data on the sites they Crawl" - Book "Cloud Standards: Agreements That Hold Together Clouds"

  • @Isvaldofernandes You don't need to translate the entire quote word for word, but a brief explanation of what the quote says (it doesn't need to be an ENEM essay either) already makes the answer much more complete. I understand the difficulties of translating technical things, but a summary already helps readers get the idea.

  • @Diegof Idea summarized in the edit to the answer.

  • @Isvaldofernandes that's not the question. I have no falsifiability criterion because I don't have the concept. And I can't rely on "oh, it works that way because so-and-so on SOPT said so," do you agree? But okay... I think the answers convinced me.



This is independent of the crawler; you can simulate any request.

Curl

 curl --request POST 'http://www.somedomain.com/'
 curl --request DELETE 'http://www.somedomain.com/'
 curl --request PUT 'http://www.somedomain.com/'

source: Link

Python

>>> import requests
>>> r = requests.put("http://httpbin.org/put")
>>> r = requests.delete("http://httpbin.org/delete")
>>> r = requests.head("http://httpbin.org/get")
>>> r = requests.options("http://httpbin.org/get")

source: Link

Java

String url = "http://www.somedomain.com/";
GetRequest getRequest = Unirest.get(url);
GetRequest headRequest = Unirest.head(url);
HttpRequestWithBody postRequest = Unirest.post(url);
HttpRequestWithBody putRequest = Unirest.put(url);
HttpRequestWithBody patchRequest = Unirest.patch(url);
HttpRequestWithBody optionsRequest = Unirest.options(url);
HttpRequestWithBody deleteRequest = Unirest.delete(url);

Lib:Link

  • Friend, the question is which HTTP method (if any) a crawler cannot track...

  • @Marllonnasser, the answer was clear: there is none.
