Prevent the curl command from downloading my entire website. Is it possible?


-2

How could I change my HTML code so that a query made with curl doesn’t dump the site’s entire HTML to the console? It looks like a security flaw that I don’t know how to fix on the personal website I’m building. Can someone please help?

For the command $ curl http://www.uol.com.br the response is:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://www.uol.com.br/">here</a>.</p>
</body></html>

For Amazon, $ curl http://amazon.com the response is:

<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>Server</center>
</body>
</html>

When I do the same with the personal site I’m creating (just HTML, CSS and JS, stored in an AWS S3 bucket), curl downloads the entire site. How can I fix that? Thank you very much.

2 answers

0


It turns out that the URL you tried to access has a redirect, and curl returned the first response it found, without following that redirect.

Try the correct URL, for example curl https://www.amazon.com/, or use the -L parameter so that curl follows redirects: curl -L http://amazon.com, and you’ll see that it works.
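A quick way to see this in practice is to ask curl for the headers first and then let it follow the redirect. This is a generic sketch; the exact status line and Location header you get back can vary:

# show only the response headers; a 301 reply carries a Location header
curl -sI http://amazon.com

# follow redirects automatically until the final page is returned
curl -sL http://amazon.com

# the same, but limiting how many redirects curl may follow
curl -sL --max-redirs 5 http://amazon.com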

I think you’re misunderstanding the concept: do you want to prevent the site’s HTML from being returned for a valid request? curl is just a client, it could just as well be the browser, so you would be preventing anyone from accessing your site.

In addition, curl will not "download your entire site"; it only returns what is at that URL, just as if it were opened in the browser.

  • Thanks Ricardo, I was only worried about security. I made a simple website and hosted it on AWS in an S3 bucket, but I did set up TLS and enabled HTTPS with a certificate, even using CloudFront, so I think I’ll leave it that way for now. Thanks for the feedback.

-2

curl always returns the site’s HTML. What you are seeing in your test is a redirect from http to https. If you change those tests to make the request with https instead of http, you will get the HTML you expect, e.g. curl https://www.uol.com.br.

Note that the same can also happen to add or remove the www, a trailing slash, and other things. To be sure, open the site in question in your browser and check the address that appears (or follow the "trail" of redirects in curl’s responses).
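A sketch of how to follow that trail with curl itself (assuming a reasonably recent curl; uol.com.br is just the example URL from the question):

# print the headers of every response in the redirect chain
curl -sIL http://www.uol.com.br

# or only print the final URL curl ends up at after following all redirects
curl -s -o /dev/null -w '%{url_effective}\n' -L http://www.uol.com.br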

As for the first question, it is not possible to "protect" your website’s HTML, CSS or JS. It is possible to make it harder to read and understand, but that is the kind of thing that doesn’t make much sense and is usually treated as a joke (Scott Hanselman even made this joke at the opening of NDC Porto 2020).

If that is the objective, it is possible to make the server respond differently depending on which browser is making the request, or on whether the client is a program that is not a browser at all. In general this is done through headers; it does not work as "protection", though, since the same headers can be added to any request so that it passes as a browser.
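For illustration, this is roughly how a client adds such headers. The User-Agent string below is just an arbitrary browser-like value, not something any particular server is known to check for:

# send a browser-like User-Agent with -A
curl -A 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' https://www.example.com/

# or set any header the server might inspect with -H
curl -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)' -H 'Accept-Language: en-US' https://www.example.com/

This is exactly why header checks are not protection: anything the server looks at can be reproduced by the client.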

Another possibility is to keep only a "shell" in the original HTML and load the whole site with JS, replacing the content of <body>, for example. A good example of this are the sites generated on the WIX platform.

Another example where curl does not return what you would expect is Amazon.com, where the result of curl https://www.amazon.com/ is shown in the image below.

[image: result of running curl against amazon.com]

  • 3

    "This can be done by the user-agent" - only if it is to amaze curious. There is no way to differentiate from a browser by the user-agent for the simple fact that it is enough for a browser equal when using Curl. The only valid answer to this question would be "There is no way" (and could be supplemented with "almost sure there is no reason either"). IF the person wants to avoid data junk, can do other things to minimize automation (put a captcha, for example), but really there are few use cases that justify.

  • I will complement the answer, @Bacco, since I forgot to answer the first question and only covered the curl part for Uol and Amazon. And as you can see in the Amazon example, not only is it possible, they actually did it. I haven’t tested all the possibilities to confirm how they did it.

  • 2

    That’s a flaw in your test, not protection by Amazon. Try setting the headers to match those of an ordinary browser and you will see the same source as in the browser (which won’t be much fun, because it’s full of JS, but it will be the same).

  • Yes, and both questions were answered. And thank you for commenting; when there are no comments, it’s impossible to know what is wrong with the answer.

  • Anyway, just for the record, these downvotes are not mine. I’m commenting only to point out the problems. The biggest problem is the premise of the question.

  • Thanks anyway. I will extend the answer with a few more remarks :)

  • Dias, thanks for the help and for the very detailed reply. I’m a complete newbie with web pages; I made a homemade site and hosted it on AWS. I will study more and give the website a facelift; I think I will use WordPress. That’s what I need to do: make curl return an error when it’s used from the Linux console. I will research it further. Thanks for the help.
