How to get all files used on a page using "requests"?

As we know, when a browser opens a website, it downloads various files such as images, audio, scripts, and CSS for use by the HTML page.

With the requests library, the get function can request the page, but it does not fetch those other files, or give me a list of links for downloading them. So if I want to save a page to my computer, the copy is incomplete. For example:

HTML page:

<!DOCTYPE html>

<html>
    <head>
        <title>JavaScript test</title>
        <meta charset="utf-8">
    </head>

    <body>
        <img src="caozinho.png" alt="Puppy">
    </body>
</html>

Python code to save the page:

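import requests

url = "https://meuSite.com/"
# Fetches only the HTML document; the images, scripts, and CSS it references are not downloaded.
response = requests.get(url)

# Write the raw bytes so the page's original encoding is preserved.
with open("página.html", "wb") as file:
    file.write(response.content)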

Final result:

[image: enter image description here]

Is there any way to get all the files a page uses with requests?

  • The concept of a directory listing does not exist in the HTTP protocol. Whether such a listing is available depends on whether the server is configured to serve an HTML page showing the folder structure, and on the semantics used to build that HTML.

  • Requests alone does not do this, but you can use it as part of the process. The solution is to take response.content, which is the page's HTML, parse it, extract the elements you want, and collect their URLs. For example, you can find all the <img> elements and read the value of the src attribute to list every image URL on the page. Then make a new request for each URL to download the files one by one, exactly as the browser does.

  • But remember that a URL is by definition an opaque value. This means your copy of the site may not be faithful to the original structure; it only reproduces the same (visual) result.

  • Thanks @Woss <3
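
The approach described in the comments (parse the HTML, collect the URLs, request each one) could be sketched roughly as below. This is only a minimal sketch under some assumptions: it uses the standard library's html.parser instead of a dedicated parsing library, the names ASSET_ATTRS, AssetCollector, extract_asset_urls, and save_page are made up for illustration, and the list of tags is a guess at what a typical page references. It does not follow URLs inside CSS files or anything loaded by JavaScript.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse

# Hypothetical selection of tag -> attribute pairs that usually point at files.
# Note that <link href> also matches non-file links such as rel="canonical".
ASSET_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "audio": "src",
    "video": "src",
    "source": "src",
}


class AssetCollector(HTMLParser):
    """Collect absolute URLs of the assets referenced by an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references such as "caozinho.png" against the page URL.
                absolute = urljoin(self.base_url, value)
                if absolute not in self.urls:
                    self.urls.append(absolute)


def extract_asset_urls(html, base_url):
    """Return the asset URLs found in html, in document order, without duplicates."""
    parser = AssetCollector(base_url)
    parser.feed(html)
    return parser.urls


def save_page(url, out_dir="saved_page"):
    """Save the page and every asset it references into one flat directory."""
    # Third-party dependency; imported here so the parsing helpers work without it.
    import requests

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    (out / "index.html").write_bytes(response.content)

    for asset_url in extract_asset_urls(response.text, url):
        # Assumption: the last path segment is good enough as a local file name.
        name = Path(urlparse(asset_url).path).name or "unnamed"
        asset = requests.get(asset_url, timeout=10)
        if asset.ok:
            (out / name).write_bytes(asset.content)
```

Calling save_page("https://meuSite.com/") would then write index.html and the downloaded files side by side. Because everything lands in one flat folder, a reference like src="caozinho.png" keeps working, but nested paths such as "img/caozinho.png" would need the HTML rewritten, and two assets with the same file name would overwrite each other. That is the practical side of the point above about the copy not matching the original structure.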

No answers
