Problem with wget in Python

I’ve run into a problem that has me banging my head against the wall. I am trying to write a script to download books (PDFs) from a Laotian site, because it is practically impossible to download all of them manually. So I tried to do what wget https://lao-online.com/books/download/1.html does, only in Python, changing the links, which follow a pattern. The code went like this:

from time import sleep

import wget

count = 1

while count < 1800:
    url = f'https://lao-online.com/books/download/{count}.html'
    print(f'Going after link number {count}')
    print(url)
    sleep(1)
    filename = wget.download(url)  # download each PDF once
    sleep(1)
    print('Success!')
    count += 1

but for some reason it seems that Python’s wget library won’t let me download the PDF files, even though I was able to download other kinds of media with it. When I try to run the Python code I can’t download anything, and it returns this error:

Traceback (most recent call last):
  File "/home/mathie/Laos/raspagem.py", line 14, in <module>
    wget.download(url)
  File "/home/mathie/.local/lib/python3.9/site-packages/wget.py", line 526, in download
    (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
  File "/usr/lib/python3.9/urllib/request.py", line 239, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/usr/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.9/urllib/request.py", line 523, in open
    response = meth(req, response)
  File "/usr/lib/python3.9/urllib/request.py", line 632, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.9/urllib/request.py", line 561, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

2 answers

1

The error you are getting is not from Python or from wget; it comes from the site you are downloading from, which is where the "403 Forbidden" code is generated. Most likely the user has to be logged in to be able to download the books, so you would have to log in from Python first.

(I went to the site: if you look at the details page of each book, there is in fact, near the footer, the instruction "login to download", with a link to the login page.)

In that case, if there are no other download restrictions, the solution is to use the "requests" library (instead of wget) and set up a "Session", posting the login information to the appropriate address. From then on, within Python, the Session object will carry the same information a browser would have (if it is kept in cookies; some websites use specialized HTTP headers to indicate that the user is logged in, in which case that also has to be handled in the program).

Then, using the "get" method of the "Session" object, the download should work.

Later I may have time to put together a working example (or maybe not); for now, this is the roadmap.
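
A minimal sketch of that approach, assuming a hypothetical login endpoint and form field names (the real URL and the input names have to be taken from the site's login page; nothing here is confirmed by the site itself):

import requests

# Hypothetical values: inspect the site's login form to find the real
# endpoint and the names of the input fields before using this.
LOGIN_URL = 'https://lao-online.com/login'
credentials = {'email': 'you@example.com', 'password': 'your_password'}

with requests.Session() as session:
    # Posting the credentials stores the session cookies inside `session`,
    # so subsequent requests look like they come from a logged-in browser.
    login = session.post(LOGIN_URL, data=credentials)
    login.raise_for_status()

    # The download URLs should then be reachable with the same session.
    response = session.get('https://lao-online.com/books/download/1.html')
    response.raise_for_status()
    with open('book_1.pdf', 'wb') as f:
        f.write(response.content)

If the site marks the login with a special header instead of cookies, the same Session can carry it via session.headers.update({...}).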

  • Well my friend, let’s say Laos doesn’t put much security on its websites to limit downloads: thanks to a loophole, when I access https://lao-online.com/books/download/32.html the site returns the requested PDF directly, and, as I mentioned, plain wget in the terminal downloads the PDF just fine; it’s only in Python that it doesn’t work. I’ll try something with requests. Thanks for the roadmap XD
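
A plausible explanation for that difference (an assumption, not something confirmed in this thread) is the User-Agent string: command-line wget identifies itself as Wget, while the wget Python package goes through urllib and sends the default Python-urllib/3.x agent, which the site may be rejecting with 403. A quick way to test the idea with only the standard library:

import urllib.request

# Assumption: the 403 is triggered by urllib's default User-Agent.
# Sending a browser-like one should then succeed, in the same way the
# requests example in the other answer does.
url = 'https://lao-online.com/books/download/32.html'
req = urllib.request.Request(
    url,
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'},
)
with urllib.request.urlopen(req) as resp, open('book_32.pdf', 'wb') as f:
    f.write(resp.read())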

0


One way to solve this is by using requests:

import requests
import time
import re

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

for i in range(3, 10):
    response = requests.get(f'https://lao-online.com/books/download/{i}.html', headers = header)
    response_headers = response.headers['content-disposition']
    file_name = re.findall('filename=(.+)', response_headers)[0]
    file_name = file_name.replace('"','')
    
    with open(f'./{file_name}', 'wb') as f:
        f.write(response.content)
        
    time.sleep(5)

Importing the necessary packages

import requests
import time
import re

Creating a header so the site does not refuse the request (to avoid the 403 error)

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

Here is where we fetch and save the files. I used 3 up to 10 for testing, but you can change this range

for i in range(3, 10):
    response = requests.get(f'https://lao-online.com/books/download/{i}.html', headers = header)

Checking the response headers to get the file name

    response_headers = response.headers['content-disposition']
    file_name = re.findall('filename=(.+)', response_headers)[0]
    file_name = file_name.replace('"','')

Saving the file

    with open(f'./{file_name}', 'wb') as f:
        f.write(response.content)
        
    time.sleep(5)

From what I saw, the first two or three files are not interesting, so the range starts at 3.
