IndexError: list index out of range in a web crawler loop

I’m getting an error on line 7 of the code, at url=to_crawl[0]: IndexError: list index out of range

import requests
import re

to_crawl=['https://www.globo.com']
crawled=set()
header={'user-agent':'Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0'}

while True:
    url=to_crawl[0]
    try:
        req=requests.get(url, headers=header)
    except:
        to_crawl.remove(url)
        crawled.add(url)
        continue

    html=req.text
    links=re.findall(r'<a href="?\'?(https?:\/\/[^"\'>]*)', html)
    print("Crawling:", url)

    to_crawl.remove(url)
    crawled.add(url)

    for link in links:
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)
  • You have an infinite loop with no break. Sooner or later the values in to_crawl will run out and there will be no index 0, because the list will be empty. When should your code stop running?

  • The program should end once it has gone through the whole site looking for links; when this error appeared I couldn’t find the problem. In that case, should I break?

  • So instead of while True it could be while to_crawl.

  • It no longer gives the same error, but the program isn’t walking the site as it should; it simply skips the instruction and ends. I believe it’s because of the to_crawl in the while!
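The while to_crawl suggestion from the comments can be sketched without touching the network. The link graph below is a made-up stand-in for the pages the crawler would fetch; the point is that the loop condition itself guards against reading index 0 of an empty list:

```python
# Hypothetical link graph standing in for fetched pages.
links_found = {'a': ['b', 'c'], 'b': [], 'c': ['a']}

to_crawl = ['a']
crawled = set()

while to_crawl:            # stops cleanly when the list is empty; no IndexError
    url = to_crawl.pop(0)  # pop(0) combines to_crawl[0] and to_crawl.remove(url)
    crawled.add(url)
    for link in links_found.get(url, []):
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)

print(sorted(crawled))  # → ['a', 'b', 'c']
```

Each iteration shrinks or grows the worklist, and the loop ends exactly when there is nothing left to visit.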

1 answer


Inside your while block you take the first item of the to_crawl list to make the connection, and that request sits inside the try/except. When an error occurs, execution enters the except block and removes the only URL that was in the list, leaving it empty. The loop then moves on to the next iteration, which again tries to assign the first item of the list to the variable url, but the list is now empty, so the IndexError occurs.
What is sending your connection into the except block is the header dictionary: there is a stray '…' character in the user-agent value. When I removed it to test, the request succeeded. So you can take it out, leaving that line like this:

header={'user-agent':'Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0'}
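One way to see why that single character breaks the request, assuming the usual restriction that HTTP header values must be Latin-1 encodable: the '…' (U+2026) character has no Latin-1 representation, so the request fails while being built, before any network I/O, which is what lands the code in the except block. The latin1_safe helper below is just for illustration:

```python
bad_value = 'Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0'
good_value = 'Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0'

def latin1_safe(value):
    """Return True if the header value survives Latin-1 encoding."""
    try:
        value.encode('latin-1')
        return True
    except UnicodeEncodeError:
        return False

print(latin1_safe(bad_value))   # the '…' makes encoding fail → False
print(latin1_safe(good_value))  # without it, the value encodes fine → True
```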

A possible fix for your except block would be to check whether the list is empty and, if it is, end the script, for example:

try:
    req=requests.get(url, headers=header)
except:
    to_crawl.remove(url)
    crawled.add(url)
    if not to_crawl:
        break
    continue
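Putting the pieces together, here is a sketch of the whole crawler using while to_crawl instead of while True (via collections.deque, whose popleft() is O(1) where list.pop(0) is O(n)), catching requests.RequestException rather than a bare except, and adding an assumed max_pages safety cap that was not in the original code:

```python
import re
from collections import deque

import requests

def extract_links(html):
    """Pull absolute http(s) links out of anchor tags (same regex as the question)."""
    return re.findall(r'<a href="?\'?(https?:\/\/[^"\'>]*)', html)

def crawl(start_url, max_pages=50):
    """Breadth-first crawl; stops when the queue empties or max_pages is reached."""
    header = {'user-agent': 'Mozilla/5.0 (X11; Linux i686; ) Gecko/20100101 Firefox/62.0'}
    to_crawl = deque([start_url])
    crawled = set()
    while to_crawl and len(crawled) < max_pages:  # empty queue ends the loop
        url = to_crawl.popleft()
        crawled.add(url)
        try:
            req = requests.get(url, headers=header, timeout=10)
        except requests.RequestException:  # narrower than a bare except
            continue
        print("Crawling:", url)
        for link in extract_links(req.text):
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)
    return crawled
```

Calling crawl('https://www.globo.com') would then walk the site until the queue empties or the assumed max_pages cap is hit.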

I hope I’ve helped! :)
