I’m getting an error on line 7 of the code, which is url=to_crawl[0]:
IndexError: list index out of range
import requests
import re
to_crawl=['https://www.globo.com']
crawled=set()
header={'user-agent':'Mozilla/5.0 (X11; Linux i686; …) Gecko/20100101 Firefox/62.0'}
while True:
    url=to_crawl[0]
    try:
        req=requests.get(url, headers=header)
    except:
        to_crawl.remove(url)
        crawled.add(url)
        continue
    html=req.text
    links=re.findall(r'<a href="?\'?(https?:\/\/[^"\'>]*)', html )
    print("Crawling:", url)
    to_crawl.remove(url)
    crawled.add(url)
    for link in links:
        if link not in crawled and link not in to_crawl:
            to_crawl.append(link)
You have an infinite loop with no break. Sooner or later the values in to_crawl will run out and there will be no index 0, because the list will be empty. When should your code stop running?
– Woss
The program should end once it has gone through the whole site looking for links. When this error appeared I couldn't find the problem. In that case, should I use break?
– C.J
So instead of while True it could be while to_crawl
– Woss
It no longer gives the same error, but the program isn't going through the site as it should: it simply skips the loop and ends. I believe it's because of the to_crawl in the while!
– C.J
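For reference, here is a minimal sketch of the while to_crawl idea from the comments above. It is not the asker's final code. The names crawl, fetch, seen and max_pages are assumptions made for illustration: fetch is a hypothetical callable that takes a URL and returns the page's HTML, which keeps the loop testable without a network. The loop relies on an empty queue being falsy, so it stops on its own when there is nothing left to visit. Failures are printed rather than silently swallowed, which may help show why a run ends earlier than expected (for example, if the very first request fails, the queue is empty right away).

```python
import re
from collections import deque

# Same pattern as in the question, without the unnecessary escapes.
LINK_RE = re.compile(r'<a href="?\'?(https?://[^"\'>]*)')


def crawl(start_url, fetch, max_pages=100):
    """Visit pages in order of discovery until the queue is empty.

    fetch is a hypothetical callable (url -> HTML text) supplied by the
    caller; max_pages is an assumed safety limit, not part of the question.
    Returns the list of URLs that were fetched successfully.
    """
    to_crawl = deque([start_url])
    seen = {start_url}      # every URL ever queued, so nothing is queued twice
    crawled = []

    # An empty deque is falsy, so the loop ends when no URLs remain.
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        try:
            html = fetch(url)
        except Exception as exc:
            # Report the failure instead of hiding it, then move on.
            print("Failed:", url, exc)
            continue

        print("Crawling:", url)
        crawled.append(url)

        for link in LINK_RE.findall(html):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)

    return crawled
```

With the requests library from the question, fetch could be something like lambda url: requests.get(url, headers=header, timeout=10).text, where header is the dictionary defined in the question and the timeout value is an arbitrary choice.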