Problem collecting links from a website

Dear friends, good morning! I am writing a program in Python to collect the links of a website. The part of the code that collects the links is:

links = driver.find_elements_by_xpath('//*[@href]')
for link in links:
    print(link.get_attribute('href'))
time.sleep(1)

I tested it on some websites and it ran fine. The problem appears when I use it on iFood: it collects some links and then raises several errors. I’m very new to programming, so I don’t know what these errors mean or how to get around them. If anyone can help me, I would be very grateful! Thanks =)

What the code returns:

https://d1jgln4w9al398.cloudfront.net/imagens/ce/wl/www.ifood.com.br/favicon.ico
https://d1jgln4w9al398.cloudfront.net/site/2.1.238-20181023.22/css/main.css
https://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,600italic,700italic,800italic,400,300,600,700,800
https://www.ifood.com.br/

Traceback (most recent call last):
  File "C:\Users\jorda\Desktop\Python - Projetos\digitar ifood.py", line 32, in <module>
    print(link.get_attribute('href'))
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 143, in get_attribute
    resp = self._execute(Command.GET_ELEMENT_ATTRIBUTE, {'name': name})
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python37-32\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=70.0.3538.77)
  (Driver info: chromedriver=2.42.591088 (7b2b2dca23cca0862f674758c9a3933e685c27d5),platform=Windows NT 10.0.17134 x86_64)
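
The traceback ends in StaleElementReferenceException, which Selenium raises when an element handle obtained earlier no longer belongs to the current page document, for example because the page's JavaScript re-rendered that part of the page between find_elements_by_xpath and get_attribute. One possible way around it, offered only as a sketch, is to read every href inside a single execute_script call so that no element handles are kept on the Python side. The names collect_hrefs and _COLLECT_HREFS_JS and the default selector "[href]" are illustrative assumptions, not part of the original program; driver is assumed to be the already-open Selenium WebDriver from the question.

```python
# JavaScript executed inside the page. It returns plain strings rather than
# element handles, so nothing can go stale on the Python side afterwards.
_COLLECT_HREFS_JS = """
return Array.from(document.querySelectorAll(arguments[0]),
                  function (el) { return el.href; });
"""


def collect_hrefs(driver, css_selector="[href]"):
    """Return the href of every element matching css_selector.

    `driver` is assumed to be an open Selenium WebDriver (hypothetical helper,
    not from the original post). The selector is passed to the script as
    arguments[0].
    """
    hrefs = driver.execute_script(_COLLECT_HREFS_JS, css_selector)
    # Keep only non-empty strings (some elements, such as SVG links,
    # do not expose href as a plain string).
    return [h for h in hrefs if isinstance(h, str) and h]
```

With the driver from the question, this could be used as `for href in collect_hrefs(driver): print(href)`. If the page is still loading content when the call is made, the list may be incomplete, so a wait before the call may still be needed.
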
  • All right? I took a look at the iFood site, and one of the "href" values is "#content"; maybe your code does not handle this kind of value. It is an in-page link reference. StaleElementReferenceException: your program found an element that was then changed by JavaScript ("#content"). This user has an answer that can help you: https://stackoverflow.com/a/43879738/8152489

  • Looking at the source code, the links I want are inside elements that are in formats like this: <a class="Restaurant-card-link" href="/delivery/Jundiai-sp/china-in-box---Jundiai-anhangabau" data-Event="Selectionourestaurante" data-Rid="a49ecb3c-bb1a-461d-aab0-5f8bf3ecaa48" data-name="China in Box Jundiaí" data-price="60" data-Brand="" data-Category="Chinese" data-cuisine="Chinese" data-position="3" data-pos="3" data-Evaluation="4.3" data-Distance="9.92" data-delivery-time="50" title="Ordering China in Box - Jundiaí | iFood Delivery">

  • You said that it collects some links and then fails. Are the links it collects the ones you want? One thing that may keep your program from failing: add a condition so that it skips any href that contains JavaScript or that starts with "#", for example.

  • Actually, I think it’s the other way around. It collects some links that I don’t want, and the ones I do want it doesn’t get. Is it because of Java? Do you have any idea how I can modify it to work, or can you point me to where I can learn how to do it?

  • Hey! I found a similar problem to yours, here’s the link: https://stackoverflow.com/questions/43877140/get-all-links-from-driver-find-elements-by-href-not-working I hope it helps!

  • Wanderson, thanks for the help. The problem in that link really does look like mine (apparently). I applied the solution, but the errors continued. In that solution, the elements are located by tag. Since that did not work, I tried by xpath and by class_name, but in those cases the only thing returned was a []. I’ll keep fighting with it here! haha
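
The filtering idea raised in the comments (skipping href values that are JavaScript or "#" anchors) could look roughly like the sketch below. It works on plain strings, so it applies after the href values have been read. Note that Selenium's get_attribute('href') normally returns the resolved absolute URL, so "#content" would show up as something like https://www.ifood.com.br/#content; for that reason the check looks at the URL fragment instead of the first character. The names is_page_link and filter_links are hypothetical, and treating every URL with a fragment as unwanted is a rule of thumb, not something confirmed for this site. This only removes unwanted links; on its own it does not prevent the StaleElementReferenceException.

```python
from urllib.parse import urlparse


def is_page_link(href):
    """Return True for http(s) URLs that carry no "#fragment" part."""
    if not href:
        return False  # None or empty string
    parts = urlparse(href)
    if parts.scheme not in ("http", "https"):
        return False  # skips javascript:, mailto:, and similar schemes
    if parts.fragment:
        return False  # skips in-page anchors such as ".../#content"
    return True


def filter_links(hrefs):
    """Keep only page links, dropping duplicates while preserving order."""
    seen = set()
    kept = []
    for href in hrefs:
        if is_page_link(href) and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept
```

For example, `filter_links(["https://www.ifood.com.br/", "https://www.ifood.com.br/#content", "javascript:void(0)"])` keeps only the first entry.
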

No answers
