XPath with Python: get the text after a tag inside a div

I'm trying to get the text that follows a tag inside a div in an HTML page. The problem is that I get an empty string instead of the text. I've looked elsewhere and haven't seen anyone with a similar problem :/

Here is the HTML:

<div class="list-view-item-title-wrapper">
    <div class="list-view-item-title-top">
        <div class="list-view-item-type">
            "Webcast"
        </div>
    </div>
    <a href="/resources/actionable-awareness-unlock-your-influence" class="list-view-item-title">
        <h2>
            "Actionable Awareness: Unlock Your Influence"
        </h2>
    </a>
    <div class="list-view-item-date">
        <i class="fa fa-calendar"></i>
        "September 24, 2020"
    </div>
    ...
</div>

And the Python:

def get_posts_elements(self, html):
    posts = self.get_posts(html)

    # - get_posts -> returns html.xpath("//div[@class='list-view-item-title-wrapper']")
    # - html -> lxml.html.fromstring(requests.get('https://www.scrum.org/resources').content)
    
    for post in posts:

        # --- Retrieved successfully:
        try:
            self.data['Type'].append(post.xpath(".//div[@class='list-view-item-type']")[0].text.strip())
        except:
            self.data['Type'].append('')

        try:
            self.data['Title'].append(post.xpath(".//a[@class='list-view-item-title']/h2")[0].text.strip())
        except:
            self.data['Title'].append('')
        
        try:
            self.data['Link'].append(urljoin(self.base_url, post.xpath(".//a[@class='list-view-item-title']/@href")[0]))
        except:
            self.data['Link'].append('')


        # --- Fails to retrieve:
        data = post.xpath(".//div[@class='list-view-item-date']")[0].text
        print(data)

In this case, I want to extract the date of each post, as I do with the title and type. In the example above it would be "September 24, 2020", but I only get an empty string.

My imports:

import lxml.html as parser
import requests
from urllib.parse import urlsplit, urljoin

  • Does it have to be XPath? Have you tried Beautiful Soup with Selenium? I always get good results with them.

  • I have used Beautiful Soup and Selenium and also had good results, but this time I need to use only XPath :/

1 answer

I believe I was able to solve it using XPath's descendant axis. I used

post.xpath(".//div[@class='list-view-item-date']/descendant-or-self::*/text()")[1]

Instead of

post.xpath(".//div[@class='list-view-item-date']")[0].text

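In short, /descendant-or-self::* selects the node itself plus all of its children and grandchildren, so the text() step that follows collects every text node under the div. The original attempt failed because, in lxml, an element's .text holds only the text before its first child element, and the date comes after the <i> tag. With the new expression I finally got the text. I also had to change the index, since the text I want is always the second item in the list.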
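
The behaviour can be checked on a cut-down copy of the markup from the question. This is only a sketch under a few assumptions: the snippet string and the variable names are mine, and I left out the quotes around the date on the guess that they come from how the browser inspector displays text nodes. It compares .text with three ways of reaching the text after the <i> tag.

```python
import lxml.html

# Cut-down copy of the markup from the question (quotes around the date omitted).
snippet = """
<div class="list-view-item-title-wrapper">
    <div class="list-view-item-date">
        <i class="fa fa-calendar"></i>
        September 24, 2020
    </div>
</div>
""".strip()

post = lxml.html.fromstring(snippet)
date_div = post.xpath(".//div[@class='list-view-item-date']")[0]

# .text holds only the text before the first child element (<i>), so it is blank here.
text_before_icon = (date_div.text or "").strip()

# Option 1: the text that follows <i> is stored as that element's .tail.
date_from_tail = (date_div.xpath("./i")[0].tail or "").strip()

# Option 2: collect the div's own text nodes and keep the first non-blank one,
# which avoids depending on a fixed index such as [1].
text_nodes = date_div.xpath("./text()")
date_from_xpath = next(t.strip() for t in text_nodes if t.strip())

# Option 3: let XPath join and trim the text of the div itself.
date_normalized = post.xpath("normalize-space(.//div[@class='list-view-item-date'])")

print(repr(text_before_icon), date_from_tail, date_from_xpath, date_normalized)
```

Since the <i> element is empty, ./text() should return the same text nodes here as descendant-or-self::*/text(). Filtering out blank strings may be a bit more robust than a fixed index if the whitespace in the page ever changes.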
