Web scraping with Beautifulsoup - find_next does not return text

Question

Asked 5 years, 3 months ago

Viewed 345 times

I want to extract the text from the section below:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000">Mon 9 Mar 2020</div>

The text I want is "Mon 9 Mar 2020". But when I run:

date = match_bar[0].find_next('div', {'class': 'matchDate renderMatchDateContainer'})

I get the following back, without the text itself:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

When I add .text, the result is empty. I don’t have much experience with HTML.

UPDATE:

I realized that when I execute the code:

my_url = 'https://www.premierleague.com/match/{}'.format(i)
client = urlopen(my_url)
page_html = client.read()

The element in question already comes back like this, without the text:

<div class="matchDate renderMatchDateContainer" data-kickoff="1583784000000"></div>

In the browser, however, I can see the text.

Could anyone help? Thank you.

2 answers

Now that I have access to the link, I understand your problem better.

You are not getting the text because the site fills in that value with scripts that run while the page loads. A plain HTTP request returns only the initial HTML skeleton, in which the div is still empty.

One possible solution, and the one I recommend, is the browser automation library Selenium. To keep processing to a minimum, we add an argument that runs the browser in headless mode, so the automated browser window is not displayed. Selenium loads the page, and we can then read the HTML after its values have been filled in.

If you have never worked with Selenium, I strongly recommend reading the documentation, since you will need to download the driver and adjust the path given in "executable_path". The code below solves the problem:

from bs4 import BeautifulSoup
from selenium import webdriver

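def obterCodigoFonte(url):  # "get source code"
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')  # do not open a visible browser window
    # Selenium 3 style arguments; newer Selenium 4 releases replace them
    # with service=Service(path) and options=chrome_options.
    driver = webdriver.Chrome(executable_path=r'.\chromedriver.exe', chrome_options=chrome_options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # close the headless browser so it does not keep running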

def processarCodigoFonte(cf):
    soup = BeautifulSoup(cf, 'html.parser')
    getValueFromDiv = soup.find('div', class_='matchDate renderMatchDateContainer')
    return getValueFromDiv.text


url = 'https://www.premierleague.com/match/46889'
codigoFonte = obterCodigoFonte(url)
print(processarCodigoFonte(codigoFonte))
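
If only the date is needed, a browser may not be necessary: the empty div from the question still carries a data-kickoff attribute. The sketch below assumes that value is a Unix timestamp in milliseconds (an assumption, although it is consistent with the date shown in the browser). The helper name kickoff_date is made up for illustration.

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup

# Static HTML as returned by urlopen in the question: the div has no text yet.
static_html = ('<div class="matchDate renderMatchDateContainer" '
               'data-kickoff="1583784000000"></div>')


def kickoff_date(html):
    """Hypothetical helper: build the date string from data-kickoff.

    Assumes the attribute holds a Unix timestamp in milliseconds.
    """
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', class_='matchDate renderMatchDateContainer')
    if div is None or not div.has_attr('data-kickoff'):
        return None
    # Convert milliseconds to seconds and interpret the result as UTC.
    kickoff = datetime.fromtimestamp(int(div['data-kickoff']) / 1000, tz=timezone.utc)
    # kickoff.day avoids the zero padding that %d would add ("9", not "09").
    return f"{kickoff:%a} {kickoff.day} {kickoff:%b %Y}"


print(kickoff_date(static_html))  # Mon 9 Mar 2020
```

The date here is computed in UTC, so for matches that kick off close to midnight it could differ from the local date the site displays.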

by Jefferson Matheus Duarte • **168** points · Answer 1 · 2020-04-14T18:28:32+00:00

Hello!

If you want to extract the text from a specific div, the approach you are using is not the right one. Just use the bs4 find function and specify the div's class.

Here is an example of what the code would look like:

from bs4 import BeautifulSoup

html = """<div class='matchDate renderMatchDateContainer' 
          data-kickoff='1583784000000'>Mon 9 Mar 2020</div>"""

soup = BeautifulSoup(html, 'html.parser')

getValueFromDiv = soup.find('div', class_='matchDate renderMatchDateContainer').text

print(getValueFromDiv)

The output of that code is:

Mon 9 Mar 2020
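
One caveat with a two-word class_ string: as far as I know, BeautifulSoup compares it with the exact value of the class attribute, so the lookup depends on the order of the classes in the page. The sketch below uses made-up HTML with the two classes swapped and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Made-up variant of the div with the two classes in the opposite order.
html = """<div class='renderMatchDateContainer matchDate'
          data-kickoff='1583784000000'>Mon 9 Mar 2020</div>"""

soup = BeautifulSoup(html, 'html.parser')

# A multi-word class_ string is matched against the exact attribute value,
# so the div with the swapped class order is not found.
exact = soup.find('div', class_='matchDate renderMatchDateContainer')
print(exact)  # None

# A CSS selector requires both classes but ignores their order.
by_selector = soup.select_one('div.matchDate.renderMatchDateContainer')
print(by_selector.get_text(strip=True))  # Mon 9 Mar 2020
```

If the class order on the page is stable, the find call from the example above is enough; the selector form is only a safeguard.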