Question about using BeautifulSoup

0

The code below is meant to extract the genres of a movie from IMDb. However, I don't know how to target the genres block specifically: sometimes, instead of the genres, it picks up the Keywords block, because find returns the first div that matches.

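def get_genero(soup):
    # find returns only the first <div> with these classes,
    # which may be the Keywords block rather than the genres block
    genero = soup.find('div', attrs={'class': 'see-more inline canwrap'})
    print(genero)
    if genero is not None:
        return [a.text for a in genero.find_all('a')]
    else:
        return None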

I only need the genres of the movies from IMDb. How can I select one specific element on the page using BeautifulSoup?

Link to an example movie page:

https://www.imdb.com/title/tt4575576/?ref_=adv_li_tt

  • Please provide a link to the page you are scraping.

  • https://www.imdb.com/title/tt4575576/?ref_=adv_li_tt

2 answers

2


The problem is the selector: several <div> elements across the page carry these same three classes. Ideally, build a selector that is as specific as possible to the element you want (some browsers offer a "Copy selector" or "Copy XPath" option for an element shown in the "Inspect element" panel).
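
To illustrate the point, here is a minimal sketch that runs against a simplified, hypothetical HTML fragment (the markup and link targets are assumptions, not the real IMDb page). It shows that find returns only the first of several blocks sharing the same classes:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two blocks share the same three classes.
html = """
<div id="titleStoryLine">
  <div class="see-more inline canwrap">
    <h4>Plot Keywords:</h4>
    <a href="/keyword/bear">bear</a>
    <a href="/keyword/friendship">friendship</a>
  </div>
  <div class="see-more inline canwrap">
    <h4>Genres:</h4>
    <a href="/search/title?genres=animation">Animation</a>
    <a href="/search/title?genres=comedy">Comedy</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
wanted = {"class": "see-more inline canwrap"}

# find_all returns every matching block; find returns only the first one.
blocks = soup.find_all("div", attrs=wanted)
first = soup.find("div", attrs=wanted)

first_links = [a.text for a in first.find_all("a")]
print(len(blocks))
print(first_links)
```

In this fragment the first match is the keywords block, so the genre links are never reached.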

Looking at the structure of the page, you can see that the genres are inside the fourth <div> within the element with id='titleStoryLine'. You can then use a CSS selector to get that element:

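from requests import get
from bs4 import BeautifulSoup as bs

# Download the page and parse it with an explicitly named parser
soup = bs(get('https://www.imdb.com/title/tt4575576/?ref_=adv_li_tt').text, 'html.parser')

# All links inside the fourth <div> of the #titleStoryLine element
genres = soup.select('#titleStoryLine div:nth-of-type(4) a')

for genre in genres:
    print(genre.text)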

Resulting in:

Animation
Adventure
Comedy
Drama
Family
Fantasy
  • 1

    Good afternoon. Doing it that way, I got None as the result.

  • I had used the find function instead of select; it is now corrected.
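
The difference between find and select mentioned in the comments can be sketched as follows. The HTML fragment is a hypothetical stand-in for the page (here the genres sit in the third <div>, not the fourth), so only the behavior of the two methods should be read from it:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the genre links live in the third <div>.
html = """
<div id="titleStoryLine">
  <div><p>Summary text</p></div>
  <div><a href="/keyword/bear">bear</a></div>
  <div>
    <a href="/search/title?genres=animation">Animation</a>
    <a href="/search/title?genres=comedy">Comedy</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
selector = "#titleStoryLine div:nth-of-type(3) a"

# find treats its first argument as a tag name, not as a CSS selector,
# so no element matches and the result is None.
found = soup.find(selector)

# select parses the string as a CSS selector and returns a list of matches.
selected = [a.text for a in soup.select(selector)]

print(found)
print(selected)
```

Because select always returns a list, an empty result shows up as [] rather than None.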

0

One solution I came across is to take the entire Storyline div and then select only the links whose href contains the genre search path, as follows:

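def get_genero(soup):
    # Restrict the search to the Storyline block
    genero = soup.find('div', {'id': 'titleStoryLine'})
    if genero is None:
        return None
    # Keep only links whose href points to a genre search;
    # the attribute value must be quoted because it contains / and ?
    links = genero.select('a[href*="/search/title?genres"]')
    return [a.text for a in links]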
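
The href-based filter described in this answer can also be written as a single CSS selector. The sketch below assumes a simplified, hypothetical HTML fragment instead of the live page, and get_genres is an illustrative helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page.
SAMPLE = """
<div id="titleStoryLine">
  <div><a href="/keyword/bear">bear</a></div>
  <div>
    <a href="/search/title?genres=animation">Animation</a>
    <a href="/search/title?genres=comedy">Comedy</a>
  </div>
</div>
"""

def get_genres(html):
    """Return the text of genre links found inside #titleStoryLine."""
    soup = BeautifulSoup(html, "html.parser")
    # href*="..." matches any link whose href contains the quoted substring.
    links = soup.select('#titleStoryLine a[href*="/search/title?genres"]')
    return [a.get_text(strip=True) for a in links]

genres = get_genres(SAMPLE)
print(genres)
```

Matching on the link target does not depend on the position of the <div>, so it should keep working if blocks are added or reordered, as long as the genre links keep the same href pattern.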
