Building a Web Scraper with Python

How's it going? I'm a programming enthusiast and I followed a tutorial on web scraping. I understood the logic, but I run into a problem when one of the data fields is missing on the site. Below are my code and my analysis of the problem:

import requests
from bs4 import BeautifulSoup

URL = "https://www.classcentral.com/subject/data-science"
page = requests.get(URL)


soup = BeautifulSoup(page.content, "html.parser")


Course = []
Duration = []
Start_Date = []
Offered_By = []
No_Of_Reviews = []
Rating = []

def find_2nd(string, substring):
    return string.find(substring, string.find(substring) + 1)

def find_1st(string, substring):
    return string.find(substring, string.find(substring))

for i in soup.findAll("span",{'class' : 'text-1 weight-semi line-tight'}):
    b = str(i)
       
    #print(b  [  find_1st(b,'>')+1  :  find_2nd(b,'<')  ]  )
    Course.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])

course = []
for i in Course:
    i = i.strip()
    print(i)
    course.append(i)


# # Num of Reviews
for i in soup.findAll("span",{'class' : 'large-down-hidden block line-tight text-4 color-gray'}):
    b = str(i)
    print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    No_Of_Reviews.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])

Looking at the site, I found one course that has no reviews. The problem is that when I turn the lists into a DataFrame, a length error occurs. In other words, I cannot build the DataFrame because of this missing value. I left out the rest of the code to keep the post short, since that part works.

Does anyone know how to help me? How can I make the code recognize that this information is missing, store 0 in its place, and carry on with the rest of the program?

1 answer


The problem is that elements with no reviews and elements with one or more reviews have completely different classes. Here are two examples taken from the page:

<span class="large-down-hidden block line-tight text-4 color-gray">24 Reviews</span>

<span class="medium-down-hidden text-4 color-gray italic">No reviews yet.</span>

So I modified the program to first check whether an element has the classes from the first example, since that is the most common case. If not, the program looks for an element with the classes medium-down-hidden text-4 color-gray italic, which indicates that the course has no reviews.

You also don't need the find_1st and find_2nd functions to read the text inside the span element; it is much easier to write object_name.text. The result is the same.
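To illustrate the point about `.text`, here is a minimal sketch comparing the two approaches on the first sample element shown above. The HTML string is hard-coded (an assumption, so the example runs without network access), and `find_1st` is reduced to a single `find` call, which behaves the same as the version in the question.

```python
from bs4 import BeautifulSoup

# Sample element copied from the page excerpt above (hard-coded for illustration).
html = '<span class="large-down-hidden block line-tight text-4 color-gray">24 Reviews</span>'

def find_1st(string, substring):
    # Index of the first occurrence of substring.
    return string.find(substring)

def find_2nd(string, substring):
    # Index of the second occurrence of substring.
    return string.find(substring, string.find(substring) + 1)

span = BeautifulSoup(html, "html.parser").find("span")

# Manual slicing between the first '>' and the second '<', as in the question.
b = str(span)
manual = b[find_1st(b, ">") + 1 : find_2nd(b, "<")]

# Built-in text access; get_text(strip=True) also trims surrounding whitespace.
builtin = span.get_text(strip=True)

print(manual, "|", builtin)
```

On real pages the text often carries extra line breaks and spaces, so some whitespace cleanup is usually still needed after reading `.text`.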

import requests
from bs4 import BeautifulSoup

URL = "https://www.classcentral.com/subject/data-science"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

td_list = soup.find_all("td", {"class": "hide-on-hover fill-space relative"})
for i, td in enumerate(td_list):
    try:  # Class used when the course has more than 0 reviews
        reviews = True
        my_span = td.find_all("span", {"class": "large-down-hidden block line-tight text-4 color-gray"})[0]
    except IndexError:  # Class used when the course has 0 reviews
        reviews = False
        my_span = td.find_all("span", {"class": "medium-down-hidden text-4 color-gray italic"})[0]

    # Collapse runs of whitespace in the element text into single spaces
    output = " ".join(my_span.text.split())

    if reviews:
        print(i, "\t", output)
    else:
        print(i, "\t", output, "<-- No reviews!")
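The question also asked how to store 0 when a course has no reviews. One possible approach, sketched below, is to convert the extracted text to an integer and fall back to 0 when it contains no number. The helper name `parse_review_count` and the sample strings are hypothetical; the sketch assumes the count is the first run of digits in the text, and that any thousands separator is a comma.

```python
import re

def parse_review_count(text):
    """Return the first integer found in text, or 0 if there is none."""
    # Assumption: the count is the first run of digits, possibly with commas.
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return 0
    return int(match.group().replace(",", ""))

# Hypothetical sample strings in the style of the page excerpts above.
samples = ["24 Reviews", "No reviews yet.", "  1,204 Reviews "]
counts = [parse_review_count(s) for s in samples]
print(counts)
```

Appending `parse_review_count(output)` to a list inside the loop above should give one value per table cell visited, which is what keeps the column lengths equal when the lists are later combined into a DataFrame.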

    Wow, Krossbow, thank you so much for your help, my friend.
