Error printing HTML with BeautifulSoup

I have a simple script that requests a quiz site, selects every ul element with the class square, and prints them on screen.

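import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed value; headers was not shown in the question
url = "http://quizdomilhao.com.br/category/g1"
question_page = requests.get(url, headers=headers)
soup = BeautifulSoup(question_page.text, 'html.parser')
print(soup.find_all('ul', class_="square"))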

But when I run this code, it prints the whole HTML of the site. How can I fix this with BeautifulSoup?

1 answer

import requests
from bs4 import BeautifulSoup as bs

# Browser-like User-Agent so the site accepts the request
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'}

url = "http://quizdomilhao.com.br/category/g1"

question_page = requests.get(url, headers=headers)
question_page.encoding = 'utf-8'  # decode accented characters correctly

soup = bs(question_page.text, 'html.parser')

# Every ul with the class "square"
ul = soup.find_all('ul', {'class': 'square'})

# Flatten the li elements of all matched ul into a single list
lis = [item.find_all('li') for item in ul]
lis = [item for sublista in lis for item in sublista]

# Same for the a elements
aas = [item.find_all('a') for item in ul]
aas = [item for sublista in aas for item in sublista]

# Pair each li text with the href of the a at the same position
text_link = [[item.text, item2['href']] for item, item2 in zip(lis, aas)]

What the code does:

  1. Imports the libraries
  2. Creates a header so the site accepts the request
  3. Makes the request using requests
  4. Parses the HTML with BeautifulSoup
  5. Finds every ul with the square class
  6. Uses the result of the previous query to extract the li elements
  7. Finds the a elements within each ul
  8. Builds a list pairing each li text with its link
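
The steps above can be tried without network access on a small inline snippet. This is only a sketch: the HTML string is made up and merely imitates the structure described, and the helper name `extract_questions` is hypothetical. It also reads each link from its own li instead of zipping two separate lists, which avoids misaligned pairs if some li has no link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the structure described above.
SAMPLE_HTML = """
<ul class="square">
  <li><a href="http://example.com/q1">First question?</a></li>
  <li>No link here</li>
  <li><a href="http://example.com/q2">Second question?</a></li>
</ul>
<ul class="other"><li><a href="http://example.com/x">Ignored</a></li></ul>
"""

def extract_questions(html):
    """Return [text, href] pairs for every linked li inside a ul.square."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ul in soup.find_all("ul", class_="square"):
        for li in ul.find_all("li"):
            a = li.find("a", href=True)  # None when the li has no link
            if a is not None:
                pairs.append([li.get_text(strip=True), a["href"]])
    return pairs

for text, href in extract_questions(SAMPLE_HTML):
    print(text, href)
```

Printing the text and href of each element, rather than the result of find_all itself, is what keeps the raw markup out of the output.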

Update: accessing the answer pages

for item in text_link:
    question, link = item
    print(question)
    print(link)
    answer_page = requests.get(link, headers=headers)
    answer_page.encoding = 'utf-8'
    soup = bs(answer_page.text, 'html.parser')
    # find() returns only the first ul with the square class
    ul = soup.find('ul', {'class': 'square'})
    if ul is None:
        continue  # no such list on this page, skip it
    li = ul.find_all('li')
    # Keep only the text marked with <strong>
    answer = [entry.find('strong').text for entry in li if entry.find('strong')]
    print(''.join(answer))
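
soup.find returns None when no matching ul exists, so calling find_all on its result needs a guard. Below is a minimal sketch of the extraction step as a function that can be checked on an inline string, assuming (as in the loop above) that the answer is the text inside strong. The function name `extract_answer` and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

def extract_answer(html):
    """Join the <strong> text found in the first ul.square; '' if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul", class_="square")  # first match only, like the loop above
    if ul is None:
        return ""
    parts = []
    for li in ul.find_all("li"):
        strong = li.find("strong")
        if strong is not None:
            parts.append(strong.get_text(strip=True))
    return "".join(parts)

# Made-up markup for illustration only.
sample = '<ul class="square"><li>A</li><li><strong>B</strong></li></ul>'
print(extract_answer(sample))
print(repr(extract_answer("<p>no list here</p>")))
```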

  • Hello @Imonferrari, how do I make the accented characters display correctly?

  • @Joaroque, good morning! I added an update: question_page.encoding = 'utf-8'. Cheers!

  • @Imonferrari, so far so good! Now, how do I extract the links contained in the li? Whenever I try, I get an error! I want to save the questions and answers in a database, and to get the correct answer I have to go to each answer page.

  • @Joaroque, I made another update to resolve the link issue. Cheers!

  • @Joaroque, I don't understand. If the issues in this question have been resolved, it is better to open a new question for the new ones, so it is easier to understand what you need. What do you think? Cheers!

  • @Joaroque, I updated the code because the question page contains more than one ul with the square class. Cheers!
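
On the accent issue raised in the comments: when a server sends text without declaring a charset, requests may fall back to ISO-8859-1, and UTF-8 bytes decoded that way show up as garbled accents. The sketch below reproduces the effect with plain bytes and no network call; the sample string is made up.

```python
# Bytes as a UTF-8 server would send them (sample text is made up).
raw = "Questão número 1".encode("utf-8")

wrong = raw.decode("iso-8859-1")  # what a Latin-1 fallback produces
right = raw.decode("utf-8")       # what encoding = 'utf-8' produces

print(wrong)
print(right)
```

Setting the encoding attribute on the response before reading its text corresponds to the second decode.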
