I’m building a web scraper with Python and pandas on Windows. I collect the data from the page, build a pandas DataFrame, and then export it to an Excel spreadsheet; I’m not using any database in this case. I have two problems:
I need to collect each product’s name and price, but some products on the page have no price. The DataFrame then shifts the next product’s price onto the product with the missing price, producing wrong information. How can I fix this?
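One way to avoid the shift is to iterate over each product’s container element and look up the name and price *inside* it, so a missing price stays attached to the right product as None instead of everything sliding up. A minimal sketch, using a hypothetical container class `s-result-item` and a literal HTML snippet in place of the real page (the class names are assumptions, not confirmed from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two products, the second one has no price element.
html = """
<div class="s-result-item"><a class="a-link-normal a-text-normal">Produto A</a>
  <span class="a-price-whole">10</span><span class="a-price-fraction">99</span></div>
<div class="s-result-item"><a class="a-link-normal a-text-normal">Produto B</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
# Iterate per product container instead of collecting names and prices in
# separate global lists; a product without a price yields None rather than
# stealing the next product's price.
for item in soup.find_all(class_='s-result-item'):
    name = item.find(class_='a-link-normal a-text-normal')
    whole = item.find(class_='a-price-whole')
    cents = item.find(class_='a-price-fraction')
    rows.append({
        'Produto': name.text.strip() if name else None,
        'Preço': whole.text if whole else None,
        'Centavos': cents.text if cents else None,
    })

print(rows)
```

With this structure the list of dicts can be passed straight to `pd.DataFrame(rows)`, and missing prices show up as NaN instead of misaligned values.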
On the same page, the price is split from the cents, each in a different class in the HTML. I can retrieve both values, but how do I concatenate them? Having one column for the price and another for the cents is very awkward.
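Once both pieces are in the DataFrame, they can be joined either as a display string or as a proper numeric column. A small sketch with made-up data (column names follow the question; None marks a product with no price):

```python
import pandas as pd

df = pd.DataFrame({
    'Produto': ['Produto A', 'Produto B'],
    'Preço': ['10', None],
    'Centavos': ['99', None],
})

# Option 1: string concatenation with a comma (Brazilian price format).
# Rows where either part is missing come out as NaN.
df['Preço completo'] = df['Preço'].str.cat(df['Centavos'], sep=',')

# Option 2: a numeric column, easier for sorting and arithmetic.
df['Preço num'] = (pd.to_numeric(df['Preço'], errors='coerce')
                   + pd.to_numeric(df['Centavos'], errors='coerce') / 100)

print(df)
```

`errors='coerce'` turns missing or malformed values into NaN instead of raising, so the products without a price survive the conversion.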
Here is part of the code I’m using:
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.s_2'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

review_text = []
review_text_elem = soup.find_all(class_='a-link-normal a-text-normal')
for item in review_text_elem:
    review_text.append(item.text)

user_name = []
user_name_elem = soup.find_all(class_='a-price-whole')
for item in user_name_elem:
    user_name.append(item.text)

review_price = []
review_price_elem = soup.find_all(class_='a-price-fraction')
for item in review_price_elem:
    review_price.append(item.text)
print(review_price)

final_array = []
for text, user, cents in zip(review_text, user_name, review_price):
    final_array.append({'Produto': text.replace("\n", ""), 'Preço': user, 'Centavos': cents})

col = 'Produto Preço Centavos'.split()
df = pd.DataFrame(final_array, columns=col)
print(df)
df.to_excel('amazonpanda4.xlsx', index=False)
```
You can format the code by placing three backticks before it and three after. In Python, formatting is essential.
– Paulo Marques
Hello Paulo, thank you for your help!! Could you give me an example of what this solution would look like, so I can understand it better? Thank you!!!
– Daniela Martinez
Daniela, good evening! Has the question been solved already? If not, could you post the page you need to extract the data from?
– lmonferrari