Difficulty removing Child, Python

2

Good morning, friends. I'm having trouble removing a child element. I wrote code to collect the prices of all the products on a site (a single product listing, not one page per product). That part works well. However, when a product goes on sale, the site shows two prices, the old one and the new discounted one, and my code pulls both. I'm not interested in the old price, so I want to ignore it while collecting the data, but I haven't managed to. An example of the page source:

<div class="result-actions">
  <span>
    $ 1,98
  </span>
</div>
<div class="result-actions">
  <span>
    <small class="price-before">
      $ 56,70
    </small>
    <span class="price-now">
      $ 39,60
    </span>
  </span>
</div>

Each "result-actions" div represents one product. It was suggested that I select "price-now" instead, but then my code would miss the first product in the example: it is not on sale, so it has no element with that class. Here is my code, which tries to remove the child element, without success:

with open('Lista.csv') as example_file:
  example_reader = csv.reader(example_file)
  for row in example_reader:
      driver.get(row[0])
      html = driver.page_source
      bs = BeautifulSoup(html, 'html.parser')
      precosLista = bs.findAll('div',{'class':'result-actions'})
      f = open(acha_proximo_nome('Arquivo.csv'), 'wt+', newline='')
      writer = csv.writer(f)

      try:
          for precos in precosLista:
              print(precos.get_text())
              csvPreco = []
              csvPreco.append(clean_up_text(precos.get_text()))
              js = "var aa = document.getElementsByClassName('price-before')[0];aa.parentNode.removeChild(aa)"
              driver.execute_script(js)
              writer.writerow(csvPreco)

      finally:
          f.close()

Without these two lines:

js = "var aa = document.getElementsByClassName('price-before')[0];aa.parentNode.removeChild(aa)"
driver.execute_script(js)

my code runs fine but, as I said, it collects everything, including the old price that I don't want. Does anyone have an idea how I can fix this?

1 answer

2


Since you are using BeautifulSoup, you can use the replace_with method that every node has. It replaces the node in the tree with a string or another tag that you supply. In the example below, I replace each old-price tag with an empty string:

import bs4

html = '''<div class="result-actions">
<span>
  $ 1,98
</span>
</div>
<div class="result-actions">
<span>
  <small class="price-before">
    $ 56,70
  </small>
  <span class="price-now">
    $ 39,60
  </span>
</span>
</div>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(html, 'html.parser')
prices = soup.find_all('div', {'class': 'result-actions'})

for price in prices:
    # Remove the old price
    smalls = price.find_all('small')
    for small in smalls:
        small.replace_with('')

    # The first span now holds only the current price
    value = price.find_all('span')[0].text.strip()
    print(value)

For this HTML, the code prints only the current prices:

> $ 1,98
> $ 39,60
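
As an alternative that leaves the parsed tree untouched, here is a minimal sketch that prefers the "price-now" span and falls back to the product's first span when the item is not on sale. It assumes the markup always has one of the two shapes shown in the question; the names extract_current_prices and sample are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_current_prices(html):
    """Return one price string per product block, preferring the sale price."""
    soup = BeautifulSoup(html, 'html.parser')
    prices = []
    for product in soup.find_all('div', class_='result-actions'):
        # Products on sale carry a "price-now" span; use it when present.
        now = product.find('span', class_='price-now')
        # Otherwise fall back to the first span, which holds the only price.
        node = now if now is not None else product.find('span')
        if node is not None:
            prices.append(node.get_text(strip=True))
    return prices


# Hypothetical sample mirroring the markup from the question.
sample = '''<div class="result-actions">
<span>
  $ 1,98
</span>
</div>
<div class="result-actions">
<span>
  <small class="price-before">
    $ 56,70
  </small>
  <span class="price-now">
    $ 39,60
  </span>
</span>
</div>'''

print(extract_current_prices(sample))
```

In the loop from the question, the same function could probably be fed driver.page_source, with each returned string written out through writer.writerow.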
  • 1

    Dude, perfect! It worked great! And it's such a simple solution, yet it never even crossed my mind! hahaha Thank you, really! You solved 2 days of head-scratching :p
