Python Beautifulsoup remove tag within tag

Question

Asked 4 years, 4 months ago

I'm having trouble capturing text while scraping a page.

Basically the beginning of my code is as follows:

url0 = 'https://www.service.bund.de/Content/DE/Ausschreibungen/Suche/Formular.html?nn=4641482&cl2Addresses_Adresse_State=nordrhein-westfalen&resultsPerPage=100'

r = requests.get(url0,headers={'User-Agent': 'Mozilla/5.0'}) 
soup = BeautifulSoup(r.text, 'html.parser')
content = soup.find('ul', {"class": "result-list"})
links = content.find_all('a')

Each row of the results table on the site is an element of the "links" list. From each element I want the first column (Ausschreibung), which sits inside the h3 tag. The catch is that this h3 tag has a second tag embedded in it:

# Using one element of links as an example:
y = links[0]

b = y.find('h3')
b
# output: '<h3><em>Ausschreibung</em>Erneuerung SDRL 3</h3>'

The problem is that when I get the text of this tag on my machine (Windows 10), the text of the embedded tag is included as well, and the result contains stray \xad characters:

c = y.find('h3').text
c
# Output: 'AusschreibungEr\xadneue\xadrung SDRL 3'

Using get_text() gives the same result.
What I want from object b is "Er-Neue-Rung SDRL 3". How can I either get everything as text ("Ausschreibung Er-Neue-Rung SDRL 3") or delete the em tag inside b so that only "Er-Neue-Rung SDRL 3" remains?

1 answer

by avqpereira • 46 points · Answer 1 · 2021-03-12T15:18:36+00:00

A regex is one way to do it:

from bs4 import BeautifulSoup
import requests
import re

url0 = 'https://www.service.bund.de/Content/DE/Ausschreibungen/Suche/Formular.html?nn=4641482&cl2Addresses_Adresse_State=nordrhein-westfalen&resultsPerPage=100'

r = requests.get(url0,headers={'User-Agent': 'Mozilla/5.0'}) 
soup = BeautifulSoup(r.text, 'html.parser')
content = soup.find('ul', {"class": "result-list"})
links = content.find_all('a')

y = links[0]

b = y.find('h3')

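# Text we want to remove (the label inside <em>)
em_tag_text = b.find('em').get_text()

# Messy text, which starts with the text we want to remove
messy_text = b.get_text()

# Clean up: strip the label from the start of the string.
# re.escape keeps any regex metacharacters in the label literal.
clean_text = re.sub(rf"^{re.escape(em_tag_text)}", '', messy_text)
clean_text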
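
As an alternative to the regex, the em tag can be removed from the parsed tree itself, which is what the question title asks for. The sketch below is only an illustration: it runs on the h3 markup shown in the question rather than on the live page, and heading_text is a hypothetical helper name, not part of BeautifulSoup. It also assumes the \xad characters in the output are soft hyphens (U+00AD), which are invisible hyphenation hints and can be dropped if plain text is wanted.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question's output; \xad is U+00AD (soft hyphen).
html = '<h3><em>Ausschreibung</em>Er\xadneue\xadrung SDRL 3</h3>'


def heading_text(h3, drop_label=True):
    """Return the text of an <h3>, optionally without its <em> label.

    Hypothetical helper for illustration only.
    """
    # Re-parse the tag so the caller's tree is left unmodified.
    h3 = BeautifulSoup(str(h3), 'html.parser').h3
    if drop_label:
        # decompose() removes the tag and its contents from the tree.
        for em in h3.find_all('em'):
            em.decompose()
    # Join the remaining text nodes with a space, then drop soft hyphens.
    text = h3.get_text(' ', strip=True)
    return text.replace('\xad', '')


b = BeautifulSoup(html, 'html.parser').find('h3')
title = heading_text(b)                   # label removed
full = heading_text(b, drop_label=False)  # label kept, separated by a space
print(title)
print(full)
```

Passing a separator to get_text() covers the other half of the question: the label and the title no longer run together when the em tag is kept.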