Python BeautifulSoup: remove a tag nested within another tag

I'm having a problem capturing text while scraping a page.

The beginning of my code is as follows:

from bs4 import BeautifulSoup
import requests

url0 = 'https://www.service.bund.de/Content/DE/Ausschreibungen/Suche/Formular.html?nn=4641482&cl2Addresses_Adresse_State=nordrhein-westfalen&resultsPerPage=100'

r = requests.get(url0, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')
content = soup.find('ul', {"class": "result-list"})
links = content.find_all('a')

Each row of the table on the site I am scraping is one element of the "links" list. From each element I want the first column (Ausschreibung), which sits inside that element's h3 tag. The catch is that this h3 tag has a second tag embedded in it:

# Using one element of links as an example:
y = links[0]

b = y.find('h3')
b
# output: '<h3><em>Ausschreibung</em>Er­neue­rung SDRL 3</h3>' 

The problem is that when I get the text of this tag, my machine (Windows 10) also "reads" the embedded tag and renders the text incorrectly:

c = y.find('h3').text
c
# Output: 'AusschreibungEr\xadneue\xadrung SDRL 3'

Using get_text() gives the same result.
What interests me inside object b is "Er-Neue-Rung SDRL 3". How can I either get everything as text ("Ausschreibung Er-Neue-Rung SDRL 3") or delete the 'em' tag inside b so that only the text "Er-Neue-Rung SDRL 3" remains?
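
The behaviour can be reproduced offline using only the h3 markup from the output above. This is a minimal sketch, assuming the invisible characters in the page are soft hyphens (U+00AD, which a repr shows as \xad); the sample string is hand-built for illustration and is not fetched from the site.

```python
from bs4 import BeautifulSoup

# Hand-built sample mirroring the <h3> shown above (an assumption, not live data).
# "\u00ad" is the soft hyphen, an invisible optional line-break point.
html = "<h3><em>Ausschreibung</em>Er\u00adneue\u00adrung SDRL 3</h3>"

h3 = BeautifulSoup(html, "html.parser").find("h3")

# .text concatenates the text of every descendant, including <em>,
# and keeps the soft hyphens exactly as they appear in the markup.
text = h3.text
print(repr(text))
```

Seen this way, the \xad sequences are part of the page's text rather than something Windows adds, and the missing space comes from the em and the title text being joined with no separator.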

1 answer

A regex is one way to do it:

from bs4 import BeautifulSoup
import requests
import re

url0 = 'https://www.service.bund.de/Content/DE/Ausschreibungen/Suche/Formular.html?nn=4641482&cl2Addresses_Adresse_State=nordrhein-westfalen&resultsPerPage=100'

r = requests.get(url0,headers={'User-Agent': 'Mozilla/5.0'}) 
soup = BeautifulSoup(r.text, 'html.parser')
content = soup.find('ul', {"class": "result-list"})
links = content.find_all('a')

y = links[0]

b = y.find('h3')

# Text we want to remove
em_tag_text = b.find('em').get_text()

# Messy text, which starts with the text we want to remove
messy_text = b.get_text()

# Clean up: escape the prefix so any regex metacharacters in it match literally
clean_text = re.sub(rf"^{re.escape(em_tag_text)}", '', messy_text)
clean_text
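
A regex-free alternative is to work on the parsed tree instead of the string. The sketch below is not the approach above, just one possible variant: it uses BeautifulSoup's get_text(separator=...) and Tag.decompose(), and it assumes the unwanted characters are soft hyphens (U+00AD). The sample markup is hand-built from the question's output so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hand-built sample mirroring the <h3> from the question (assumption:
# the invisible characters are soft hyphens, U+00AD).
html = "<h3><em>Ausschreibung</em>Er\u00adneue\u00adrung SDRL 3</h3>"
h3 = BeautifulSoup(html, "html.parser").find("h3")

SOFT_HYPHEN = "\u00ad"

# Option 1: keep both parts, joined by a space, and drop the soft hyphens.
full_text = h3.get_text(separator=" ").replace(SOFT_HYPHEN, "")

# Option 2: remove the <em> tag from the tree, then read what is left.
# Note that decompose() modifies the parsed document in place.
em = h3.find("em")
if em is not None:
    em.decompose()
title = h3.get_text(strip=True).replace(SOFT_HYPHEN, "")

print(full_text)  # Ausschreibung Erneuerung SDRL 3
print(title)      # Erneuerung SDRL 3
```

If the hyphenation points should stay visible, replacing the soft hyphen with "-" instead of an empty string is a small change. Because decompose() alters the tree, read anything that still needs the em text before calling it.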
