How can I remove the improper text while keeping the title, paragraph, and link?
Code:
<html>
<body>
<header>
<h1>
Titulo
</h1>
</header>
<p>Hello World</p>
Texto Indevido
<a href="">link</a>
</body>
</html>
Expected:
<html>
<body>
<header>
<h1>
Titulo
</h1>
</header>
<p>Hello World</p>
<a href="">link</a>
</body>
</html>
My attempt to get only the improper text:
from bs4 import BeautifulSoup

soup = BeautifulSoup(codigo_html, "html.parser")
body = soup.body.text
print(body)
Output:
Titulo
Hello World
Texto Indevido
link
Why not access the text field of the title only, instead of the body? title = soup.body.header.h1.text.strip()
– Woss
This code is just a simplified example. On the real page I need to delete the improper text, but the page contains several tags inside the body and the header. My problem is deleting the improper text that sits directly within the body while keeping the H1 tag.
– Igor Gabriel
But if you only want what is in <h1>, why take everything else and then delete it? The example I gave reads only the value in <h1>, as you need; since nothing else comes along, there is nothing to delete.
– Woss
I have tried to clarify my problem: in this new example, taking only the H1 would make me lose the other information, while taking the whole body would include the text I don't want to keep.
– Igor Gabriel
Okay, now it makes more sense. The question now is: what conditions must a piece of text meet to be considered unwanted?
– Woss
Text is considered unwanted when it sits directly inside the body, outside of any of the body's child tags.
– Igor Gabriel
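Given that condition, one possible approach is to walk the direct children of the body and drop every text node that is not wrapped in a tag. The sketch below is only an illustration, not a confirmed answer from this thread: the helper name remove_loose_text is hypothetical, and it assumes the html.parser backend and the sample markup from the question (with the closing </html> tag). Whitespace-only nodes are left alone so the layout between tags is not disturbed more than necessary. Note that HTML comments placed directly in the body are also NavigableString instances in Beautiful Soup, so this sketch would remove them as well.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup from the question.
codigo_html = """<html>
<body>
<header>
<h1>
Titulo
</h1>
</header>
<p>Hello World</p>
Texto Indevido
<a href="">link</a>
</body>
</html>"""


def remove_loose_text(html):
    """Hypothetical helper: drop text placed directly inside <body>."""
    soup = BeautifulSoup(html, "html.parser")
    # Copy the children into a list first, because the tree is
    # modified while we loop over it.
    for child in list(soup.body.children):
        # Direct text nodes are NavigableString objects; tags are not.
        # Whitespace-only nodes (line breaks between tags) are kept.
        if isinstance(child, NavigableString) and child.strip():
            child.extract()
    return soup


soup = remove_loose_text(codigo_html)
print(soup)
```

Text nested inside the header, the paragraph, and the link is untouched, because only the immediate children of the body are inspected. If the real page has loose text inside other containers too, the same loop could be applied to each of those containers.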