Remove text from a parent element while keeping its children with BS4 in Python

-1

How can I remove the unwanted text while keeping the title, paragraph, and link?

Code:

<html>
  <body>
   <header>
    <h1>
     Titulo
    </h1>
   </header>
   <p>Hello World</p>
   Texto Indevido
   <a href="">link</a>
  </body>
</html>

Expected:

<html>
  <body>
   <header>
    <h1>
     Titulo
    </h1>
   </header>
   <p>Hello World</p>
   <a href="">link</a>
  </body>
</html>

What I tried, hoping to get only the unwanted text:

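from bs4 import BeautifulSoup

soup = BeautifulSoup(codigo_html, 'html.parser')  # name the parser explicitly
body = soup.body.text  # all text inside <body>, including nested tags
print(body)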

Result:

  Titulo
  Hello World
  Texto Indevido
  link
  • Why not access only the text of the title instead of the whole body? title = soup.body.header.h1.text.strip()

  • This code is just a simplified example. On the real page I need to delete the unwanted text, but the page has several tags inside the body and the header. My problem is deleting the unwanted text that sits directly inside the body tag while keeping the H1 tag.

  • But if you only want what is in <h1>, why fetch everything else and then delete it? The example I gave takes only the value in <h1>, which is what you need; since nothing else comes with it, there is nothing to delete.

  • I tried to clarify my problem. In this new example, taking only the H1 would make me lose the other information. If I take the body, I get the text I don't want to keep in the code.

  • Okay, now it's clearer. The question now is: what conditions must a piece of text meet to be considered unwanted?

  • Text is considered unwanted when it is not inside any child tag of body.
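
The rule in the last comment can be sketched in code: a piece of text is unwanted when it is a direct child of body rather than being wrapped in a child tag. The snippet below is only an illustration; the helper name stray_text and the use of the built-in html.parser backend are assumptions, not something from the discussion.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

codigo_html = """<html>
  <body>
   <header>
    <h1>
     Titulo
    </h1>
   </header>
   <p>Hello World</p>
   Texto Indevido
   <a href="">link</a>
  </body>
</html>"""


def stray_text(soup):
    """Return the non-blank text nodes that are direct children of <body>."""
    found = []
    # .children yields only direct children, so text nested inside
    # <h1>, <p>, or <a> is never visited here.
    for child in soup.body.children:
        if isinstance(child, NavigableString) and child.strip():
            found.append(child.strip())
    return found


soup = BeautifulSoup(codigo_html, "html.parser")
print(stray_text(soup))
```

The whitespace between tags is also a text node, which is why blank strings are skipped.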


1 answer

0


A simple approach is to collect all the tags that are direct children of body and build a new HTML document from them. Since the unwanted text is not a tag, it is left out of the new markup.

That’s all it takes:

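from bs4.element import Tag

codigo_html = ''
# Walk the direct children of <body>; keep tags and skip bare text nodes.
for child in soup.find('body'):
    if isinstance(child, Tag):
        codigo_html += str(child)

# Parse the collected fragment again to get a complete document.
soup = BeautifulSoup(codigo_html, 'html5lib')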

html5lib rebuilds the soup with the complete HTML structure, resulting in:

<html>
  <body>
   <header>
    <h1>
     Titulo
    </h1>
   </header>
   <p>Hello World</p>
   <a href="">link</a>
  </body>
</html>
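
As an alternative to rebuilding the document, the unwanted text could probably be removed in place. This is a minimal sketch, not the method from the answer above: the function name remove_stray_text is hypothetical, the input is a compact version of the question's HTML, and it assumes the built-in html.parser backend.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Compact version of the HTML from the question (assumed input).
codigo_html = (
    "<html><body><header><h1>Titulo</h1></header>"
    "<p>Hello World</p>Texto Indevido<a href=\"\">link</a></body></html>"
)


def remove_stray_text(body):
    """Remove non-blank text nodes that are direct children of `body`."""
    # Copy the children into a list first: extract() mutates the tree,
    # which would otherwise disturb the iteration.
    for child in list(body.children):
        if isinstance(child, NavigableString) and child.strip():
            child.extract()


soup = BeautifulSoup(codigo_html, "html.parser")
remove_stray_text(soup.body)
print(soup)
```

Because the tree is edited rather than rebuilt, the tags inside body and everything outside it stay as they were.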
