Python: Cleaning HTML code

Using Python, what would be an easy way to strip the tag attributes generated by Microsoft tools?

Initially I'm trying to do the transformation with Beautiful Soup, but I'm open to all suggestions! :D

From this:

<p style="text-decoration: underline;">Hello <strong>World!</strong></p>
<p style="color: #228;">How are you today?</p>
<table style="width: 300px; text-align: center;" border="1" cellpadding="5">
<tr>
<th width="75"><strong><em>Name</em></strong></th>
<th colspan="2"><span style="font-weight: bold;">Telephone</span></th>
</tr>
<tr>
<td>John</td>
<td><a style="color: #F00; font-weight: bold;" href="tel:0123456785">0123 456 785</a></td>
<td><img width="25" height="30" src="images/check.gif" alt="checked" /></td>
</tr>
</table>

To this:

<p>Hello <strong>World!</strong></p>
<p>How are you today?</p>
<table border="1" cellpadding="5">
<tr>
<th width="75"><strong><em>Name</em></strong></th>
<th colspan="2"><span>Telephone</span></th>
</tr>
<tr>
<td>John</td>
<td><a href="tel:0123456785">0123 456 785</a></td>
<td><img width="25" height="30" src="images/check.gif" alt="checked" /></td>
</tr>
</table>
  • Could you post your attempt with Beautiful Soup? By the way, is all you need to remove the style attributes?

  • Yes. Remove all of them.

1 answer


You can use re.sub().

Example that removes style attributes:

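import re

# Replace the placeholder with the HTML you want to clean.
html_string = "[put your HTML here]"
# Drop every style="..." attribute together with its leading space.
html_no_style = re.sub(r' style="[^"]*"', '', html_string)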

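It is important to test this against several different HTML files to find out whether the regex needs refining, for example to handle single-quoted values, uppercase attribute names, or spaces around the equals sign.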
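
As a rough illustration of what such a refinement might look like, here is a sketch that broadens the pattern to cover single quotes, mixed case, and spacing around `=`. The names `STYLE_ATTR` and `strip_style` are made up for this example, and the sample inputs are only assumptions about what Microsoft tools may emit. It is still a regex, not an HTML parser, so it would also remove text that merely looks like a style attribute outside a tag.

```python
import re

# Hypothetical pattern: whitespace, "style", optional spaces around "=",
# then a double-quoted or single-quoted value (which may be empty).
STYLE_ATTR = re.compile(
    r'''\s+style\s*=\s*(?:"[^"]*"|'[^']*')''',
    re.IGNORECASE,
)


def strip_style(html: str) -> str:
    """Remove quoted style attributes from an HTML string."""
    return STYLE_ATTR.sub("", html)


# A few assumed variations to try the pattern against.
samples = [
    '<p style="color: #228;">How are you today?</p>',
    "<p STYLE='color: red'>Hi</p>",
    '<span style="">Telephone</span>',
    '<a style = "color: #F00;" href="tel:0123456785">0123 456 785</a>',
]

for sample in samples:
    print(strip_style(sample))
```

Unquoted values such as `style=color:red` are deliberately not handled here. If inputs like that show up, a real parser such as Beautiful Soup is probably the safer route.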
