Python: how to edit tags with bs4

My HTML is heavily polluted with style attributes on almost every tag, plus unnecessary <font> and <span> tags.

How can I use BeautifulSoup to remove only the style attribute from <p> tags, and to remove the <font> and <span> tags without removing their contents, while preserving the other parent and child tags?

I need to walk through all the elements in the HTML automatically and have not managed to yet, but I have already managed to remove the attribute from a single element with the code below:

print(soup.span)
print(soup.span.name)
del soup.span['style']

The HTML follows:

<html xmlns="http://www.w3.org/TR/REC-html40">

<head>
	<meta name="GENERATOR" content="Microsoft FrontPage 6.0">









<style>
<!--
h1
	{margin-top:6.0pt;
	margin-right:0cm;
	margin-bottom:0cm;
	margin-left:0cm;
	margin-bottom:.0001pt;
	text-align:justify;
	text-indent:76.85pt;
	page-break-after:avoid;
	tab-stops:99.75pt;
	font-size:14.0pt;
	font-family:"Times New Roman";
	}
 table.MsoNormalTable
	{mso-style-parent:"";
	font-size:10.0pt;
	font-family:"Times New Roman"}
div.Section1
	{page:Section1;}
h4
	{margin-bottom:.0001pt;
	text-align:center;
	page-break-after:avoid;
	tab-stops:70.9pt;
	font-size:12.0pt;
	font-family:"Times New Roman";
	font-weight:bold; margin-left:0cm; margin-right:0cm; margin-top:0cm}
h2
	{margin-bottom:.0001pt;
	text-align:center;
	page-break-after:avoid;
	font-size:10.0pt;
	font-family:Arial;
	font-weight:bold;
	margin-left:0cm; margin-right:0cm; margin-top:0cm}
h3
	{margin-bottom:.0001pt;
	text-align:center;
	text-indent:17.85pt;
	page-break-after:avoid;
	font-size:10.0pt;
	font-family:Arial;
	font-weight:bold; margin-left:0cm; margin-right:0cm; margin-top:0cm}
div.Section2
	{page:Section2;}
div.Section3
	{page:Section3;}
h6
	{margin-bottom:.0001pt;
	text-align:center;
	page-break-after:avoid;
	font-size:12.0pt;
	font-family:"CG Times";
	margin-left:0pt; margin-right:0pt; margin-top:0pt}
h5
	{margin-bottom:.0001pt;
	page-break-after:avoid;
	font-size:12.0pt;
	font-family:"Times New Roman";
	font-weight:normal; margin-left:0cm; margin-right:0cm; margin-top:0cm}
span.msoDel
	{mso-style-name:"";
	text-decoration:line-through;
	color:red}
span.msoIns
	{mso-style-name:"";
	text-decoration:underline;
	text-underline:single}
span.Ttulo1Car
	{font-family:Arial;
	font-weight:bold;
	}
span.CaracteresdeNotadeRodap
	{vertical-align:super}
span.WW-Refdenotaderodap1234
	{mso-style-parent:"";
	vertical-align:super}
span.msoins0
	{}
span.MsoFootnoteReference
	{vertical-align:super;}
div.Section4
	{page:Section4;}
span.MsoCommentReference
	{}
span.Hiperlink
	{mso-style-parent:"";
	color:blue;
	text-decoration:underline;
	text-underline:single}
span.txtterm1
	{font-family:"Times New Roman";
	color:black;
	font-weight:bold}
span.Absatz-Standardschriftart
	{}
span.apple-style-span
	{}
span.apple-converted-space
	{}
div.Section5
	{page:Section5;}
div.Section6
	{page:Section6;}
div.Section7
	{page:Section7;}
div.Section8
	{page:Section8;}
div.section1
	{margin-right:0cm;
	margin-left:0cm;
	font-size:8.0pt;
	font-family:"Arial Unicode MS";
	}
span.msoChangeProp
	{mso-style-name:"";
	color:black}
span.texto
	{}
span.highlightedsearchterm
	{}
span.MsoHyperlink
	{color:blue;
	text-decoration:underline;
	text-underline:single;}
div.Section9
	{page:Section9;}
span.MsoHyperlinkFollowed
	{mso-style-parent:"";
	color:purple;
	text-decoration:underline;
	text-underline:single;}
span.WW-Fontepargpadro
	{}
span.Fontepargpadro2
	{}
span.Internetlink
	{mso-style-parent:"";
	color:navy;
	text-underline:#000000;
	text-decoration:underline;
	text-underline:single}
span.Refdecomentrio1
	{}
span.Refdecomentrio2
	{}
span.font0020style31char
	{font-family:"Times New Roman","serif";
	}
span.style10char
	{font-family:"Times New Roman","serif";
	}
span.centralizadochar
	{font-family:"Times New Roman","serif";
	}
span.texto0020normalchar
	{font-family:"Times New Roman","serif";
	}
span.normalchar
	{font-family:"Times New Roman","serif";
	}
span.estilochar
	{font-family:"Times New Roman","serif";
	}
span.style21char
	{font-family:"Times New Roman","serif";
	}
span.style18char
	{font-family:"Times New Roman","serif";
	}
span.style15char
	{font-family:"Times New Roman","serif";
	}
span.mw-headline
	{}
span.hlhilite
	{}
span.field1
	{mso-style-parent:"";
	font-family:"Verdana","sans-serif";
	color:black;
	border:1.0pt solid windowtext;
	padding:0cm;
	background:white}
span.hps
	{font-family:"Times New Roman","serif";
	}
span.themebody
	{font-family:"Times New Roman","serif";
	}
span.FootnoteSymbol
	{font-family:"Times New Roman","serif";
	position:relative;
	top:0pt;
	vertical-align:super}
span.nfase1
	{mso-style-parent:"";
	font-family:"Lucida Grande","serif";
	color:black;
	}
span.Forte1
	{mso-style-parent:"";
	font-family:"Lucida Grande","serif";
	color:black;
	font-weight:bold}
span.texto8
	{font-family:"Times New Roman","serif";
	}
span.bumpedfont15
	{mso-style-parent:"";
	color:black;
	}
span.Heading4Char
	{mso-style-parent:"";
	font-family:"Verdana","sans-serif";
	font-weight:bold}
span.atn
	{mso-style-parent:"";
	font-family:"Times New Roman","serif";
	}
span.longtext
	{mso-style-parent:"";
	font-family:"Times New Roman","serif";
	}
span.linkdestaque
	{}
span.Ttulo1Char
	{font-family:"Cambria","serif";
	font-weight:bold}
span.Fontepargpadro1
	{}
span.MsoBookTitle
	{font-variant:small-caps;
	letter-spacing:.25pt;
	font-weight:bold;
	}
span.doltraduztrad
	{font-family:"Times New Roman","serif";
	}
span.MquinadeescribirHTML
	{mso-style-parent:"";
	font-family:"Courier New";
	}
table.MsoTableGrid
	{border:1.0pt solid windowtext;
	font-size:10.0pt;
	font-family:"Calibri","sans-serif";
	}
span.MsoSubtleEmphasis
	{font-family:"Times New Roman","serif";
	color:#404040;
	font-style:italic}
span.nfaseSutil1
	{mso-style-parent:"";
	color:gray;
	font-style:italic;
	}
div.WordSection1
	{page:WordSection1;}
span.CharacterStyle4
	{}
span.highlight
	{}
span.StrongEmphasis
	{mso-style-parent:"";
	font-weight:bold;
	}
span.scayt-misspell-word
	{mso-style-parent:"";
	font-family:"Times New Roman","serif";
	}
div.WordSection2
	{page:WordSection2;}
div.WordSection3
	{page:WordSection3;}
div.WordSection4
	{page:WordSection4;}
div.WordSection5
	{page:WordSection5;}
div.WordSection6
	{page:WordSection6;}
div.WordSection7
	{page:WordSection7;}
div.WordSection8
	{page:WordSection8;}
div.WordSection9
	{page:WordSection9;}
-->
</style>
<title>D9255</title>
</head>

<body id='view' style="text-align: center">
<div align="center"><center>

<table border="0" cellpadding="0" cellspacing="0" width="70%">
  <tr>
    <td width="14%">
	<p align="center" style="margin-top: 13px; margin-bottom: 13px">
	<font SIZE="2" face="Arial">
	<img SRC="../../../_Ato2007-2010/2008/Decreto/Image4.gif" WIDTH="76" HEIGHT="82"></font></td>
    <td width="86%">
	<p align="center" style="margin-top: 13px; margin-bottom: 13px"><font color="#808000" face="Arial"><strong><big><big>
	Presidência da República</big></big><br>
    <big>Casa Civil<br>
    </big>Subchefia para Assuntos Jurídicos</strong></font></td>
  </tr>
</table>
</center></div>

<blockquote>
	<p class='epigrafe' style="margin-top: 20px; margin-bottom: 20px">
	<font
face="Arial" color="#000080"><small><strong>
	<a href="http://legislacao.planalto.gov.br/legisla/legislacao.nsf/Viw_Identificacao/DEC%209.255-2017?OpenDocument">
	<font color="#000080">
	DECRETO Nº 9.255, DE&nbsp;29 DE DEZEMBRO DE 2017</font></a></strong></small></font></p>
</blockquote>

<table border="0" cellpadding="0" cellspacing="0" width="100%">
  <tr>
    <td width="51%">

	<font face="Arial" size="2"><span style="color: black"><a href="#art2">
	Vigência</a></span></font></td>
    <td width="49%">

<p align="justify">

<font FACE="Arial" SIZE="2">
<span style="color: #800000">Regulamenta a <a href="../../2015/Lei/L13152.htm">
<font color="#800000">Lei n<s>º</s> 13.152, de 29 de julho
de 2015</font></a>, que dispõe sobre o valor do salário mínimo e a sua política de
valorização de longo prazo.</span></font></td>
  </tr>
</table>

<font FACE="Arial" SIZE="2">
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><b><span style="color: black">O PRESIDENTE DA
REPÚBLICA</span></b><span style="color: black">, no uso da atribuição que lhe
confere o art. 84, <b>caput</b>, inciso IV, da Constituição, e tendo em vista o
disposto no art. 2<s>º</s> da Lei n<s>º</s> 13.152, de 29 de julho de 2015,</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><span style="color: black">&nbsp;</span><b><span style="color: black">DECRETA</span></b><span style="color: black">:</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><span style="color: black">&nbsp;<a name="art1"></a>Art. 1<s>º</s>&nbsp; A partir
de 1<s>º</s> de janeiro de 2018, o salário mínimo será de R$ 954,00 (novecentos
e cinquenta e quatro reais).</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><span style="color: black">&nbsp;Parágrafo único.&nbsp; Em
virtude do disposto no <b>caput</b>, o valor diário do salário mínimo
corresponderá a R$ 31,80 (trinta e um reais e oitenta centavos) e o valor
horário, a R$ 4,34 (quatro reais e trinta e quatro centavos).</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><span style="color: black">&nbsp;<a name="art2"></a>Art. 2<s>º</s>&nbsp; Este
Decreto entra em vigor em 1<s>º</s> de janeiro de 2018.</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify;text-indent:1.0cm"><span style="color: black">&nbsp;Brasília, 29 de dezembro
de 2017; 196<s>º</s> da Independência e 129<s>º</s> da República.</span></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify"><span style="color: black">MICHEL TEMER<br>
</span><i><span style="color: black">Eduardo Refinetti Guardia<br>
Esteves Pedro Colnago Junior<br>
Helton Yomura</span></i></p>
<p style="margin-bottom:0cm;margin-bottom:.0001pt;text-align:
justify"><font color="#FF0000">Este texto não substitui o publicado no DOU de
29.12.2017 - Edição extra &quot;D&quot;</font></p>
<p><font color="#FF0000" face="Arial" size="2">*</font></p>
</font>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>

</body>
</html>

1 answer

I arrived at this solution and am sharing it to spread the knowledge.

import os
import re
from bs4 import BeautifulSoup

__author__ = '@britodfbr'

with open(os.path.abspath('../data/D9255.htm')) as filein, \
        open(os.path.abspath('../fhtml/D9255.html'), 'wb') as fileout:
    conteudo = filein.read()
    conteudo = re.sub('\n', ' ', conteudo)
    soup = BeautifulSoup(conteudo, 'html.parser')

    # collect every tag that has a style attribute (all tags, not only <p>)
    container = soup.select('[style]')

    # walk the container, removing the style attribute from each tag
    for i in container:
        del i['style']

    # remove the <style> element
    soup.select('style')[0].decompose()

    # remove the <meta name="GENERATOR"> element
    soup.select('meta[name$="GENERATOR"]')[0].decompose()

    # create a new <link> tag pointing at the stylesheet
    meta = soup.new_tag('link', type="text/css", rel="stylesheet", href="http://www.planalto.gov.br/ccivil_03/css/legis_3.css")

    # append the new tag to <head>
    soup.head.append(meta)

    # refill the container with the tags to strip from the tree (their contents are kept)
    container = soup.select('span')
    container += soup.select('s')
    container += soup.select('font')
    # print(len(container))
    for i in container:
        i.unwrap()

    # write the new content to disk
    fileout.write(soup.prettify(encoding='iso8859-1'))

Final result:

<html xmlns="http://www.w3.org/TR/REC-html40">
 <head>
  <title>
   D9255
  </title>
  <link href="http://www.planalto.gov.br/ccivil_03/css/legis_3.css" rel="stylesheet" type="text/css"/>
 </head>
 <body id="view">
  <div align="center">
   <center>
    <table border="0" cellpadding="0" cellspacing="0" width="70%">
     <tr>
      <td width="14%">
       <p align="center">
        <img height="82" src="../../../_Ato2007-2010/2008/Decreto/Image4.gif" width="76"/>
       </p>
      </td>
      <td width="86%">
       <p align="center">
        <strong>
         <big>
          <big>
           Presidência da República
          </big>
         </big>
         <br>
          <big>
           Casa Civil
           <br>
           </br>
          </big>
          Subchefia para Assuntos Jurídicos
         </br>
        </strong>
       </p>
      </td>
     </tr>
    </table>
   </center>
  </div>
  <blockquote>
   <p class="epigrafe">
    <small>
     <strong>
      <a href="http://legislacao.planalto.gov.br/legisla/legislacao.nsf/Viw_Identificacao/DEC%209.255-2017?OpenDocument">
       DECRETO Nº 9.255, DE 29 DE DEZEMBRO DE 2017
      </a>
     </strong>
    </small>
   </p>
  </blockquote>
  <table border="0" cellpadding="0" cellspacing="0" width="100%">
   <tr>
    <td width="51%">
     <a href="#art2">
      Vigência
     </a>
    </td>
    <td width="49%">
     <p align="justify">
      Regulamenta a
      <a href="../../2015/Lei/L13152.htm">
       Lei n
       º
       13.152, de 29 de julho de 2015
      </a>
      , que dispõe sobre o valor do salário mínimo e a sua política de valorização de longo prazo.
     </p>
    </td>
   </tr>
  </table>
  <p>
   <b>
    O PRESIDENTE DA REPÚBLICA
   </b>
   , no uso da atribuição que lhe confere o art. 84,
   <b>
    caput
   </b>
   , inciso IV, da Constituição, e tendo em vista o disposto no art. 2
   º
   da Lei n
   º
   13.152, de 29 de julho de 2015,
  </p>
  <p>
   <b>
    DECRETA
   </b>
   :
  </p>
  <p>
   <a name="art1">
   </a>
   Art. 1
   º
   A partir de 1
   º
   de janeiro de 2018, o salário mínimo será de R$ 954,00 (novecentos e cinquenta e quatro reais).
  </p>
  <p>
   Parágrafo único.  Em virtude do disposto no
   <b>
    caput
   </b>
   , o valor diário do salário mínimo corresponderá a R$ 31,80 (trinta e um reais e oitenta centavos) e o valor horário, a R$ 4,34 (quatro reais e trinta e quatro centavos).
  </p>
  <p>
   <a name="art2">
   </a>
   Art. 2
   º
   Este Decreto entra em vigor em 1
   º
   de janeiro de 2018.
  </p>
  <p>
   Brasília, 29 de dezembro de 2017; 196
   º
   da Independência e 129
   º
   da República.
  </p>
  <p>
   MICHEL TEMER
   <br>
   </br>
   <i>
    Eduardo Refinetti Guardia
    <br>
     Esteves Pedro Colnago Junior
     <br>
      Helton Yomura
     </br>
    </br>
   </i>
  </p>
  <p>
   Este texto não substitui o publicado no DOU de 29.12.2017 - Edição extra "D"
  </p>
  <p>
   *
  </p>
  <p>
  </p>
  <p>
  </p>
  <p>
  </p>
  <p>
  </p>
  <p>
  </p>
 </body>
</html>
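
The script above strips style from every tag and also unwraps <s>. For a narrower variant that follows the question literally (style removed only from <p>, and only <font> and <span> unwrapped), something like the sketch below may work. The function name clean_markup, its unwrap_tags parameter and the sample string are hypothetical and used only for illustration; the BeautifulSoup calls (find_all, del tag[...], unwrap) are the same kind used in the script above.

```python
from bs4 import BeautifulSoup


def clean_markup(html, unwrap_tags=("font", "span")):
    """Drop style from <p> tags and unwrap the given tags, keeping their contents.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove the style attribute only from <p> elements that actually have one.
    for p in soup.find_all("p", style=True):
        del p["style"]

    # Replace each <font>/<span> with its own children; text and nested tags stay.
    for tag in soup.find_all(list(unwrap_tags)):
        tag.unwrap()

    return str(soup)


# Small made-up sample in the style of the document from the question.
sample = (
    '<div style="color: red"><p style="text-indent:1.0cm">'
    '<font face="Arial"><b><span style="color: black">DECRETA</span></b>:</font>'
    '</p></div>'
)
result = clean_markup(sample)
print(result)
```

In this sample the style on the <div> is left alone, the <p> loses its style, and the <b> that sat inside <font> and around <span> survives with its text. unwrap() is the call that keeps the contents; decompose() would delete the tag together with everything inside it.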
