bs4: How to wrap incomplete HTML code?

3

Hello, I came across incomplete HTML documents that are missing the "html" and "body" tags.

Here is the code I implemented:

import bs4

content='''
<head>
 <title>
  my page
 </title>
</head>
  <table border="0" cellpadding="0" cellspacing="0">
   <tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''

soup = bs4.BeautifulSoup(content, 'html.parser')

I tried the snippet below, which raises an error:

tag = soup.new_tag('html')
tag.wrap(soup)

ValueError: Cannot replace one element with another when the element to be replaced is not part of a tree.

I also tried this other approach, which scrambles the order of the tags:

for item in soup.find_all():
    tag.append(item.extract())
soup = tag

<body>
 <head>
 </head>
 <title>
  my page
 </title>
 <div>
 </div>
 <center>
 </center>
 <table border="0" cellpadding="0" cellspacing="0">
 </table>
 <tr>
 </tr>
 <td>
 </td>

How can I use bs4 to wrap the code in 'html' and 'body' tags?

1 answer

1

For that you will need the html5lib parser, which builds the document the way a browser does and adds the missing "html" and "body" tags.

pip install html5lib
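
If it is unclear whether the install worked, one way to check is to let bs4 report it: asking for a parser that is not installed raises bs4.FeatureNotFound. The sketch below tries html5lib and falls back to the built-in html.parser. The helper name parse_html and the shortened sample string are made up for illustration and are not from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_html(markup):
    """Hypothetical helper: prefer html5lib, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, 'html5lib'), 'html5lib'
    except FeatureNotFound:
        # html5lib is not installed, so use the parser bundled with Python.
        return BeautifulSoup(markup, 'html.parser'), 'html.parser'


# Shortened stand-in for the fragment in the question.
sample = '<head><title>my page</title></head><p>#1</p>'
soup, parser_used = parse_html(sample)

print(parser_used)            # which parser was actually used
print(soup.html is not None)  # whether an <html> wrapper was added
```

Only the html5lib branch adds the wrapper tags; with the fallback the fragment is parsed as-is.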

I tried it in my console and this was the result:

In [2]: import bs4

In [3]: content='''
<head>
 <title>
  my page
 </title>
</head>
  <table border="0" cellpadding="0" cellspacing="0">
   <tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
'''

In [4]: soup = bs4.BeautifulSoup(content, 'html5lib')

In [5]: soup
Out[5]: 
<html><head>
 <title>
  my page
 </title>
</head>
  <body><table border="0" cellpadding="0" cellspacing="0">
   <tbody><tr>
    <td>
     <p>
      <img alt="Brastra.gif (4376 bytes)" height="82" src="../../DNN/Brastra.gif"/>
     </p>
    </td>
    <td>
     <p>
      <strong>
       Titulo 1
       <br/>
       Titulo 2
       <br/>
       Titulo 3
      </strong>
     </p>
    </td>
   </tr>
  </tbody></table>
 <small>
  <strong>
   <a href="http://example.com/">
    Link.
   </a>
  </strong>
 </small>
<p>
  <a href="http://example.com/">I linked to <i>example.com</i></a>
</p>
<p>#1</p>
<p>#2</p>
</body></html>
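
If installing html5lib is not an option, the wrapping can probably also be done by hand with html.parser. Two things seem to have gone wrong in the attempts from the question: wrap() wraps the element it is called on, so tag.wrap(soup) tries to put the brand-new tag (which has no parent yet) inside the soup, and find_all() returns every descendant, so extracting each one flattens the tree. The sketch below moves only the top-level nodes. The shortened content string is a stand-in for the original fragment, and sending "head" to "html" and everything else to "body" is my assumption about the desired layout.

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the fragment in the question.
content = '''<head><title>my page</title></head>
<table><tr><td><p>Titulo 1</p></td></tr></table>
<p>#1</p>
<p>#2</p>
'''

soup = BeautifulSoup(content, 'html.parser')

html = soup.new_tag('html')
body = soup.new_tag('body')

# Copy the list first: extract() changes soup.contents while we loop.
for node in list(soup.contents):
    node.extract()
    if isinstance(node, Tag) and node.name == 'head':
        html.append(node)   # <head> goes directly under <html>
    else:
        body.append(node)   # everything else (text included) goes in <body>

html.append(body)   # <body> comes after <head>
soup.append(html)   # the soup now has a single top-level <html> tag

print(soup.prettify())
```

Unlike html5lib, this does not insert "tbody" or repair other malformed markup; it only adds the two wrapper tags.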
