Fix HTML of a page using Ruby/Nokogiri

I'm having some trouble consuming HTML generated by a third-party page, because the HTML is missing some closing tags.

For example:

<div>
  <li>
    <div>
      <div>test
        test
      </div>
      <li>
        <div>test 
          <div>test2</div>
        </div>

Parsing it with Nokogiri:

require 'nokogiri'
html = Nokogiri::HTML(File.open('origem.html'))

The result is:


  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><div>
      <li>
        <div>
          <div>test
            test
          </div>
          <li>
            <div>test 
              <div>test2</div>
            </div>
    </li>
    </div>
    </li>
    </div></body></html>

The correct result would be something like:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<div>
<li>
  <div>
    <div>test
      test
    </div>
  </div>
</li>
<li>
  <div>test 
    <div>test2</div>
  </div>
</li>
</div>
</body></html>

1 answer

The answer was posted on the OR.

Basically, use the Nokogumbo gem together with Nokogiri: its HTML5 parser applies the same HTML error correction that Google Chrome uses!

Works beautifully!
