How to parse syntactically malformed HTML?

11

As part of a procedure, I need to extract the contents of a table present on a page. I’m using cURL to fetch the raw HTML and the Simple HTML DOM Parser to parse and process it.

<?php

// (...)
require_once('simple_html_dom.php');
// (...)
$objPagina = str_get_html($strPagina);
$objItems = $objPagina->find('table', 0);
echo $objItems->outertext;

?>

At first everything works as expected. In one specific case, however, the HTML I receive is malformed, and Simple HTML DOM Parser cannot process it correctly: it returns an incorrect result.

The browser displays the content correctly, but as far as I know browsers are designed to render even malformed HTML. In fact, if I open Firefox’s developer tools, copy the HTML shown there, paste it into a text file and use that file as the parser’s input, I get the desired result.

Since I can’t modify the HTML I receive, how can I process it programmatically? As far as I can tell, regular expressions are not the right tool for this.

  • 3

    +1 for the most epic answer in Stack Overflow history.

  • @Rodrigorigotti Epic and misleading, because it is not correct to say that you should never use regex to parse any chunk of HTML... By the way, either the new site design or my version of Chrome is boxing in the Zalgo text; the characters no longer spill upward and downward.

  • @bfavaretto I have never managed to parse an HTML document completely using regular expressions, in my opinion for the same reasons given in the answer I cited. I believe some less complex elements can be read, but not an entire HTML document.

  • Yes, Rodrigo, a whole document is not feasible. But that answer drew so much attention that on SO any question about extracting something from HTML (for example, an attribute of a single tag in a 20-character string) ends up marked as a duplicate of it.

  • 1

    One option is to use a headless browser to interpret the document and generate a well-formed version.

2 answers

4


You can try PHP’s Tidy extension.
With this extension it is possible to validate and repair malformed HTML.

An example (taken from the PHP manual):

// Configuration
$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'wrap'         => 200
);

// Tidy: parse the input, then clean and repair it
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Output the repaired markup
echo $tidy;

Note, however, that according to the extension’s official website the last update appears to have been in 2009, so this solution may not solve your problem.
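
To tie this back to the question, the repaired markup can then be passed to whatever parser is already in use. The following is only a minimal sketch, assuming the tidy extension is loaded; the helper name repairHtml, the variable $strRepaired and the sample markup are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper: returns a repaired copy of $html, or the input
// unchanged when the tidy extension is missing or the repair fails.
function repairHtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        return $html;
    }

    $config = [
        'output-xhtml' => true,
        'wrap'         => 0, // 0 disables line wrapping
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? $html : $repaired;
}

// Made-up malformed input: the <td>, <tr> and <table> tags are never closed.
$strPagina   = '<table><tr><td>A<td>B<tr><td>C<td>D';
$strRepaired = repairHtml($strPagina);

// $strRepaired could now be handed to the parser from the question, e.g.
// $objPagina = str_get_html($strRepaired);
echo $strRepaired;
```

The fallback to the unmodified input is a design choice for the sketch; in a real script it may be preferable to fail loudly when the extension is unavailable.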

  • "Newest Official release 5.6.0, November 2017. If you want the Latest Official, get the master branch of our Tidy-Html5 Repository" http://www.html-tidy.org/ or https://github.com/htacg/tidy-html5

0

Try using xmllint directly.

1) Install xmllint (a free, minimal tool).

"I need to extract the contents of a table present on a page"

2) Invoke it:

xmllint --html --xpath '//table' 'http://my.remote.page/x.html' > tabelas.txt

(adapt the XPath expression to your needs) and, if it gives results, call it from PHP.
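
If shelling out is not desirable, a comparable result may be reachable from inside PHP itself: the bundled DOM extension is built on libxml2, the same library that xmllint belongs to, and its HTML parser is lenient with broken markup. The following is a minimal sketch of that swapped-in alternative, not the xmllint invocation above; it assumes the DOM extension is enabled, and the function name extractFirstTable and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first <table>,
// or null when the input is empty or contains no table.
function extractFirstTable(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect parser warnings internally instead of emitting them;
    // malformed input would otherwise raise a warning per recovered error.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same XPath expression as in the xmllint call; adapt as needed.
    $xpath = new DOMXPath($doc);
    $table = $xpath->query('//table')->item(0);

    return $table === null ? null : $doc->saveHTML($table);
}

// Made-up malformed input: none of the opened tags are closed.
$strPagina = '<p>Intro<table><tr><td>A<td>B<tr><td>C<td>D';
echo extractFirstTable($strPagina);
```

Note that loadHTML() assumes ISO-8859-1 unless the document declares its own charset, so pages in UTF-8 without a charset declaration may need extra handling.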
