As part of a procedure, I need to extract the contents of a table present on a page. I’m using cURL to fetch the raw HTML and the Simple HTML DOM Parser to parse and process it.
<?php
// (...)
require_once('simple_html_dom.php');
// (...)
// $strPagina holds the raw HTML fetched with cURL
$objPagina = str_get_html($strPagina);
// Select the first <table> element on the page
$objItems = $objPagina->find('table', 0);
echo $objItems->outertext;
?>
Most of the time this works as desired. However, in one specific case the HTML I receive is malformed. In that case Simple HTML DOM Parser cannot process the HTML correctly and returns an incorrect result.
The browser displays the content correctly, but as far as I know browsers are designed to render malformed HTML correctly. In fact, if I open Firefox's "developer tools", copy the HTML displayed there, paste it into a text file and use that text as the source for the parser, I get the desired result.
Since I can’t modify the HTML I receive, what can I do to process it programmatically? It seems to me that I should not use regular expressions.
+1 for the most epic answer in Stack Overflow history.
– Rodrigo Rigotti
@RodrigoRigotti Epic and misleading, because it is not correct to say that you should never use regex to parse any chunk of HTML... By the way, either the new site design or my version of Chrome is clipping the Zalgo text; the characters no longer spill up and down.
– bfavaretto
@bfavaretto I have never been able to parse an HTML document completely using regular expressions - in my opinion, for the same reasons given in the answer I cited. I believe some less complex elements can be read, but not an entire HTML document.
– Rodrigo Rigotti
Yes, Rodrigo, a whole document can't be done. The thing is, that answer drew so much attention that on SO any question about extracting something from HTML (for example, an attribute of a single tag in a 20-character string) ends up marked as a duplicate of it.
– bfavaretto
One option is to use a headless browser to interpret the document and generate a well-formed version.
– bfavaretto
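A lighter alternative to a headless browser, offered only as a sketch: PHP's bundled DOM extension (libxml2) has a lenient HTML parser that closes many unclosed tags on its own. This is not what the comment above proposes, and libxml2 does not follow the same repair rules as a browser, so the result may differ from what Firefox shows. The helper name `extractFirstTable` and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): lets libxml2 repair the
// markup and returns the serialized first <table>, or null if none is found.
function extractFirstTable(string $html): ?string
{
    // Collect parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The XML declaration prefix is a common workaround to hint UTF-8 input.
    $loaded = $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $table = $doc->getElementsByTagName('table')->item(0);
    return $table === null ? null : $doc->saveHTML($table);
}

// Made-up sample with unclosed <td>, <tr> and <table> tags.
$broken = '<html><body><table><tr><td>A<td>B<tr><td>C<td>D</body></html>';
$fixed = extractFirstTable($broken);
echo $fixed, "\n";
```

If the repaired markup looks right, it can be handed to `str_get_html()` just like the text copied from the developer tools in the question.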