I need to ignore img tags, because some images are embedded as Base64 and contain //,
which is recognized as the start of a comment and removed, breaking the whole HTML.
function limpa_html($html) {
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';
    $html = preg_replace($pattern, '', $html); // strip JS comments
    return $html;
}
This function suits me very well, but images like
<img src="data:image/jpeg;base64,/9j/4AAQSkZJRgAAQABAAD/2wBDAAQgHBwcJCQg//
break the HTML. The idea would be to ignore img tags or to limit the code to script tags only, but my regex knowledge does not go that far.
Here is a simulation of what happens:
<script>
// This is a comment
/* This is another comment */
// The following is not a comment
var src="//google.com";
</script>
<img src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQg//ejhdbjkebdklebdklenbdknedklnekdlelde>
In the example above, the final part of the img tag (//ejhdbjkebdklebdklenbdknedklnekdlelde) is also removed.
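One possible way to limit the cleanup to script tags, as described above, is to match each script block first and apply the comment pattern only to its contents. The following is a minimal sketch, not a definitive solution: the function name limpa_html_scripts is hypothetical, the comment pattern is the one from the question unchanged, and it assumes the closing script tag never appears inside a JavaScript string.

```php
<?php
// Hypothetical helper: removes JS comments only inside <script>...</script>,
// leaving everything else (such as Base64 img tags) untouched.
function limpa_html_scripts(string $html): string
{
    // Comment pattern taken unchanged from the question.
    $commentPattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';

    $result = preg_replace_callback(
        // Group 1: opening tag, group 2: script body, group 3: closing tag.
        // "i" makes the tag name case-insensitive, "s" lets "." span newlines.
        '/(<script\b[^>]*>)(.*?)(<\/script>)/is',
        function (array $m) use ($commentPattern): string {
            // Strip comments from the body only; fall back to the
            // original body if preg_replace reports an error (null).
            $body = preg_replace($commentPattern, '', $m[2]) ?? $m[2];
            return $m[1] . $body . $m[3];
        },
        $html
    );

    // preg_replace_callback returns null on error; keep the input in that case.
    return $result ?? $html;
}
```

Called as limpa_html_scripts($html) in place of limpa_html($html), this should leave the img tag from the simulation intact while still removing the comments inside the script block. The limitations of the original pattern still apply within scripts, for example a // inside a string that is not directly preceded by a quote, a colon, or a backslash.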
Take a look here https://www.php.net/manual/en/class.domxpath.php#87645
– Marcos Xavier