Remove JavaScript comments from the script tags of an HTML document, ignoring content in other tags (e.g., Base64 images)

I need to ignore img tags because some images are embedded as Base64 and contain //, which is recognized as a comment and removed, breaking the whole HTML.

function limpa_html($html){
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';
    $html = preg_replace($pattern, '', $html); // remove JS comments
    return $html;
}

This function suits me very well, but images like <img src="data:image/jpeg;base64,/9j/4AAQSkZJRgAAQABAAD/2wBDAAQgHBwcJCQg// break the HTML. The idea would be to ignore img tags or to restrict the regex to script tags only, but my regex knowledge does not go that far.

Here is a simulation of what happens:

<script>

// This is a comment
/* This is another comment */

// The following is not a comment
var src="//google.com"; 

</script>
<img src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQg//ejhdbjkebdklebdklenbdknedklnekdlelde>

In the example above, the final part of the img tag (//ejhdbjkebdklebdklenbdknedklnekdlelde) is also removed.
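
The behaviour described above can be reproduced in isolation. The sketch below is only an illustration: it reuses the regex from the question unchanged, and the data URI is a shortened, made-up value (an assumption, not the real image). The lookbehind protects "//google.com" because it follows a double quote, but nothing protects the // in the middle of the Base64 data.

```php
<?php
// The function from the question: the regex runs over the whole HTML string.
function limpa_html($html) {
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';
    return preg_replace($pattern, '', $html);
}

// Shortened, made-up data URI; only the "//" inside it matters here.
$img = '<img src="data:image/jpeg;base64,/9j/4AAQ//ejhdb">';
$js  = 'var src="//google.com";';

// The "//" after the double quote is skipped by the lookbehind, so this line survives.
echo limpa_html($js), "\n";

// The "//" after "AAQ" is treated as a comment start; the rest of the line is dropped.
echo limpa_html($img), "\n";
```

The second call returns the tag cut off right before the //, which is why the closing quote and > disappear from the page.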

  • Take a look here https://www.php.net/manual/en/class.domxpath.php#87645

1 answer

Instead of applying the regex to the whole HTML, you can use DOMDocument to select only the script tags and apply the regex to their contents.

This way you are guaranteed to change only the JavaScript code, without having to worry about the other tags. Example:

$text = <<<TEXTO
<script>

// This is a comment
/* This is another comment */

// The following is not a comment
var src="//google.com"; 

</script>
<img src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQg//ejhdbjkebdklebdklenbdknedklnekdlelde>
TEXTO;

// renamed the function and its parameter to make clear what it does
function limpa_comentarios($jsCode) {
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';
    return preg_replace($pattern, '', $jsCode); // remove JS comments
}

// load the HTML
$dom = new DOMDocument;
$dom->loadHTML($text); // $text is a string containing the HTML
$xpath = new DOMXPath($dom);
// find the script tags and apply the regex only to them
foreach ($xpath->query("//script") as $script) {
    $newContent = limpa_comentarios($script->nodeValue);
    // replace the script's content
    $script->nodeValue = '';
    $script->appendChild($dom->createTextNode($newContent));
}
echo $dom->saveHTML();

But if the goal is to minify the HTML/JS, you might want to look for dedicated libraries instead of trying to do everything by hand.
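
If a drop-in replacement for limpa_html is preferred (a string goes in, a string comes out), the DOMDocument approach from this answer could be wrapped roughly as follows. This is a sketch under assumptions: the sample HTML and its shortened data URI are made up, and the libxml_use_internal_errors() calls are an addition of mine to keep parser warnings about imperfect markup out of the output.

```php
<?php
// Same comment-stripping regex as in the answer.
function limpa_comentarios($jsCode) {
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\/\/.*))/';
    return preg_replace($pattern, '', $jsCode);
}

// Hypothetical replacement for limpa_html: only script contents are touched.
function limpa_html($html) {
    // Assumption: collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//script') as $script) {
        $clean = limpa_comentarios($script->nodeValue);
        // Empty the script element, then put the cleaned code back as a text node.
        $script->nodeValue = '';
        $script->appendChild($dom->createTextNode($clean));
    }
    return $dom->saveHTML();
}

// Made-up sample document with a shortened data URI.
$html = '<html><body><script>' . "\n"
      . '// a comment' . "\n"
      . 'var src="//google.com";' . "\n"
      . '</script><img src="data:image/jpeg;base64,/9j/4AAQ//ejhdb"></body></html>';

echo limpa_html($html);
```

Note that saveHTML() serializes a whole document, so the result may gain a DOCTYPE and html/body wrappers that the input did not have. That matters if the function is called on template fragments rather than on the final page.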

  • I understand your point, but I use a template engine that assembles the whole HTML from variables in several parts, so it would be interesting to include this in the template; otherwise I would have to take the template's output and apply your solution there. In fact the function cleans the whole HTML; I posted only the part that is actually damaging the final file.

  • @Thiagocondé Then the ideal solution would be to modify the template so that it already generates the code without comments. If the template cannot be modified, there is no way around it: you have to process the HTML afterwards (or modify the function limpa_html to use DOMDocument and remove comments only from script tags, which from what I understand amounts to the same thing). Trying to write a regex that processes the whole HTML is much more complicated than it seems (not to mention that the current regex is already complicated enough).

  • Thanks for the help!! I think I solved it almost by accident, although I will have to review all the comments: the fix is to add a space next to the two slashes in every comment. But thanks for your help, I did not even know about PHP's DOMDocument. It can be very useful from now on!! Anyway, thank you very much!!

  • [SOLVED] $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\s\/\/.*))/';
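
For readers who want to try the variant from the last comment, here is a small sketch. It assumes the pattern is the original regex with \s added in front of the two slashes (my reading of the comment, since the backslashes were lost when it was posted), and the sample strings are made up. With that change, // only starts a comment when whitespace comes right before it, so the Base64 data is left alone.

```php
<?php
// Assumed reading of the [SOLVED] pattern: "//" must be preceded by whitespace.
function limpa_html($html) {
    $pattern = '/(?:(?:\/\*(?:[^*]|(?:\*+[^*\/]))*\*+\/)|(?:(?<!\:|\\\|\'|\")\s\/\/.*))/';
    return preg_replace($pattern, '', $html);
}

// Made-up samples: a shortened data URI and two styles of line comment.
$img = '<img src="data:image/jpeg;base64,/9j/4AAQ//ejhdb">';
$js  = "var a = 1; // trailing comment\nvar b = 2;//no space before the slashes";

// No whitespace precedes the "//" in the data URI, so the tag is unchanged.
var_dump(limpa_html($img) === $img);

// Only the first comment is removed; the second has no whitespace before "//".
echo limpa_html($js), "\n";
```

The trade-off is that every comment in the templates has to be written with whitespace before the slashes, and a JavaScript string that contains whitespace followed by // would still be cut. Restricting the regex to script tags with DOMDocument avoids relying on that convention.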
