Remove text within a comment tag with PHP?

Question


Asked 6 years, 6 months ago


My client often copies and pastes text from news providers, and that text contains HTML comments.

The HTML itself does no harm, and the comments do not appear in the text editor when the text is inserted. However, I use the PHP mailer to send this text, and it ends up making these comments visible.

Here is an example:

How can I remove this tag before the text is saved to the database?

https://www.php.net/manual/en/function.strip-tags.php

– MagicHat

2019/05/10 at 19:39
@Magichat But how do I define that tag? echo strip_tags($text, ''); Like this?

– I_like_trains

2019/05/10 at 19:40
I don’t know, brother, it was just a suggestion; run some tests... I thought this function would fit, just look at its name... Someone will probably show up soon with preg_match, too...

– MagicHat

2019/05/10 at 19:46
strip_tags($texto) removes all HTML tags and leaves only the text in place. If you want to keep some tags, use the form strip_tags($texto,"[<tag>]+"), where [<tag>]+ stands for the literal list of one or more tags you want to let through. If you need to search the text and delete what is inside the tag, @Sam’s answer is the way to go.

– Augusto Vasques

2019/05/10 at 20:01

2 answers


Why not use preg_replace?

<?php
$string = 'a<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
 mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-
 ... -->b<!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
  mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-... -->';
$string = preg_replace('/<!--.*?-->/s', '', $string);
echo $string; // prints: ab
?>

The regular expression pattern /<!--.*?-->/s matches everything between <!-- and --> (including these delimiters) and removes it from the string.

Note that in the example above there is only one letter "a" and one letter "b" outside the comment blocks. So, after the replacement, only these two letters remain in the string.
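The role of the s modifier can be seen with a small sketch. The sample string below is hypothetical (it is not the asker's data) and contains a single comment that spans two lines:

```php
<?php
// Hypothetical sample: one comment that spans two lines.
$html = "before<!-- first line\nsecond line -->after";

// Without the s modifier the dot does not match "\n", so nothing is removed.
$withoutS = preg_replace('/<!--.*?-->/', '', $html);

// With the s modifier the dot also matches line breaks.
$withS = preg_replace('/<!--.*?-->/s', '', $html);

var_dump($withoutS === $html); // bool(true)
echo $withS, PHP_EOL;          // beforeafter
```

If the pasted text has comments broken across lines, leaving out the s is one possible reason for the pattern to appear not to work.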

Testing at IDEONE

I used your method, but for some reason it fails: when I isolate the HTML comment it is deleted, but with the full text it stops working.

– I_like_trains

2019/05/10 at 20:13
I made a recent change to the regex: /<!--.*?-->/s... check that you typed it exactly this way.

– Sam

2019/05/10 at 20:15


by hkotsubo • **55,826** points · Answer 1 · 2019-05-11T00:35:14+00:00

As already mentioned in the comments, one option is to use strip_tags, since it also removes the HTML comments from the string:

$string = '<p>texto</p><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
 mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-
 ... --><span>Mais texto</span><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
  mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-... -->';
echo strip_tags($string);

The catch is that this function removes all HTML tags, leaving only the text:

textoMais texto

But you can pass a list of tags that should be kept. For example:

// keep the <p> and <span> tags
echo strip_tags($string, "<p><span>");

Output:

<p>texto</p><span>Mais texto</span>
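A small sketch of the allow-list in both of its forms. The input string is hypothetical, and the array form assumes PHP 7.4 or later. Note that strip_tags keeps the attributes of the tags it lets through:

```php
<?php
// Hypothetical input: two allowed tags, one comment, one tag that is not allowed.
$html = '<p class="x">texto</p><!-- comment --><span>Mais texto</span><b>!</b>';

// String form of the allow-list.
$a = strip_tags($html, '<p><span>');

// Array form of the allow-list (assumes PHP 7.4 or later).
$b = strip_tags($html, ['p', 'span']);

echo $a, PHP_EOL; // <p class="x">texto</p><span>Mais texto</span>!
var_dump($a === $b); // bool(true)
```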

But if your string has many HTML tags, it can be a bit tedious to pass the list of all valid tags to strip_tags. So another option is to use DOMDocument:

$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
// remove the comments
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}

// get the document body (now without comments) as a string
$newString = "";
$body = $dom->getElementsByTagName('body')->item(0);
$children = $body->childNodes;
foreach ($children as $child) {
    $newString .= $dom->saveHTML($child);
}
echo $newString;

Output:

<p>texto</p>
<span>Mais texto</span>
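The DOMDocument approach above could be wrapped in a function along these lines. This is only a sketch: the name removeHtmlComments is hypothetical, and the calls to libxml_use_internal_errors are an addition of mine, on the assumption that pasted markup is often invalid and would otherwise make loadHTML print warnings:

```php
<?php
/**
 * Hypothetical helper: removes comment nodes from an HTML fragment
 * using DOMDocument and returns the remaining body content.
 */
function removeHtmlComments(string $html): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects an empty string
    }

    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Remove every comment node found anywhere in the document.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//comment()') as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return ''; // no body was built, e.g. the input held only comments
    }

    // Serialize the children of <body> back into a string.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

// The ">" inside the comment is not a problem for the HTML parser.
echo removeHtmlComments('<p>texto</p><!-- a > b --><span>Mais texto</span>');
```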

Regex

If you really want to use regex, you can use:

echo preg_replace('/<!--[^>]*-->/', '', $string);

Differences from the other answer (which is also correct):

The .*? means the regex matches zero or more occurrences of any character. Thanks to the s option used in the other answer, the dot also matches line breaks (by default it does not), and the ? makes the regex match as few characters as possible. This prevents the dot from swallowing a comment's closing delimiter "unintentionally": since the dot matches any character, a greedy .* can run past a --> if the regex finds that necessary.

I used [^>]*: zero or more occurrences of any character other than >. This makes the regex a little faster. Although .*? is very convenient and works, it has a price: since the dot matches any character, the regex has to go back and forth in the string several times, checking whether it needs to consume more characters. Note that the version with .*? needs 9 to 10 times more steps to run than the version with [^>]*.

Obviously, for a few runs on small strings the difference is irrelevant (it probably only matters for very long processing and very large strings). Even the number of steps can vary, since each language and engine has its own internal optimizations. In general, though, stating exactly what you want (and what you do not want) usually makes a regex faster than using .*.
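To make the difference between the quantifiers discussed above concrete, here is a minimal sketch. The sample string is hypothetical and has two comments on a single line:

```php
<?php
// Hypothetical sample with two comments on a single line.
$html = 'a<!-- one -->b<!-- two -->c';

// Greedy: .* runs to the last "-->", so the "b" between the comments is lost.
$greedy = preg_replace('/<!--.*-->/s', '', $html);

// Lazy: .*? stops at the first "-->" after each "<!--".
$lazy = preg_replace('/<!--.*?-->/s', '', $html);

// Negated class: [^>]* cannot run past a ">", so it also stops at the first "-->".
$negated = preg_replace('/<!--[^>]*-->/', '', $html);

echo $greedy, PHP_EOL;  // ac
echo $lazy, PHP_EOL;    // abc
echo $negated, PHP_EOL; // abc
```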

There is only one catch: if a comment contains a > other than the one that closes it, the regex above does not work. In that case, you can simply keep using .*?. Or, if you want to use something really complicated:

$string = '<p>texto</p><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1; mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable;> THIS CHARACTER HERE BREAKS EVERYTHINGmso-font-signature:0 0 0 0 0 0;} @font-face {font- ... --><span>Mais texto</span><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;  mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-... -->';
echo preg_replace('/<!--(?>[^-]*)(?>(?!-->)-[^-]*)*-->/', '', $string);

This regex uses the "unrolling the loop" technique (taken from a book) and several advanced features to detect a comment, such as atomic groups (the parts with (?>), which avoid backtracking, the "back and forth" to recheck the various parts of the regex that is not always needed. It solves the problem with > mentioned above and, although more complicated, it is still faster than using .*?. It also removes the comments, and the output is:

<p>texto</p><span>Mais texto</span>
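The three patterns from this answer and the other one can be compared side by side. This is a sketch with a shortened, hypothetical input in which the first comment contains a > that does not close it:

```php
<?php
// Hypothetical sample: the first comment contains a ">" that does not close it.
$html = '<p>texto</p><!-- a > b --><span>Mais texto</span><!-- c -->';

$patterns = [
    'negated class' => '/<!--[^>]*-->/',
    'lazy dot'      => '/<!--.*?-->/s',
    'unrolled loop' => '/<!--(?>[^-]*)(?>(?!-->)-[^-]*)*-->/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    $results[$name] = preg_replace($pattern, '', $html);
    echo $name, ': ', $results[$name], PHP_EOL;
}
// negated class: <p>texto</p><!-- a > b --><span>Mais texto</span>
// lazy dot: <p>texto</p><span>Mais texto</span>
// unrolled loop: <p>texto</p><span>Mais texto</span>
```

Only the version with [^>]* leaves the first comment behind; the other two remove both.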

Although regex is cool (I particularly like it a lot), I find the DOMDocument solution simpler, since it is a library built specifically to handle all the particularities of HTML syntax, something that, as you can see, is not so trivial with regex.

A regex can handle the simpler cases, but make your HTML a little more complicated and the problem becomes harder than it seemed. Obviously, if your strings do not fall into these more complex cases, .*? or any of the other expressions suggested above works fine.
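One more point for the regex route: preg_replace returns null when PCRE reports an error (for example, when an internal limit is hit on a very large string), and writing that null to the database would silently discard the text. A minimal sketch of a guard follows. The function name stripCommentsSafely and the choice to fall back to the original input are assumptions of mine, not something from the answers:

```php
<?php
/**
 * Hypothetical wrapper: strips HTML comments and falls back to the
 * untouched input if PCRE reports an error.
 */
function stripCommentsSafely(string $html): string
{
    $clean = preg_replace('/<!--.*?-->/s', '', $html);

    if ($clean === null) {
        // preg_replace() returns null on failure; preg_last_error() gives the error code.
        error_log('preg_replace failed, PCRE error code: ' . preg_last_error());
        return $html;
    }

    return $clean;
}

echo stripCommentsSafely("a<!-- x\ny -->b"), PHP_EOL; // ab
```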