As already mentioned in the comments, one option is to use strip_tags, since it removes HTML comments from the string:
$string = '<p>texto</p><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-
... --><span>Mais texto</span><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1;
mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-... -->';
echo strip_tags($string);
The catch is that this function removes all HTML tags as well, leaving only the text:
textoMais texto
But you can pass a list of tags that should be kept. For example:
// keep the <p> and <span> tags
echo strip_tags($string, "<p><span>");
Output:
<p>texto</p><span>Mais texto</span>
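As a small sketch of the allow-list idea above: assuming PHP 7.4 or newer, strip_tags also accepts the allowed tags as an array of tag names, which can be easier to maintain than one long string. The sample input and the $allowed list below are made up for illustration.

```php
<?php
// Assumes PHP 7.4+, where the second argument may be an array of tag names.
$html = '<p>texto</p><!-- some comment --><span>Mais texto</span><b>negrito</b>';

// Hypothetical allow-list for this example; names are given without angle brackets.
$allowed = ['p', 'span'];

// Comments are always stripped; <b> is removed because it is not in the list.
$clean = strip_tags($html, $allowed);
echo $clean, PHP_EOL;
```

Tags that are not in the list are dropped, but their text content stays in the result.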
But if your string has many HTML tags, it can be a bit tedious to pass the list of all valid tags to strip_tags. So another option is to use DOMDocument:
$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
// remove the comments
foreach ($xpath->query('//comment()') as $comment) {
    $comment->parentNode->removeChild($comment);
}
// get the document body (now without comments) as a string
$newString = "";
$body = $dom->getElementsByTagName('body')->item(0);
$children = $body->childNodes;
foreach ($children as $child) {
    $newString .= $dom->saveHTML($child);
}
echo $newString;
Output:
<p>texto</p>
<span>Mais texto</span>
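If this is needed in more than one place, the DOMDocument steps above could be wrapped in a function. This is only a sketch: removeHtmlComments is a hypothetical helper name, and the libxml error handling and the empty-string guard are additions that are not part of the original snippet.

```php
<?php
// removeHtmlComments() is a hypothetical helper, not a built-in PHP function.
function removeHtmlComments(string $html): string
{
    if ($html === '') {
        return ''; // loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them for imperfect HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the node list to an array before removing nodes from the tree.
    foreach (iterator_to_array($xpath->query('//comment()')) as $comment) {
        $comment->parentNode->removeChild($comment);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> back to an HTML string.
    $result = '';
    foreach ($body->childNodes as $child) {
        $result .= $dom->saveHTML($child);
    }
    return $result;
}

$clean = removeHtmlComments('<p>texto</p><!-- a > b --><span>Mais texto</span>');
echo $clean, PHP_EOL;
```

Because the HTML parser decides where each comment ends, a > inside a comment is not a problem here.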
Regex
If you really want to use regex, you can use:
echo preg_replace('/<!--[^>]*-->/', '', $string);
Differences from the other answer (which is also correct):
The .*? means that the regex matches zero or more occurrences of any character. Thanks to the s option used in the other answer, the dot also matches line breaks (by default it does not), and the ? makes the quantifier lazy, so the regex tries to match as few characters as possible. This prevents the dot from "unintentionally" consuming the end of a comment: since the dot matches any character, a greedy quantifier could run past a --> if the regex found it necessary.
I used [^>]* instead: zero or more occurrences of any character that is not >. This makes the regex a little faster. The .*?, despite being very convenient and working, has its price: since the dot matches any character, the engine has to keep going back and forth in the string, testing whether it needs to consume more characters. For the string above, the version with .*? needs 9 to 10 times more steps to run than the version with [^>]*.
Obviously, for a few runs on small strings the difference will be irrelevant (it probably only matters for heavy processing of very large strings). The number of steps can also vary, since each language and engine has its own internal optimizations. In general, though, stating exactly what you want (and what you don't want) usually makes a regex faster than using .*.
There's only one catch: if a comment contains a > (other than the one that closes the comment), the regex above does not work. In that case, you can keep using .*? with the s option. Or, if you want to use something really complicated:
$string = '<p>texto</p><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1; mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable;> ESSE CARACTERE AQUI QUEBRA TUDOmso-font-signature:0 0 0 0 0 0;} @font-face {font- ... --><span>Mais texto</span><!-- /* Font Definitions */ @font-face {font-family:"Cambria Math"; panose-1:2 4 5 3 5 4 6 3 2 4; mso-font-charset:1; mso-generic-font-family:roman; mso-font-format:other; mso-font-pitch:variable; mso-font-signature:0 0 0 0 0 0;} @font-face {font-... -->';
echo preg_replace('/<!--(?>[^-]*)(?>(?!-->)-[^-]*)*-->/', '', $string);
This regex uses the "unrolling the loop" technique (taken from this book) and several advanced features to detect a comment, such as atomic groups (the parts starting with (?>), which avoid backtracking: the "back and forth" the engine does to re-check the various parts of the regex, which is not always needed. It solves the problem with > mentioned above and, although more complicated, it is still faster than using .*?. It also removes the comments, and the output is:
<p>texto</p><span>Mais texto</span>
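The difference between the three patterns discussed above can be checked with a short sketch. The sample string is made up for illustration: its first comment contains a > and a line break, and its second comment is a plain one. The labels in the array are just names for this example.

```php
<?php
// Hypothetical sample: the first comment contains ">" and a line break.
$html = "<p>a</p><!-- x > y\nz --><span>b</span><!-- w -->";

$patterns = [
    'negated class' => '/<!--[^>]*-->/',
    'lazy dot'      => '/<!--.*?-->/s',
    'unrolled loop' => '/<!--(?>[^-]*)(?>(?!-->)-[^-]*)*-->/',
];

$results = [];
foreach ($patterns as $name => $pattern) {
    // Remove every match of the pattern and keep the result for comparison.
    $results[$name] = preg_replace($pattern, '', $html);
    echo $name, ': ', $results[$name], PHP_EOL;
}
```

With this input, the negated-class pattern removes only the second comment and leaves the first one in place, while the other two patterns remove both.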
Although regex is cool (I particularly like it a lot), I find the solution with DOMDocument simpler, since it is a library built specifically to handle all the particularities of HTML syntax, something that, as you can see, is not so trivial with regex.
A regex can handle the simpler cases, but make your HTML a little more complicated and the problem becomes harder than it seemed. Of course, if your strings don't fall into these more complex cases, .*? or any of the other expressions suggested above works fine.
https://www.php.net/manual/en/function.strip-tags.php – MagicHat
@Magichat But how do I specify that tag? echo strip_tags($text, '<!-- -->'); Like this? – I_like_trains
I don't know, brother, it was just a suggestion, run some tests... I thought this function would fit, judging by its name... Someone will probably show up here with preg_match soon, too... – MagicHat
strip_tags($texto) will remove all HTML tags and leave only the text in place. If you want to keep some tags, use the form strip_tags($texto,"[<tag>]+"), where [<tag>]+ stands for the literal list of one or more tags you want to keep. Now, if you need to search the text and delete what is inside a tag, @Sam's answer is the way to go. – Augusto Vasques