Explaining the preg_match:
The function preg_match() accepts 5 parameters, the first two being mandatory.
- The first parameter is the regular expression ($Pattern).
- The second parameter is the string to search in ($Subject).
- The third parameter is an array that receives the matched text ($Matches).
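The two remaining optional parameters are flags and a starting offset. A minimal sketch of all five in use, assuming a made-up input string (the variable names are illustrative, not from the question):

```php
<?php
// Illustrative input; not part of the original question.
$subject = 'id=42; id=77';
$matches = array();

// Third argument: receives the full match at index 0 and each group after it.
$found = preg_match('/id=(\d+)/', $subject, $matches);
// $found is 1, $matches is array('id=42', '42')

// Fourth argument (flags): PREG_OFFSET_CAPTURE also reports each match's byte offset.
// Fifth argument (offset): start searching at byte 6 instead of byte 0.
$withOffsets = array();
preg_match('/id=(\d+)/', $subject, $withOffsets, PREG_OFFSET_CAPTURE, 6);
// $withOffsets[1] is array('77', 10)
```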
I ran a test on your code and look what came out:
$texto = "<!DOCTYPE html> <html> <head> </head> <body> <div> </div> <div> </div> <div> <h3> </h3> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <br /><br /><br /><br /><br /><br /> <h6> </h6> <br /><br /><br /><br /> <p>Portaria nº 69 de 18/01/2017 - Publicada no DOU de 19/01/2017</p> <br /> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <h3>concluiu em {DT_APR} o <br />{NOME_CURSO}<br />realizado pela ---- na qualidade de aluno(a), perfazendo um total de {CARGA_HOR} horas.</h3> <h3><em> </em></h3> <h4><em>Cidade </em>{DATE_EXT}<em> .</em></h4> </div> </body> </html>";
$matches = array();
$resultado = preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $matches);
var_dump($resultado, $matches);
The var_dump output:
int(1) array(1) { [0]=> string(37) "<!DOCTYPE html> <html> <head> </head>" }
Explaining what your regex does, piece by piece:
^
Anchors the match at the start of the string
<!
Matches the literal characters <!
\w+
Matches any word character (equivalent to [a-zA-Z0-9_])
+
Quantifier: matches one or more times, as many times as possible, giving back as needed (greedy)
\s
Matches any whitespace character (equivalent to [\r\n\t\f\v ])
\w+
Matches any word character (equivalent to [a-zA-Z0-9_])
+
Quantifier: matches one or more times, as many times as possible, giving back as needed (greedy)
>
Matches the literal character >
\s
Matches any whitespace character (equivalent to [\r\n\t\f\v ])
<
Matches the literal character <
\w+
Matches any word character (equivalent to [a-zA-Z0-9_])
+
Quantifier: matches one or more times, as many times as possible, giving back as needed (greedy)
>
Matches the literal character >
\s
Matches any whitespace character (equivalent to [\r\n\t\f\v ])
<
Matches the literal character <
\w+
Matches any word character (equivalent to [a-zA-Z0-9_])
>
Matches the literal character >
\s
Matches any whitespace character (equivalent to [\r\n\t\f\v ])
<
Matches the literal character <
\/
Matches the literal character /
\w+
Matches any word character (equivalent to [a-zA-Z0-9_])
>
Matches the literal character >
Putting all this together gives your regex /^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/
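Since the pattern does match, the same expression could in principle be passed to preg_replace() to delete the header. This is only a sketch, using a shortened stand-in for $texto and a variable name ($withoutHeader) chosen for illustration:

```php
<?php
// Shortened stand-in for the $texto string shown earlier.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> </div> </body> </html>';

// Same pattern as in the question; an empty replacement deletes the matched prefix.
$withoutHeader = preg_replace('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', '', $texto);

echo $withoutHeader;
// Result: ' <body> <div> </div> </body> </html>'
```

Note that this removes only the opening header: the closing </html> tag is still there, and the pattern stops matching as soon as the header varies (attributes, extra whitespace, content inside head).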
If what you want is to remove HTML tags from a string, just use the strip_tags() function:
(PHP 4, PHP 5, PHP 7)
strip_tags - Remove HTML and PHP tags from a string
strip_tags ( string $str [, string $allowable_tags ] )
Parameters
str
The input string.
allowable_tags
You can use the second parameter, which is optional, to indicate tags that should not be removed.
Note:
HTML comments and PHP tags are also removed. This behavior cannot be changed with allowable_tags.
Example strip_tags()
<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";
// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>
The above example will print:
Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>
Reference: PHP: strip_tags
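Applied to the case in the question, a rough sketch could look like the following. The sample string is a shortened stand-in for the stored HTML, and the allow-list is an assumption based on the tags visible in $texto, so adjust it to whatever the certificate layout really needs:

```php
<?php
// Shortened stand-in for the stored certificate HTML.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> </div> </body> </html>';

// Assumed allow-list: the tags the certificate body appears to use.
$allowed = '<div><h3><h4><h5><h6><p><br><em>';

// Tags outside the allow-list (such as <html>, <head> and <body>) are removed;
// their text content, here only whitespace, is kept, hence the trim().
$fragment = trim(strip_tags($texto, $allowed));

echo $fragment;
```

Unlike the regex, this does not depend on the exact spacing of the header, and it removes the closing </body> and </html> as well.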
Are you getting this HTML back from a database query?
– Alvaro Alves
Yes. I need to remove this header because it is causing an error in another process, but I got stuck on the regex, and trying to remove it with str_replace does not work either.
– Cesar Vinicius
The correct form is
preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $output); var_dump($output);
. It is probably returning "0" because you are not capturing the result.
– Valdeir Psr
Parsing HTML with regex is simply wrong, and it is normal that it does not handle every case you want. Related question: Why Regex should not be used to handle HTML?
– Isac
In my humble opinion, the right approach is to use an HTML parser (unless you are absolutely sure that the strings always start with exactly <!DOCTYPE html> <html> <head> </head>). If there is anything inside head, for example, the regex no longer works. The same happens if there is not exactly 1 space between tags, if there are comments, if there are attributes (<html lang="en">), etc. I understand the "temptation" to use regex, since it seems so easy and fast (and often it is), but for HTML parsing it is better to use dedicated parsers: https://stackoverflow.com/a/1732454
– hkotsubo
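A parser-based version could be sketched with PHP's bundled DOM extension. The sample string is a shortened stand-in for the stored HTML, and $fragment is an illustrative name. One caveat: loadHTML() does not assume UTF-8 unless the document declares a charset, so text such as "nº" may need an explicit encoding hint.

```php
<?php
// Shortened stand-in for the stored HTML; requires the DOM extension.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> <h3>Certificamos que</h3> </div> </body> </html>';

// Collect libxml warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($texto);
libxml_clear_errors();

// Serialise only the children of <body>, leaving out the doctype, <html> and <head>.
$fragment = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $fragment .= $doc->saveHTML($child);
    }
}

echo trim($fragment);
```

This keeps working when head has content or the html tag carries attributes, which are the variations that break the regex.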
He is probably not parsing anything, just treating a string as a string, whether or not it contains HTML code. This is a string: "eu sou uma string", and this is also a string: "<b>eu sou uma string</b>"... now, parsing is another story.
– Sam
@hkotsubo hehe, I was going to post that legendary link myself but ended up not doing it! Someone saved the day :D. Good reading for anyone!
– Isac
@Sam as I said, if there is "absolute certainty" that the strings will always be in this format, regex will do. But if any HTML can show up (with the variations I mentioned), then it is better to use a parser, because the regex can end up getting too complicated. Most parsers let you manipulate the DOM (which in the end is what he wants to do: remove some tags). I am not against regex, I am against using a tool that is not the best fit for the case :-)
– hkotsubo
@hkotsubo The problem is that he wants to remove <!DOCTYPE html> <html> <head> </head>, correct?! How do you remove the opening <html> tag without removing the whole element using DOM handlers? If that is possible, then I agree with using them.
– Sam
@Sam But that is the point: if the beginning is always <!DOCTYPE html> <html> <head> </head>, isn't it easier to use substr? And if the beginning varies, maybe it is better to use a parser (unless the variations are simple, but without more examples I have no way to judge what is best). In fact, I would take a step back and find out why an HTML without the html tag at the start is needed in the first place (this seems to be the real problem) :-)
– hkotsubo
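The substr idea could be sketched as follows, assuming the header really is byte-for-byte constant ($prefix is an illustrative name, and the sample string is a shortened stand-in for the stored HTML):

```php
<?php
$prefix = '<!DOCTYPE html> <html> <head> </head>';
$texto  = '<!DOCTYPE html> <html> <head> </head> <body> <div> </div> </body> </html>';

// Remove the prefix only when the string really starts with it.
if (strncmp($texto, $prefix, strlen($prefix)) === 0) {
    $texto = substr($texto, strlen($prefix));
}

echo $texto;
// Result: ' <body> <div> </div> </body> </html>'
```

There is no regex engine involved, so there is nothing to backtrack, but any variation in the header leaves the string untouched.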
So, what happens is this: a PDF is generated using mpdf. It has 2 pages, but for some reason only the first page was showing up. After a lot of searching I saw that the problem is this html tag: it creates a whole HTML document inside another. As a test I took the field I mentioned above and pasted in by hand the string that was left over, and then it worked. That is why I need to strip these tags. I have no idea why they saved it like that in the database, but now it is up to me to sort out this mess.
– Cesar Vinicius
Did you manage to solve it?
– Sam
@Sam yes, I solved it using strip_tags. Thank you.
– Cesar Vinicius
But why can't you have the <!DOCTYPE html> <html> <head> </head> if it is HTML? This seems very strange to me, as @hkotsubo has already mentioned, and only by following that lead can you really solve this properly. The solution you have looks like a patch to me, one that may need more patches in the future.
– Isac
@Sam Just to add: there is a problem when the regex does not match the string. See here that the regex has to backtrack (go back and forth several times because it has not found a match). For a single case it may not even make a dent in performance, but if you have to process many files, for example, it can start to make a difference. In short, this does not mean regex is unsuitable for a single file, but it is important to know the implications of each approach: https://www.regular-expressions.info/catastrophic.html
– hkotsubo
@Isac There was an HTML page containing a header, divs, style tags and everything else, and the string I posted above was inserted into one of those divs. All of this lives in a PHP file that generates a PDF, which may or may not have more than 1 page. When it was just one page it worked fine. When there were more, for some reason I really do not understand, it would not create the second page no matter what. After much testing I saw that the cause was this HTML header coming in the string. I removed the header and it worked normally.
– Cesar Vinicius