preg_match returning 0

Good afternoon, I have a string I get from the database

<!DOCTYPE html> <html> <head> </head> <body> <div> </div> <div> </div> <div> <h3> </h3> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <br /><br /><br /><br /><br /><br /> <h6> </h6> <br /><br /><br /><br /> <p>Portaria n&ordm; 69 de 18/01/2017 - Publicada no DOU de 19/01/2017</p> <br /> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <h3>concluiu em {DT_APR} o <br />{NOME_CURSO}<br />realizado pela ---- na qualidade de aluno(a), perfazendo um total de {CARGA_HOR} horas.</h3> <h3><em> </em></h3> <h4><em>Cidade </em>{DATE_EXT}<em> .</em></h4> </div> </body> </html>

and I’m trying to run a regex against this string: preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto);

but it always returns 0, whereas if I test the regex on sites like https://regexr.com/ it returns the expected result.

Could someone point out my mistake?

Thank you.

  • Do you get this HTML back from the database?

  • Yes, I need to strip this header because it is causing an error in another process, but I got stuck on the regex, and trying to remove it with str_replace does not work either.

  • The correct call is preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $output); var_dump($output); It is probably returning "0" because the result is not being captured.

  • Parsing HTML with regex is simply wrong, and it is normal that it does not cover every case you want. Related question: Why Regex should not be used to handle HTML?

  • In my humble opinion, the right approach really is to use an HTML parser (unless you are absolutely sure that the strings always start with exactly <!DOCTYPE html> <html> <head> </head>). If there is anything inside head, for example, the regex no longer works. The same goes if there isn’t exactly one space between the tags, if there are comments, if there are attributes (<html lang="en">), etc. I understand the "temptation" to use regex, it seems so easy and fast (and often it is), but for parsing HTML it is better to use a dedicated parser: https://stackoverflow.com/a/1732454

  • He is probably not parsing anything, just treating a string as a string, whether or not it contains HTML code. This is a string: "I am a string", and this is also a string: "<b>I am a string</b>"... now, parsing is another story.

  • @hkotsubo hehe, I was going to post that legendary link myself but ended up not doing it! Someone saved the day, though :D. Good reading for anyone!

  • @Sam as I said, if there is "absolute certainty" that the strings will always be in this format, regex will do. But if any HTML can show up (with the variations I mentioned), then it is better to use a parser, because the regex can end up getting too complicated. Most parsers let you manipulate the DOM (which in the end is what he wants to do: remove some tags). I am not against regex, I am against using the least suitable tool for each case :-)

  • @hkotsubo The problem is that he wants to remove <!DOCTYPE html> <html> <head> </head>, correct?! How do you remove the opening <html> tag without removing the whole element using DOM handlers? If that can be done, then I agree with using them.

  • @Sam But that’s the thing: if the beginning is always <!DOCTYPE html> <html> <head> </head>, isn’t it easier to use substr? And if the beginning varies, maybe it is better to use a parser (unless the variations are simple, but without more examples I have no way to judge what is best). In fact, I would go back a step and ask why an HTML document without the html tag at the start is needed in the first place (this seems to be the real problem) :-)

  • So what happens is this: a PDF is generated using mpdf. It has 2 pages, but for some reason only the first page was showing up. After a lot of searching I saw that the problem is this html tag: it creates a whole HTML document inside another one. As a test, I took the field I posted above and pasted the remaining string in by hand, and then it worked. That’s why I need to strip these tags. I have no idea why they saved it like that in the database, but now I’m the one left to sort out this mess.

  • Did you manage to solve it?

  • @sam yes, I got it working using strip_tags. Thank you.

  • But why can’t you have the <!DOCTYPE html> <html> <head> </head> if it is HTML? This seems very strange to me, as @hkotsubo has already mentioned, and only by following that path can you really fix it properly. The solution you have looks like a patch to me, one that may need more patches in the future.

  • @sam Just to add: there is a cost when the regex does not match the string. See here that the regex has to backtrack (go back and forth several times because it hasn’t found a match). For a single case it may not even make a dent in performance, but if you have to process many files, for example, it can start to make a difference. In short, this does not mean regex is unsuitable for a single file, but it is important to know the implications of each approach: https://www.regular-expressions.info/catastrophic.html

  • @Isac There was an HTML page with a header, divs, style tags and everything else, and the string I posted above was inserted into one of those divs. All of this lives in a PHP file that generates a PDF that may or may not have more than one page. When it was just one page, great, it worked normally. When there was more than one, for some reason I really don’t know, it would not create the second page no matter what. After much testing I saw that it was because of the HTML header coming in the string. I removed the header and it worked normally.

1 answer

Explaining the preg_match:

The function preg_match() accepts 5 parameters, the first two being mandatory.

  • The first parameter is the regular expression ($pattern).
  • The second parameter is the string in which the expression is searched ($subject).
  • The third parameter is an array that receives the text that was matched ($matches).
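
As a rough illustration of the parameters above: preg_match() returns 1 when the pattern matches, 0 when it does not, and false on error, and that return value does not depend on whether the third argument is passed. The sample string below is only a shortened, hypothetical version of the one in the question.

```php
<?php
// Hypothetical, shortened sample: only the first tags of the string from the question.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> </body> </html>';
$pattern = '/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/';

// Without the third argument the return value is still 1 on a match.
$withoutMatches = preg_match($pattern, $texto);

// With the third argument, $matches[0] receives the matched text.
$withMatches = preg_match($pattern, $texto, $matches);

// A subject that does not start with "<!" gives 0 (no match, not an error).
$noMatch = preg_match($pattern, ' ' . $texto);

var_dump($withoutMatches, $withMatches, $matches[0], $noMatch);
// int(1)
// int(1)
// string(37) "<!DOCTYPE html> <html> <head> </head>"
// int(0)
```

So a return value of 0 means the subject really did not match, for instance because of a single extra character before <!DOCTYPE.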

I ran a test on your code and look what came out:

$texto = "<!DOCTYPE html> <html> <head> </head> <body> <div> </div> <div> </div> <div> <h3> </h3> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <br /><br /><br /><br /><br /><br /> <h6> </h6> <br /><br /><br /><br /> <p>Portaria nº 69 de 18/01/2017 - Publicada no DOU de 19/01/2017</p> <br /> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <h3>concluiu em {DT_APR} o <br />{NOME_CURSO}<br />realizado pela ---- na qualidade de aluno(a), perfazendo um total de {CARGA_HOR} horas.</h3> <h3><em> </em></h3> <h4><em>Cidade </em>{DATE_EXT}<em> .</em></h4> </div> </body> </html>";

$matches = array();

$resultado = preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $matches);

var_dump($resultado, $matches);

The var_dump output:

int(1) array(1) { [0]=> string(37) "<!DOCTYPE html> <html> <head> </head>" }
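
Since the literal string matches, one possible explanation for the 0 is that the value coming from the database is not byte-for-byte the same as what is shown on screen. This is only a guess: the two stored values below (leading whitespace, entity-encoded tags) are hypothetical, and normaliseStored() is a helper name made up for this sketch.

```php
<?php
$pattern = '/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/';

// Hypothetical stored values: same header, but not identical to the literal.
$withLeadingSpace = "\n <!DOCTYPE html> <html> <head> </head> <body></body></html>";
$entityEncoded    = '&lt;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;/head&gt;';

// Made-up helper: decode HTML entities, then drop surrounding whitespace.
function normaliseStored(string $value): string
{
    return trim(html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

foreach ([$withLeadingSpace, $entityEncoded] as $value) {
    // Show the first bytes exactly as stored, including invisible ones.
    var_dump(bin2hex(substr($value, 0, 4)));

    // Match against the raw value, then against the normalised one.
    var_dump(preg_match($pattern, $value));
    var_dump(preg_match($pattern, normaliseStored($value)));
}
```

Dumping the first bytes with bin2hex() is usually enough to see whether the real value starts with something other than "<!".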

Explaining what your regex does:

^ Asserts the position at the start of the string

<! Matches the literal characters <!

\w+ Matches any word character (equivalent to [a-zA-Z0-9_])

+ Quantifier: matches one or more times, as many times as possible, backtracking if necessary (greedy)

\s Matches any whitespace character (equivalent to [\r\n\t\f\v ])

\w+ Matches any word character (equivalent to [a-zA-Z0-9_])

+ Quantifier: matches one or more times, as many times as possible, backtracking if necessary (greedy)

> Matches the literal character >

\s Matches any whitespace character (equivalent to [\r\n\t\f\v ])

< Matches the literal character <

\w+ Matches any word character (equivalent to [a-zA-Z0-9_])

+ Quantifier: matches one or more times, as many times as possible, backtracking if necessary (greedy)

> Matches the literal character >

\s Matches any whitespace character (equivalent to [\r\n\t\f\v ])

< Matches the literal character <

\w+ Matches any word character (equivalent to [a-zA-Z0-9_])

> Matches the literal character >

\s Matches any whitespace character (equivalent to [\r\n\t\f\v ])

< Matches the literal character <

\/ Matches the literal character /

\w+ Matches any word character (equivalent to [a-zA-Z0-9_])

> Matches the literal character >

Putting all this together gives your regex: /^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/
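
If the goal is only to drop that leading header, a minimal sketch would be to hand the same pattern to preg_replace(), or, when the prefix is always exactly the same, to cut it off with substr() as suggested in the comments. The input below is a shortened, hypothetical version of the string from the question.

```php
<?php
// Hypothetical, shortened sample of the string from the question.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> <h3>Certificamos que</h3> </div> </body> </html>';

// Option 1: replace the matched header with an empty string.
// The limit of 1 is only a safeguard, since ^ already anchors the match.
$withoutHeader = preg_replace('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', '', $texto, 1);

// Option 2: no regex at all, assuming the prefix never varies.
$prefix = '<!DOCTYPE html> <html> <head> </head>';
$viaSubstr = $texto;
if (strncmp($texto, $prefix, strlen($prefix)) === 0) {
    $viaSubstr = substr($texto, strlen($prefix));
}

var_dump($withoutHeader, $withoutHeader === $viaSubstr);
// string(62) " <body> <div> <h3>Certificamos que</h3> </div> </body> </html>"
// bool(true)
```

Note that both options leave <body> and the closing </body> </html> in place, so the result is still not a clean fragment.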

If what you want is to remove HTML tags from a string, just use the strip_tags() function:

(PHP 4, PHP 5, PHP 7)

strip_tags - Remove HTML and PHP tags from a string

strip_tags ( string $str [, string $allowable_tags ] )

Parameters

str The input string.

allowable_tags You can use the second parameter, which is optional, to indicate tags that should not be removed.

Note:

HTML comments and PHP tags are also removed. This cannot be changed with allowable_tags.

Example strip_tags()

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

The above example will print:

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>
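
Applied to the case in the question, a sketch could look like the one below. It assumes the certificate only needs the formatting tags listed in $allowed (that list is my guess from the sample) and uses a shortened, hypothetical version of the stored string.

```php
<?php
// Hypothetical, shortened sample of the string from the question.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> '
       . '<h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <br /> '
       . '</div> </body> </html>';

// Keep only the formatting tags the certificate seems to use (an assumption);
// the DOCTYPE, html, head and body tags are not in the list, so they are dropped.
$allowed = '<div><h3><h4><h5><h6><p><br><em>';

// trim() removes the spaces left behind where the outer tags used to be.
$fragment = trim(strip_tags($texto, $allowed));

var_dump($fragment);
```

Without the second argument, strip_tags() would also remove the h3, h5 and br tags, leaving only the text.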

Reference: PHP: strip_tags
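
For completeness, the parser route suggested in the comments could be sketched roughly like this with PHP's bundled DOM extension: load the string, then serialise only the children of body. This is an illustration under assumptions, not what was used in the thread. The input is a shortened, hypothetical sample, and loadHTML() may misread non-ASCII text unless the document declares its encoding.

```php
<?php
// Hypothetical, shortened sample of the string from the question.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> '
       . '<h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> '
       . '</div> </body> </html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($texto);
libxml_clear_errors();

// Serialise each child of <body>, which leaves out html, head and body themselves.
$fragment = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $fragment .= $doc->saveHTML($child);
    }
}

var_dump(trim($fragment));
```

Unlike the regex, this does not depend on the exact spacing between the header tags or on head being empty.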

  • I’ll try to do it that way. Thank you!

  • It worked with strip_tags. Thank you very much!!
