preg_match returning 0

Question

Asked 6 years, 10 months ago

Good afternoon, I have a string that I get from the database:

<!DOCTYPE html> <html> <head> </head> <body> <div> </div> <div> </div> <div> <h3> </h3> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <br /><br /><br /><br /><br /><br /> <h6> </h6> <br /><br /><br /><br /> <p>Portaria n&ordm; 69 de 18/01/2017 - Publicada no DOU de 19/01/2017</p> <br /> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <h3>concluiu em {DT_APR} o <br />{NOME_CURSO}<br />realizado pela ---- na qualidade de aluno(a), perfazendo um total de {CARGA_HOR} horas.</h3> <h3><em> </em></h3> <h4><em>Cidade </em>{DATE_EXT}<em> .</em></h4> </div> </body> </html>

and I’m trying to run a regex against this string: preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto);

but it always returns 0, whereas if I test the regex on sites like https://regexr.com/ it returns the expected result.

Could someone point out my mistake?

Thank you.

Do you get this HTML back from the database?

– Alvaro Alves

2018/09/27 at 20:50
Yes, I need to strip this header because it is causing an error in another process, but I got stuck on the regex, and trying to remove it with str_replace does not work either.

– Cesar Vinicius

2018/09/27 at 20:52
The correct form is preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $output); var_dump($output);. It is probably returning "0" because it is not capturing the result.

– Valdeir Psr

2018/09/27 at 21:03
Parsing HTML with regex is simply wrong, and it is to be expected that it will not handle every case you want. Related question: Why Regex should not be used to handle HTML?

– Isac

2018/09/27 at 21:23
In my humble opinion, the right approach is to use an HTML parser (unless you are absolutely sure that the strings always start with exactly <!DOCTYPE html> <html> <head> </head>). If there is anything inside head, for example, the regex no longer works. The same happens if there isn’t exactly one space between tags, if there are comments, if there are attributes (<html lang="en">), etc. I understand the "temptation" to use regex, since it seems so easy and fast (and often it is), but for HTML parsing it is better to use dedicated parsers: https://stackoverflow.com/a/1732454

– hkotsubo

2018/09/27 at 21:26
He is probably not parsing anything; he is just treating a string as a string, whether or not it contains HTML code. This is a string: "eu sou uma string", and this is also a string: "<b>eu sou uma string</b>"... now, parsing is another story.

– Sam

2018/09/27 at 21:29
@hkotsubo hehe, I was going to post that legendary link myself but ended up not doing it! Someone saved the day, though :D. Good reading for anyone!

– Isac

2018/09/27 at 21:31
@Sam as I said, if there is "absolute certainty" that the strings will always be in this format, regex will do. But if any HTML can show up (with the variations I mentioned), then it is better to use a parser, because the regex can end up getting too complicated. Most parsers let you manipulate the DOM (which in the end is what he wants to do: remove some tags). I am not against regex; I am against using the least suitable tool for the job :-)

– hkotsubo

2018/09/27 at 21:33
@hkotsubo The problem is that he wants to remove <!DOCTYPE html> <html> <head> </head>, correct?! How do you remove the opening <html> tag without removing the whole element using DOM handlers? If that is possible, then I agree with using them.

– Sam

2018/09/27 at 21:34
@Sam But that’s just it: if the beginning is always <!DOCTYPE html> <html> <head> </head>, isn’t it easier to use substr? And if the beginning varies, maybe it is better to use a parser (unless the variations are simple, but without more examples I have no way to judge what is best). In fact, I would go back a step and find out why an HTML document without the html tag at the start is needed in the first place (this seems to be the real problem) :-)

– hkotsubo

2018/09/27 at 21:48
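The substr suggestion in the comment above can be sketched roughly as follows. This is only an illustration, assuming the unwanted header is always exactly the text stored in $prefix; the shortened $texto is a made-up sample, not the real database value.

```php
<?php
// Assumption: the header to strip is always exactly this text.
$prefix = '<!DOCTYPE html> <html> <head> </head>';

// Shortened, made-up sample of the stored string.
$texto = '<!DOCTYPE html> <html> <head> </head> <body> <div> </div> </body> </html>';

// strncmp() returns 0 when the first strlen($prefix) bytes are identical.
if (strncmp($texto, $prefix, strlen($prefix)) === 0) {
    // Drop the prefix, then the whitespace that followed it.
    $texto = ltrim(substr($texto, strlen($prefix)));
}

echo $texto, PHP_EOL;
```

Note that this only removes the opening part; the closing </body> </html> stays in the string and would need separate handling.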
So what happens is this: a PDF is generated using mpdf. It has 2 pages, but for some reason only the first page was appearing. After a lot of searching I saw that the problem is this html tag: it creates a whole HTML document inside another. As a test I took the field I posted above and hard-coded the string that was left after removing the header, and then it worked. That’s why I need to strip these tags. I have no idea why they saved it like that in the database, but now I’m left to sort out this mess.

– Cesar Vinicius

2018/09/27 at 22:10
Did you manage to solve it?

– Sam

2018/09/28 at 11:18
@sam yes, I solved it using strip_tags. Thank you.

– Cesar Vinicius

2018/09/28 at 11:57
But why can’t you have the <!DOCTYPE html> <html> <head> </head> if it is HTML? This seems very strange to me, as @hkotsubo has already mentioned, and only by following that path can you really solve this correctly. The solution you have looks like a patch to me, one that may need more patches in the future.

– Isac

2018/09/28 at 14:52
@sam Just to add: there is a problem when the regex does not match the string. See here that the regex has to backtrack (go back and forth several times because it hasn’t found a match). For a single case it may not even "tickle" performance, but if you have to process many files, for example, it can start to make a difference. This does not mean that regex is unsuitable for a single file, but it is important to know the implications of each approach: https://www.regular-expressions.info/catastrophic.html

– hkotsubo

2018/09/28 at 14:56
@Isac there was an HTML page containing a header, divs, style tags and everything else, and the string I posted above was inserted into one of those divs. All of this is in a PHP file that generates a PDF that may or may not have more than 1 page. When it was just one page, wonderful, it worked normally. When there was more than one, for some reason I really don’t know, it would not create the second page for anything in the world. After much testing I saw that it was because of this HTML header coming in the string. I removed the header and it worked normally.

– Cesar Vinicius

2018/09/28 at 19:15

1 answer

by Alvaro Alves • **1,065** points · Answer 1 · 2018-09-27T21:21:02+00:00

Explaining preg_match:

The function preg_match() accepts 5 parameters, the first two being mandatory.

The first parameter is the regular expression ($pattern).
The second parameter is the string in which the expression is searched ($subject).
The third parameter is an array that stores the text that was matched ($matches).
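As a small illustration of the parameters above, the sketch below shows the three values preg_match() can return: 1 when the pattern matches, 0 when it does not, and false when an error occurs. The subject string and the extra patterns are made up for this example and are not from the question.

```php
<?php
// Made-up subject for illustration.
$subject = '<!DOCTYPE html> <html>';

// 1: the pattern matched; $matches[0] holds the matched text.
$found = preg_match('/^<!\w+\s\w+>/', $subject, $matches);

// 0: the pattern is valid but did not match; $noMatches stays empty.
$notFound = preg_match('/^<body>/', $subject, $noMatches);

// false: the pattern itself is invalid (unclosed group).
// The @ operator only hides the compilation warning for this demo.
$invalid = @preg_match('/(/', $subject);

var_dump($found, $matches, $notFound, $noMatches, $invalid);
```

Comparing the result with === helps here, because a loose check such as !$result treats 0 (no match) and false (error) the same way.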

I ran a test on your code and look what came out:

$texto = "<!DOCTYPE html> <html> <head> </head> <body> <div> </div> <div> </div> <div> <h3> </h3> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <br /><br /><br /><br /><br /><br /> <h6> </h6> <br /><br /><br /><br /> <p>Portaria nº 69 de 18/01/2017 - Publicada no DOU de 19/01/2017</p> <br /> <h3>Certificamos que</h3> <h5>{NOME_ALUNO}</h5> <h3>concluiu em {DT_APR} o <br />{NOME_CURSO}<br />realizado pela ---- na qualidade de aluno(a), perfazendo um total de {CARGA_HOR} horas.</h3> <h3><em> </em></h3> <h4><em>Cidade </em>{DATE_EXT}<em> .</em></h4> </div> </body> </html>";

$matches = array();

$resultado = preg_match('/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/', $texto, $matches);

var_dump($resultado, $matches);

The var_dump shows the result:

int(1) array(1) { [0]=> string(37) "<!DOCTYPE html> <html> <head> </head>" }
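The test above uses a string literal, so it does not reproduce whatever the database actually returns. Two possible explanations for the 0 (both are assumptions; the question does not confirm either) are that the stored value starts with invisible bytes, such as a UTF-8 byte order mark, or that the tags are stored as HTML entities. A rough sketch of how to check, using a hypothetical helper named firstBytesHex:

```php
<?php
$pattern = '/^<!\w+\s\w+>\s<\w+>\s<\w+>\s<\/\w+>/';

// Hypothetical helper (not a PHP built-in): hex-dump the first bytes of a
// string so that invisible characters become visible.
function firstBytesHex(string $text, int $length = 8): string
{
    return bin2hex(substr($text, 0, $length));
}

// Assumed case 1: the header is preceded by a UTF-8 byte order mark.
$withBom   = "\xEF\xBB\xBF<!DOCTYPE html> <html> <head> </head>";
$bomResult = preg_match($pattern, $withBom); // 0: "^" is followed by the BOM, not "<!"
$bomHex    = firstBytesHex($withBom, 5);     // "efbbbf3c21" (BOM, then "<!")

// Assumed case 2: the tags are stored as HTML entities.
$escaped       = '&lt;!DOCTYPE html&gt; &lt;html&gt; &lt;head&gt; &lt;/head&gt;';
$escapedResult = preg_match($pattern, $escaped);                          // 0
$decodedResult = preg_match($pattern, htmlspecialchars_decode($escaped)); // 1

var_dump($bomResult, $bomHex, $escapedResult, $decodedResult);
```

Running firstBytesHex($texto) on the real value would show whether the string really begins with 3c21, the bytes for "<!".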

Explaining what your regex is trying to do:

^ asserts the position at the start of the string

<! matches the literal characters <! (case sensitive)