preg_match_all catch tag and attributes

Question

Asked 4 years, 6 months ago

The response looks like this:

<programme start="20210129023700 +0000" stop="20210129030100 +0000" channel="Foodnetworkhd.br">
<title lang="pt">Loucos por Churrasco - S3 E12 - Churrasco Tropical<\/title>
<desc lang="pt">Bobby Flay leva seu churrasco para as ilhas caribenhas com costela de porco grelhada, salada de radicchio e manga verde, além de batatas-doces grelhadas no estilo Hasselback e o coquetel Dark and Stormy. (n)</desc>
</programme>

In PHP it’s like this:

{

// HERE IS THE QUESTION
  preg_match_all('/(start="(?<dt_con>.*?)".+stop="(?<ed_con>.*?)".+channel="(?<ch_name>.*?)".+\n)/i', $response, $channels, PREG_SET_ORDER);

}

I’m using foreach to process the data.

Have you tried using DOMDocument or SimpleXML?

– Wallace Maxters

2021/01/28 at 16:33

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2021-01-28T19:56:01+00:00

Do not use regex to read XML/HTML/any-other-ML (see here and here for more details, and at the end there is a brief explanation about this).

Anyway, if you're dealing with XML, a better option is to use a dedicated library, such as DOMDocument.

But since this XML is not well-formed (it is really just "loose tags", with no root element enclosing all of them), the "ugly" solution (which Stack Overflow itself suggests) is to read it as HTML and call libxml_use_internal_errors(true); to suppress the errors complaining that it is malformed.

$xml = <<<TEXTO
<programme start="20210129023700 +0000" stop="20210129030100 +0000" channel="Foodnetworkhd.br">
<title lang="pt">Loucos por Churrasco - S3 E12 - Churrasco Tropical<\/title>
<desc lang="pt">Bobby Flay leva seu churrasco para as ilhas caribenhas com costela de porco grelhada, salada de radicchio e manga verde, além de batatas-doces grelhadas no estilo Hasselback e o coquetel Dark and Stormy. (n)</desc>
</programme>
TEXTO;

// read the XML
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($xml);

// look for the "programme" tags
foreach ($dom->getElementsByTagName('programme') as $value) {
    // extract the data from each one
    $nome = $value->nodeName; // tag name
    $start = $value->getAttribute('start'); // start attribute
    $stop = $value->getAttribute('stop'); // stop attribute
    $channel = $value->getAttribute('channel'); // channel attribute

    // use the values...
}
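As an alternative to reading the data as HTML, one option is to wrap the loose tags in an artificial root element so they can be parsed as real XML. The following is only a minimal sketch of that different technique, not something taken from the question: the sample data is hypothetical, the root name "tv" is arbitrary, and it assumes the actual data uses ordinary closing tags (for example </title> rather than <\/title>) and contains no XML declaration.

```php
<?php
// Hypothetical input: two root-less <programme> fragments.
$fragments = <<<XML
<programme start="20210129023700 +0000" stop="20210129030100 +0000" channel="Foodnetworkhd.br">
<title lang="pt">Churrasco Tropical</title>
</programme>
<programme start="20210129030100 +0000" stop="20210129032500 +0000" channel="Foodnetworkhd.br">
<title lang="pt">Outro Programa</title>
</programme>
XML;

// Wrap the fragments in a synthetic root ("tv" is an arbitrary name)
// so the string becomes a single well-formed XML document.
$root = simplexml_load_string("<tv>{$fragments}</tv>");

$rows = [];
if ($root !== false) {
    foreach ($root->programme as $p) {
        // Attributes are read with array syntax, child elements as properties.
        $rows[] = [
            'start'   => (string) $p['start'],
            'stop'    => (string) $p['stop'],
            'channel' => (string) $p['channel'],
            'title'   => (string) $p->title,
        ];
    }
}
```

If simplexml_load_string returns false, the input was still not valid XML even with the wrapper, and the HTML approach above remains the fallback.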

And why is regex not a good idea?

For a more detailed explanation, follow the links already cited at the beginning. But just to give a few examples, your regex would only work if the attributes start, stop and channel were exactly in this order.

If the order of the attributes changes, it no longer works. If the tag is spread across multiple lines, it no longer works either. And if the tag is commented out, the regex would still match it (whereas DOMDocument would correctly ignore it).

Here is a more detailed example (although it is HTML, the same concerns apply).

Anyway, regex is nice, but it is not always the best solution.
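To make the attribute-order problem concrete, here is a small sketch that runs the regex from the question and DOMDocument on the same input. The input is a hypothetical variation of the question's data with channel moved to the first position; the variable names are only illustrative.

```php
<?php
// Hypothetical sample: same tag as in the question, but with the
// attributes in a different order (channel comes first).
$xml = '<programme channel="Foodnetworkhd.br" start="20210129023700 +0000" stop="20210129030100 +0000">' . "\n"
     . '<title lang="pt">Churrasco Tropical</title>' . "\n"
     . '</programme>' . "\n";

// The regex from the question expects start, then stop, then channel.
$regex = '/(start="(?<dt_con>.*?)".+stop="(?<ed_con>.*?)".+channel="(?<ch_name>.*?)".+\n)/i';
$regexCount = preg_match_all($regex, $xml, $channels, PREG_SET_ORDER);

// DOMDocument does not depend on the order of the attributes.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($xml);

$programmes = [];
foreach ($dom->getElementsByTagName('programme') as $node) {
    $programmes[] = [
        'start'   => $node->getAttribute('start'),
        'stop'    => $node->getAttribute('stop'),
        'channel' => $node->getAttribute('channel'),
    ];
}
libxml_clear_errors();

echo "regex matches: {$regexCount}\n";
echo "DOM matches: " . count($programmes) . "\n";
```

With this input the regex finds nothing, because no channel attribute appears after stop on the same line, while the DOM loop still finds the tag and its three attributes.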

For completeness, to get the results of preg_match_all, just do:

if (preg_match_all('/<(\w+)\s+(start="(?<dt_con>.*?)".+stop="(?<ed_con>.*?)".+channel="(?<ch_name>.*?)".+\n)/i', $xml, $channels, PREG_SET_ORDER)) {
    foreach ($channels as $match) {
        $nome = $match[1];
        $start = $match['dt_con'];
        $stop = $match['ed_con'];
        $channel = $match['ch_name'];

        // use the values...
    }
}

Since you used named groups, just use the names (dt_con, ed_con, etc.) to get the values. I also modified the regex to capture the tag name.

But as I said, this regex is "naive" and prone to failure (for all the reasons mentioned above and in the links already cited). Prefer DOMDocument or any other XML parser.