Do not use regex to read XML/HTML/any-other-ML (see here and here for more details; there is also a brief explanation at the end).
Anyway, if you're dealing with XML, a better option is to use a dedicated library such as DOMDocument.
But since this XML is not well-formed (it is really just "loose tags", with no root element enclosing them all), the "ugly" solution (though one that Stack Overflow itself suggests) is to read it as HTML and call libxml_use_internal_errors(true); to suppress the errors complaining that it is malformed.
$xml = <<<TEXTO
<programme start="20210129023700 +0000" stop="20210129030100 +0000" channel="Foodnetworkhd.br">
<title lang="pt">Loucos por Churrasco - S3 E12 - Churrasco Tropical<\/title>
<desc lang="pt">Bobby Flay leva seu churrasco para as ilhas caribenhas com costela de porco grelhada, salada de radicchio e manga verde, além de batatas-doces grelhadas no estilo Hasselback e o coquetel Dark and Stormy. (n)</desc>
</programme>
TEXTO;
// read the XML
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($xml);
// look for the "programme" tags
foreach ($dom->getElementsByTagName('programme') as $value) {
    // extract the data from each one
    $nome = $value->nodeName; // tag name
    $start = $value->getAttribute('start'); // start attribute
    $stop = $value->getAttribute('stop'); // stop attribute
    $channel = $value->getAttribute('channel'); // channel attribute
    // use the values...
}
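If you want to see what is being suppressed rather than silently dropping it, the errors collected by libxml can be read back afterwards. This is only a minimal sketch: the $fragment value is a made-up, shortened version of the data, and which messages (if any) get reported depends on the libxml2 version installed.

```php
<?php
// Hypothetical fragment: a loose tag with no root element, as in the question.
$fragment = '<programme start="20210129023700 +0000" channel="Foodnetworkhd.br"></programme>';

// Collect parser errors internally instead of emitting PHP warnings.
// The call returns the previous setting, so it can be restored later.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($fragment);

// Each entry is a LibXMLError object; the list may be empty on some versions.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                 // empty the error buffer
libxml_use_internal_errors($previous); // restore the previous setting
```

Clearing the buffer and restoring the previous setting keeps the suppression local to this one parse, which matters if other code in the same request relies on libxml warnings.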
And why is regex not a good idea?
For a more detailed explanation, follow the links cited at the beginning. But just to give a few examples, your regex would only work if the attributes start, stop and channel appeared in exactly this order.
If you change the order of the attributes, it no longer works. If the tag is spread across multiple lines, it no longer works either. And if the tag is commented out, the regex would still match it (whereas DOMDocument would correctly ignore it).
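Two of these failure modes can be reproduced with a small, made-up input: one real tag whose attributes are reordered, and one tag that is commented out. The sketch below reuses the regex that appears later in this answer; the sample data and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical input: a real tag with "channel" first, plus a commented-out tag.
$xml = implode("\n", [
    '<programme channel="Foodnetworkhd.br" start="20210129023700 +0000" stop="20210129030100 +0000">',
    '</programme>',
    '<!-- <programme start="1" stop="2" channel="old">',
    '</programme> -->',
    '',
]);

// Regex approach: expects start, stop, channel in exactly that order.
$regex = '/<(\w+)\s+(start="(?<dt_con>.*?)".+stop="(?<ed_con>.*?)".+channel="(?<ch_name>.*?)".+\n)/i';
preg_match_all($regex, $xml, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    // The real tag is skipped (wrong attribute order) and the commented one is matched.
    echo "regex: ", $match['ch_name'], "\n";
}

// Parser approach: attribute order is irrelevant and comments are not elements.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($xml);
libxml_clear_errors();

$tags = $dom->getElementsByTagName('programme');
foreach ($tags as $tag) {
    echo "dom: ", $tag->getAttribute('channel'), "\n";
}
```

With this input the regex should report only the channel of the commented-out tag, while the parser should report only the channel of the real one.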
Here is a more detailed example (although it is about HTML, the same concerns apply).
Anyway, regex is nice, but it is not always the best solution.
Just so this answer is not left incomplete, to get the results of preg_match_all, do:
if (preg_match_all('/<(\w+)\s+(start="(?<dt_con>.*?)".+stop="(?<ed_con>.*?)".+channel="(?<ch_name>.*?)".+\n)/i', $xml, $channels, PREG_SET_ORDER)) {
    foreach ($channels as $match) {
        $nome = $match[1]; // tag name (first capture group)
        $start = $match['dt_con'];
        $stop = $match['ed_con'];
        $channel = $match['ch_name'];
        // use the values...
    }
}
Since you used named groups, just use the names (dt_con, ed_con, etc.) to get your values. I also modified the regex to capture the tag name.
But as I said, this regex is "naive" and prone to failure (for all the reasons mentioned above and in the links already cited). Prefer DOMDocument or any other XML parser.
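As one example of "any other XML parser", SimpleXML can also read this kind of data, provided the loose tags are first wrapped in a root element. The sketch below is an assumption-heavy illustration: the `<root>` wrapper is made up, the second programme entry is invented sample data, and it only works if each fragment is well-formed on its own (an escaped closing tag such as `<\/title>` would still make the parse fail).

```php
<?php
// Hypothetical fragments: two loose <programme> tags with no root element.
$fragment = '<programme start="20210129023700 +0000" stop="20210129030100 +0000" channel="Foodnetworkhd.br">'
    . '<title lang="pt">Loucos por Churrasco</title></programme>'
    . '<programme start="20210129030100 +0000" stop="20210129033000 +0000" channel="Foodnetworkhd.br">'
    . '<title lang="pt">Outro programa</title></programme>';

// Assumption: wrapping the fragments in a made-up <root> element is enough
// to turn them into a well-formed XML document.
$root = simplexml_load_string('<root>' . $fragment . '</root>');
if ($root === false) {
    throw new RuntimeException('The wrapped fragment is still not well-formed XML');
}

foreach ($root->programme as $programme) {
    // Attributes are read with array syntax, child elements with property syntax.
    $start = (string) $programme['start'];
    $stop = (string) $programme['stop'];
    $channel = (string) $programme['channel'];
    $title = (string) $programme->title;
    echo "$channel: $title ($start to $stop)\n";
}
```

Unlike the HTML fallback, this approach parses the data strictly as XML, so a malformed fragment produces a failure instead of a best-effort guess.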
Have you tried using DOMDocument or SimpleXML? – Wallace Maxters