preg_match_all does not return all values

Question

Asked 5 years, 5 months ago

I need to get the page titles for the addresses listed in a .txt file, but I can't get the correct output with preg_match_all: it always returns only the result for the last IP line of the .txt file.

My code:

<?php  

$arquivo = fopen ("arquivo.txt", "r"); 
$num_linhas = 0; 

while (!feof ($arquivo)) {
    $linha=fgets($arquivo);
    if ($linha != "\n" && $linha != "") {
        $num_linhas++; $ultima = $linha;
    }
} 

fclose ($arquivo); 
$linhas = explode("\n", file_get_contents('arquivo.txt',null,null));
$ch = curl_init();
foreach ($linhas as $url) {
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_exec($ch);
    $file_contents = ob_get_contents();
    $resultado = preg_match_all('#<title>([^<\/]{1,})<\/title>#i', $file_contents, $matches);
}
ob_end_clean();
curl_close($ch);

print_r($matches);

When I run print_r($matches); with a list of only two IPs, it returns only the last value, as shown below:

Array
(
    [0] => Array
        (
            [0] => <TITLE>Page Not Found</TITLE>
        )

)

Contents of the .txt file:

151.101.1.69
151.80.204.60

1 answer


by hkotsubo • **55,826** points · Answer 1 · 2020-03-04T15:40:23+00:00

This happens because of the parentheses in this part of the regex: ([^<\/]{1,}).

The parentheses form a capture group, and according to the documentation of preg_match_all, the groups are stored separately in the matches array:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

That is, $matches[0] holds an array with every full match of the regex, $matches[1] holds the content captured by the first capture group, and so on. The groups are numbered in the order they appear in the regex; since yours has only one pair of parentheses, you will have only one capture group.
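To make that layout concrete, here is a minimal sketch using the regex from the question on a hard-coded string. The two titles are made up for illustration, and the default PREG_PATTERN_ORDER flag is assumed:

```php
<?php
// Sample input invented for illustration: two <title> tags in one string.
$html = '<title>First page</title><TITLE>Second page</TITLE>';

// One capture group, so $matches gets exactly two entries: [0] and [1].
$count = preg_match_all('#<title>([^<\/]{1,})<\/title>#i', $html, $matches);

print_r($matches[0]); // full matches, tags included
print_r($matches[1]); // only the text captured by the parentheses
```

Here $matches[0] contains both complete tags, while $matches[1] contains just "First page" and "Second page".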

So you can either ignore $matches[1] or remove the capture group from your regex:

$file_contents = '<title>Fastly error: unknown domain 151.101.1.69</title>';
if (preg_match_all('#<title>[^<\/]+<\/title>#i', $file_contents, $matches)) {
    print_r ($matches);
}

I removed the parentheses and also changed the quantifier {1,} to +; the two are equivalent (both mean "one or more occurrences"). The output is:

Array
(
    [0] => Array
        (
            [0] => <title>Fastly error: unknown domain 151.101.1.69</title>
        )

)
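If you want to convince yourself that {1,} and + really behave the same here, a quick check like the following should do. The variable names are just placeholders I picked:

```php
<?php
$subject = '<title>Fastly error: unknown domain 151.101.1.69</title>';

// Same pattern twice, differing only in the quantifier.
preg_match_all('#<title>[^<\/]{1,}<\/title>#i', $subject, $withBraces);
preg_match_all('#<title>[^<\/]+<\/title>#i', $subject, $withPlus);

// Both calls produce an identical matches array.
var_dump($withBraces === $withPlus); // bool(true)
```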

But actually, if you're manipulating HTML, you'd be better off using DOMDocument:

$file_contents = '<title>Fastly error: unknown domain 151.101.1.69</title>';
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$list = $dom->getElementsByTagName("title");
if ($list->length > 0) {
    $title = $list->item(0);
    // print the whole tag
    echo $dom->saveHTML($title); // <title>Fastly error: unknown domain 151.101.1.69</title>

    // get only the tag's content
    echo $title->textContent; // Fastly error: unknown domain 151.101.1.69
}

That's because regex is not the best tool for manipulating HTML: in simpler cases it may even "work", but terrible things can also happen. In short, use the most suitable tool for each case; regex is not always the best solution.
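As one illustration of how the regex can quietly fail, consider a page whose title tag carries an attribute and whose text contains a slash. The HTML below is a hypothetical example, not taken from either of the addresses in the question:

```php
<?php
// Hypothetical page: the <title> has an attribute and its text contains "/".
$html = '<html><head><title lang="en">Docs / Home</title></head><body></body></html>';

// The regex finds nothing: it requires a bare "<title>" and its character
// class [^<\/] rejects the "/" inside the text.
$found = preg_match_all('#<title>[^<\/]+<\/title>#i', $html, $matches);
var_dump($found); // int(0)

// DOMDocument parses the markup, so neither detail matters.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$titles = $dom->getElementsByTagName('title');
$text = $titles->length > 0 ? $titles->item(0)->textContent : null;
echo $text; // Docs / Home
```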

Besides, you are calling preg_match_all inside a foreach loop but printing the result outside the loop. Each call overwrites $matches, so only the last result gets printed. If you want to print the result of every call, put the print_r inside the loop:

foreach ( $linhas as $url ) {
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_exec($ch);
    $file_contents = ob_get_contents();
    if (preg_match_all('#<title>[^<\/]+<\/title>#i', $file_contents, $matches)) {
        print_r ($matches);
    }
}

Also note the if that checks whether preg_match_all found anything: when there is no match, the body of the if is skipped, since there would be nothing to print.
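If you would rather keep all the titles and print them once at the end, one option is to accumulate them in an array keyed by address. The sketch below is only a suggestion: extractTitle, fetchPage and collectTitles are names I made up, and fetchPage swaps the ob_get_contents approach for CURLOPT_RETURNTRANSFER, which makes curl_exec return the body as a string. The demo at the bottom uses canned responses (the strings that appear in this thread) instead of real requests, so it runs offline.

```php
<?php
// Return the text of the first <title> in $html, or null if there is none.
function extractTitle(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML rejects empty input
    }
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $list = $dom->getElementsByTagName('title');
    return $list->length > 0 ? trim($list->item(0)->textContent) : null;
}

// Fetch $url and return the response body, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Collect one title per address. $fetch is a parameter so it can be faked.
function collectTitles(array $urls, callable $fetch): array
{
    $titles = [];
    foreach ($urls as $url) {
        $url = trim($url);
        if ($url === '') {
            continue; // skip blank lines from the file
        }
        $html = $fetch($url);
        $titles[$url] = $html === null ? null : extractTitle($html);
    }
    return $titles;
}

// Offline demo with canned responses standing in for the real pages.
$fake = [
    '151.101.1.69'  => '<title>Fastly error: unknown domain 151.101.1.69</title>',
    '151.80.204.60' => '<TITLE>Page Not Found</TITLE>',
];
$titles = collectTitles(array_keys($fake), fn($url) => $fake[$url] ?? null);
print_r($titles);
```

With real requests you would pass 'fetchPage' as the second argument and the lines read from arquivo.txt as the first.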