preg_match_all does not return all values

I need to fetch the page title for each address listed in a .txt file, but I can’t get the correct output from preg_match_all: it always returns only the result for the last IP line of the file.

My code:

<?php  

$arquivo = fopen ("arquivo.txt", "r"); 
$num_linhas = 0; 

while (!feof ($arquivo)) {
    $linha=fgets($arquivo);
    if ($linha != "\n" && $linha != "") {
        $num_linhas++; $ultima = $linha;
    }
} 

fclose ($arquivo); 
$linhas = explode("\n", file_get_contents('arquivo.txt',null,null));
$ch = curl_init();
foreach ($linhas as $url) {
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_exec($ch);
    $file_contents = ob_get_contents();
    $resultado = preg_match_all('#<title>([^<\/]{1,})<\/title>#i', $file_contents, $matches);
}
ob_end_clean();
curl_close($ch);

print_r($matches);

When I run print_r($matches); on a list with only two IPs, I get just the last value, as shown below:

Array
(
    [0] => Array
        (
            [0] => <TITLE>Page Not Found</TITLE>
        )

)

Contents of the .txt file:

151.101.1.69
151.80.204.60

1 answer

This happens because of the parentheses in this part of the regex: ([^<\/]{1,}).

The parentheses form a capture group, and according to the documentation of preg_match_all, the groups are stored separately in the array of matches:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

That is, $matches[0] holds an array with everything matched by the whole regex, and $matches[1] holds the content captured by the first capture group, and so on. The groups are numbered in the order they appear in the regex; since yours has only one pair of parentheses, you will have only one capture group.
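
To make that layout concrete, here is a minimal sketch. The HTML string is made up for illustration and is not the content of the pages from the question:

```php
<?php
// Hypothetical input with two <title> tags, just to show how $matches is laid out.
$html = '<title>First page</title><p>text</p><TITLE>Second page</TITLE>';

// One pair of parentheses means one capture group.
$count = preg_match_all('#<title>([^<]+)</title>#i', $html, $matches);

echo $count, "\n";     // number of full matches found
print_r($matches[0]);  // full matches, including the <title> tags
print_r($matches[1]);  // only the text captured by the group
```

Here $matches[0] contains the two complete tags and $matches[1] contains only "First page" and "Second page".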

So you can either ignore $matches[1] or remove the capture group from your regex:

$file_contents = '<title>Fastly error: unknown domain 151.101.1.69</title>';
if (preg_match_all('#<title>[^<\/]+<\/title>#i', $file_contents, $matches)) {
    print_r ($matches);
}

I removed the parentheses and also replaced the quantifier {1,} with +, which is equivalent (both mean "one or more occurrences"). The output is:

Array
(
    [0] => Array
        (
            [0] => <title>Fastly error: unknown domain 151.101.1.69</title>
        )

)

But actually, if you’re manipulating HTML, you’d be better off using DOMDocument:

$file_contents = '<title>Fastly error: unknown domain 151.101.1.69</title>';
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$list = $dom->getElementsByTagName("title");
if ($list->length > 0) {
    $title = $list->item(0);
    // print the whole tag
    echo $dom->saveHTML($title); // <title>Fastly error: unknown domain 151.101.1.69</title>

    // get only the tag's content
    echo $title->textContent; // Fastly error: unknown domain 151.101.1.69
}

That’s because regex is not the best tool for manipulating HTML: in simple cases it may even "work", but it can also go badly wrong. In short, use the most suitable tool for each case; regex is not always the best solution.
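
Pages fetched from real servers are often not valid HTML, so a slightly more defensive version may help. This is only a sketch: extractTitle is a hypothetical helper name, and it assumes you want the first title only, with libxml warnings suppressed rather than printed.

```php
<?php
// Hypothetical helper: returns the text of the first <title>, or null if none is found.
function extractTitle(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $list = $dom->getElementsByTagName('title');
    if ($list->length === 0) {
        return null;
    }

    return trim($list->item(0)->textContent);
}

var_dump(extractTitle('<html><head><TITLE>Page Not Found</TITLE></head><body><p>Oops</body></html>'));
var_dump(extractTitle('<p>no title here</p>'));
```

Returning null when no title exists lets the caller tell "no title" apart from an empty title.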


Besides, you are calling preg_match_all inside a foreach loop but printing the result outside the loop. Each call overwrites $matches, so only the last result is printed. If you want to print the result of every call, put the print_r inside the loop:

foreach ( $linhas as $url ) {
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_exec($ch);
    $file_contents = ob_get_contents();
    if (preg_match_all('#<title>[^<\/]+<\/title>#i', $file_contents, $matches)) {
        print_r ($matches);
    }
}

Also note the if that checks whether preg_match_all found anything: if nothing matches, the code does not enter the if, because there is nothing to print.
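
If the goal is to keep every title rather than print inside the loop, one option is to accumulate the results in an array keyed by address. The sketch below is an illustration under assumptions, not the asker's code: the names extractTitles, collectTitles and fetchWithCurl are hypothetical, and it assumes CURLOPT_RETURNTRANSFER is set so that curl_exec returns the body as a string. The snippets above rely on ob_get_contents(), which returns false unless output buffering is already active.

```php
<?php
// Hypothetical helper: all <title> texts found in one HTML string.
function extractTitles(string $html): array
{
    if (preg_match_all('#<title>([^<]+)</title>#i', $html, $matches)) {
        return $matches[1]; // captured text only, without the tags
    }
    return [];
}

// Hypothetical helper: fetches each address with $fetch and keeps the titles per address.
function collectTitles(array $urls, callable $fetch): array
{
    $titles = [];
    foreach ($urls as $url) {
        $html = $fetch($url);
        // A failed request is recorded as an empty list instead of being skipped.
        $titles[$url] = is_string($html) ? extractTitles($html) : [];
    }
    return $titles;
}

// Hypothetical fetcher built on cURL; returns the body, or false on failure.
function fetchWithCurl(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
    $body = curl_exec($ch);
    if ($body === false) {
        // Log network problems (firewall, DNS, timeout) so an empty result can be explained.
        error_log($url . ': ' . curl_error($ch));
    }
    return $body;
}

// Offline demonstration with a stub fetcher; the page bodies are made up.
$fakePages = [
    '192.0.2.1' => '<title>First</title>',
    '192.0.2.2' => '<html><TITLE>Second</TITLE></html>',
];
$result = collectTitles(array_keys($fakePages), function ($url) use ($fakePages) {
    return $fakePages[$url];
});
print_r($result);
```

With the real file, the call might look like collectTitles(file('arquivo.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), 'fetchWithCurl'). Reading the list with file() and those flags also avoids passing empty lines or trailing "\n" characters to cURL.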

  • Thank you for your reply. Regarding the regex, I did have an error there, but it still prints only the title for the last value in the .txt file.

  • @Antôniofagundes Please edit the question and add the contents of the file. If it is too large, reduce it, but in a way that still reproduces the error.

  • Edited as requested.

  • @Antôniofagundes Well, I think the problem is that the print_r is outside the foreach. I updated the answer.

  • That way it doesn’t recognize any input; it doesn’t seem to enter the if at all. How is that possible?

  • @Antôniofagundes I don’t know, I can’t reproduce the same error here. Maybe it’s some other detail. Have you tried calling it directly with fixed values instead of reading from the file?

  • The same thing happens.

  • @Antôniofagundes Perhaps curl is not returning the whole page (it’s a guess, because I can’t tell what each IP serves). If the text contains a title, the regex returns it, so if the code does not enter the if, it is because there is no title in the text.

  • I added the IP addresses at the end of the question. Both pages have a title tag, but I get a blank result. I’m starting to think it’s my network’s firewall. Funny thing is that I can access them from the browser.

  • @Antôniofagundes So that must be the problem (something related to the network).
