Error while retrieving content from external website

I have code that fetches content from an external site, in this case G1. It works perfectly, but it brings the page's CSS along with it, so I cannot restyle the content to match my own site. While searching I found another approach that returns the data without formatting, which gives me the freedom to handle it however I like. However, when I request a long article it does not load and gives me the following error:

Notice: Undefined offset: 2 in C:\xampp\htdocs\ruralrio\blog\2.php on line 15

My code is this:

<?php

$url_base = "http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html";
$texto = preg_replace("/((\r\n|\t)+|\s{2,})/", "",
file_get_contents($url_base));

preg_match('/<title>(.*)<\/title>/i', stripslashes($texto), $titulo);
preg_match('/<h1 class="entry-title">(.*)<\/h1>/i', stripslashes($texto), $titulomateria);
preg_match('/<h2>(.*)<\/h2>/i', stripslashes($texto), $titulomateria2);
preg_match('/<div class="materia-conteudo entry-content" id="materia-letra">(.*)<\/div>/i', stripslashes($texto), $titulomateria3);

echo strip_tags($titulo[1]) . "<br /><br />";
echo strip_tags($titulomateria[1]) . "<br /><br />";
echo strip_tags($titulomateria2[1]) . "<br /><br />";
echo strip_tags($titulomateria3[2]) . "<br /><br />";

?>

1 answer

Possible problems:

  1. file_get_contents is not allowed to access external URLs. To fix it:

    Edit php.ini and change allow_url_fopen=0 to allow_url_fopen=1 (http://php.net/manual/en/filesystem.configuration.php)

  2. file_get_contents sends no User-Agent header by default, and some servers reject such requests, so pass a context that sets one, something like:

    $headers = array(
        'Accept-language: pt-br',
        'User-Agent: ' . $_SERVER['HTTP_USER_AGENT']
    );
    
    $opts = array(
        'http'=>array(
            'method' => 'GET',
            'header' => implode("\r\n", $headers)
        )
    );
    
    $context = stream_context_create($opts);
    
    $texto = file_get_contents('http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html', false, $context);
    
  3. Instead of using preg_match try using DOM, for example:

    $doc = new DOMDocument();
    
    // Enable internal libxml error handling so parse warnings are not printed
    $libxml_previous_state = libxml_use_internal_errors(true);
    
    // Parse the HTML string
    $doc->loadHTML($texto);
    
    // Clear the collected errors
    libxml_clear_errors();
    
    // Restore the previous error-handling state
    libxml_use_internal_errors($libxml_previous_state);
    

    Source: https://stackoverflow.com/a/17559716/1518921

    Then use methods such as getElementsByTagName and getElementById, or DOMXPath for more convenient queries.
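
Points 1 and 2 can also be checked in code before any parsing happens. This is only a minimal sketch: the helper name fetch_html and the placeholder User-Agent string are assumptions for illustration, not part of the original code. It returns null when allow_url_fopen is disabled for a remote URL or when the request fails, so the parsing step never runs on an empty value.

```php
<?php
// Hypothetical helper (name and default User-Agent are illustrative assumptions).
function fetch_html(string $url, string $userAgent = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)'): ?string
{
    // allow_url_fopen only matters for remote URLs, not for local files.
    $isRemote = (bool) preg_match('#^https?://#i', $url);
    if ($isRemote && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }

    // HTTP header lines are separated by CRLF.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "Accept-Language: pt-br\r\nUser-Agent: {$userAgent}\r\n",
            'timeout' => 10,
        ],
    ]);

    // file_get_contents returns false on failure; @ hides the warning so the
    // caller can handle the null result instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

A caller would then test the result, for example `if (($texto = fetch_html($url_base)) === null) { /* show a message and stop */ }`, before handing `$texto` to DOMDocument.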

The final code should look something like:

<?php
$headers = array(
    'Accept-language: pt-br',
    'User-Agent: ' . $_SERVER['HTTP_USER_AGENT']
);

$opts = array(
    'http'=>array(
        'method' => 'GET',
        'header' => implode("\r\n", $headers)
    )
);

$context = stream_context_create($opts);

$texto = file_get_contents('http://g1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-mcdonalds.html', false, $context);

$doc = new DOMDocument();

// modify state
$libxml_previous_state = libxml_use_internal_errors(true);

// parse
$doc->loadHTML($texto);

// clear the collected errors
libxml_clear_errors();

// restore
libxml_use_internal_errors($libxml_previous_state);

$tmp = $doc->getElementsByTagName('title');

foreach ($tmp as $value) {
    echo 'Titulo:', $value->nodeValue, '<br>';
}

$xpath = new DOMXPath($doc);

$tmp = $xpath->query('//h1[contains(@class,"entry")]');

foreach ($tmp as $value) {
    echo 'h1.entry:', $value->nodeValue, '<br>';
}

$tmp = $doc->getElementsByTagName('h2');

foreach ($tmp as $value) {
    echo 'h2:', $value->nodeValue, '<br>';
}

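$container = $doc->getElementById('materia-letra');

// getElementById returns null when the id is not present in the page
if ($container !== null) {
    foreach ($container->getElementsByTagName('div') as $value) {
        echo '#materia-letra:', $value->nodeValue, '<br>';
    }
}
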
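
As for the original notice, two things in the preg_match version are likely contributors. The pattern has a single capture group, so preg_match only fills indexes 0 and 1, and reading index 2 always raises "Undefined offset: 2". In addition, "." does not match a newline unless the s modifier is used, so an article body spanning several lines may not match at all. The sketch below uses a made-up HTML fragment in place of the downloaded page to show both effects:

```php
<?php
// Hypothetical sample standing in for the downloaded page.
$html = '<div class="materia-conteudo entry-content" id="materia-letra">' . "\n"
      . '<p>First paragraph.</p>' . "\n"
      . '<p>Second paragraph.</p>' . "\n"
      . '</div>';

$pattern = '/<div class="materia-conteudo entry-content" id="materia-letra">(.*)<\/div>/i';

// Without the "s" modifier "." stops at "\n", so the multi-line body does not match.
$withoutS = preg_match($pattern, $html, $m1);

// With "s" added the body matches; the captured text is stored at index 1.
$withS = preg_match($pattern . 's', $html, $m2);

// One capture group means only indexes 0 and 1 exist; guard before reading.
$text = isset($m2[1]) ? trim(strip_tags($m2[1])) : '';

echo $text;
```

In the question's code that would mean reading `$titulomateria3[1]` rather than `$titulomateria3[2]`, and checking that the match succeeded before echoing it.
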
  • Good afternoon @guilhermenascimento, I tested the approach you gave me, but it produced another error and the rest of the page does not load: Parse error: syntax error, unexpected ''User-Agent: '' (T_CONSTANT_ENCAPSED_STRING), expecting ')' in C:\xampp\htdocs\ruralrio\blog\2.php on line 6

  • I still get the same error :(. What could it be? With a short text it works normally, but with a long text it no longer does. The other code, which I cannot customize, displays everything correctly and without failures, but the one I want to use to separate the content produces these errors.

  • I don't know what is going on here, I always get the same error, lol. It returns this: Economia - Justiça paulista suspende multa de R$ 3 milhões ao McDonald's Justiça paulista suspende multa de R$ 3 milhões ao McDonald's Recurso do Procon-SP foi recusado pelo Tribunal de Justiça de São Paulo.Entidade alega que rede veiculou comerciais abusivos do McLanche Feliz. Notice: Undefined offset: 2 in C:\xampp\htdocs\ruralrio\blog\2.php on line 27

  • Dude, I'm using just this code with nothing else in it. On the web server it does not raise the error, but it does not display the 4th item; locally it shows this error.

  • The page with only the text: [http://ruralrio.com.br/blog/2.php], and the page with all the tags and CSS: [http://ruralrio.com.br/blog/index.g1economia2.php?url=http://G1.globo.com/economia/noticia/2015/07/justica-paulista-suspende-multa-de-r-3-milhoes-ao-Mcdonalds.html]

  • I stayed with preg_match because switching to DOM produced more errors, since I do not know how to use DOM correctly.

  • preg_match is much more complex and is not recommended for this (at least most people agree). What error did DOM produce?

  • Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 151 in C:\xampp\htdocs\ruralrio\blog\2.php on line 19 Warning: DOMDocument::loadHTML(): ID frmBuscaScroll already defined in Entity, line: 237 in C:\xampp\htdocs\ruralrio\blog\2.php on line 19 Warning: DOMDocument::loadHTML(): ID menu-2-noticias already defined in Entity, line: 677 in C:\xampp\htdocs\ruralrio\blog\2.php on line 19 Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 4552 in C:\xampp\htdocs\ruralrio\blog\2.php on line 19

  • Now the errors are gone, but the page no longer renders; it comes up blank.

  • Just what I needed. Thank you very much, I will study how it works.

  • @Cristianocardososilva Do study it :) DOM is the way to go, lol
