I need to retrieve the information from a page. How can I continue what I started?

<?php 

header('Content-Type: text/html; charset=utf-8');

$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, 'http://www.cidades.ibge.gov.br/xtras/uf.php?lang=&coduf=17&search=tocantins');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$conteudo = curl_exec ($ch);
curl_close($ch);

highlight_string($conteudo);

?>

All the content of the page I am retrieving is in $conteudo. Inside that <ul> there will be tens or hundreds of results, and I need to capture them all.

<ul id="lista_municipios">
    <li id="">
        <a href="perfil.php?lang=&codmun=170025&search=tocantins|item1">item2</a>
    </li>
    <li>....
    <li>....
</ul>

I need to get item1 and item2.

  • 1

    If I understand correctly, you can use phpQuery; it emulates jQuery, but in PHP...

  • 1

    Do you think I can reach the goal more easily using phpQuery?

  • 1

    It is one of the alternatives; I think the choice is largely a matter of personal preference. I will post an example that I used for a ZIP code lookup for you to analyze...

1 answer

1


Here is an example that uses phpQuery-onefile for a ZIP code lookup. The cURL part is not included, since the focus is on using phpQuery; this is one of several possible solutions.

phpQuery: https://code.google.com/p/phpquery/

$body = $client->send($request)->getBody(); // This would be your HTML
// Load phpQuery if it has not been loaded yet
if (!method_exists('phpQuery', 'newDocumentHTML')) {
    require_once __DIR__ . DIRECTORY_SEPARATOR . 'phpQuery-onefile.php';
}
// Initialize the document; replace $body with your variable containing the HTML
$doc = \phpQuery::newDocumentHTML($body, 'utf-8');
$resultados = [];
// Iterate over the table rows
foreach (\phpQuery::pq('table[cellpadding="5"]')->find('tr') as $linha) {
    $dados = [];
    foreach (\phpQuery::pq($linha)->find('td') as $coluna) {
        // Collapse whitespace and decode HTML entities in each cell
        $valor = htmlspecialchars_decode(trim(preg_replace('/\s+/', ' ', \phpQuery::pq($coluna)->html())));
        $dados[] = $valor;
    }
    $dadosFinal = [];
    $dadosFinal['logradouro'] = $dados[0];
    $dadosFinal['bairro'] = $dados[1];
    $dadosFinal['localidade'] = $dados[2];
    $dadosFinal['uf'] = $dados[3];
    $dadosFinal['cep'] = $dados[4];
    $resultados[] = $dadosFinal;
}
return $resultados;

Applied to your case, it would look something like this:

// Load phpQuery if it has not been loaded yet
if (!method_exists('phpQuery', 'newDocumentHTML')) {
    require_once __DIR__ . DIRECTORY_SEPARATOR . 'phpQuery-onefile.php';
}
// Initialize the document; replace $body with your variable containing the HTML (here, $conteudo)
$doc = \phpQuery::newDocumentHTML($body, 'utf-8');

// Select the <a> inside each <li>: the href and the text both live on the link, not on the <li>
foreach (\phpQuery::pq('ul#lista_municipios')->find('li a') as $linha) {
    $valor = htmlspecialchars_decode(\phpQuery::pq($linha)->html()); // item2 (link text)
    $valorAttr = htmlspecialchars_decode(\phpQuery::pq($linha)->attr('href')); // full href value
    $item1 = explode('|', $valorAttr)[1]; // item1; $valorAttr is kept in case you need it
}

It will still need testing and adapting to your case.
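If adding a library is not an option, the same extraction can be sketched with PHP's built-in DOM extension instead of phpQuery. This is a swapped-in technique, not the approach used above. The function name extrairMunicipios and the variable $amostra are made up for illustration, and the sample markup is copied from the question rather than from the real page:

```php
<?php
// Sketch only: pulls item1 (the part of the href after "|") and item2
// (the link text) from every <a> in ul#lista_municipios.
// extrairMunicipios is a hypothetical name, not part of any library.
function extrairMunicipios(string $html): array
{
    $doc = new DOMDocument();
    // Scraped HTML is rarely valid; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $resultados = [];
    foreach ($xpath->query('//ul[@id="lista_municipios"]/li/a') as $link) {
        // Split on the first "|" only; item1 is null when there is no "|".
        $partes = explode('|', $link->getAttribute('href'), 2);
        $resultados[] = [
            'item1' => $partes[1] ?? null,
            'item2' => trim($link->textContent),
        ];
    }
    return $resultados;
}

// Assumption: in real use this would be $conteudo from curl_exec().
$amostra = '<ul id="lista_municipios">'
    . '<li id=""><a href="perfil.php?lang=&codmun=170025&search=tocantins|item1">item2</a></li>'
    . '</ul>';
print_r(extrairMunicipios($amostra));
```

Each array element holds one municipality, so the loop that consumes the result does not need to know anything about the HTML structure.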

  • 1

    That's cool... but I didn't understand any of it. For instance: where does it start reading the variable $conteudo, which holds all the data from the page?

  • 1

    The $body variable holds all the HTML content of the page. The newDocumentHTML line initializes the phpQuery document with the HTML passed in $body. I have just edited the answer to comment more clearly on the line where everything is initialized. Sorry if I did not make myself clear...

  • As shown here http://pastebin.com/vwJyg2bi, I was able to count how many records there are inside this ul. I had it search inside each li for the hyperlinks (<a>), which is where all the content that needs to be recovered is, in this case item1 and item2. Could you give me a bit more help recovering item1 and item2?

  • @Marcosvinicius Try something like: $valor = htmlspecialchars_decode(\phpQuery::pq($linha)->html()); to get the link's value back, and $valorAttr = htmlspecialchars_decode(\phpQuery::pq($linha)->attr('href')); to recover the value of href. Then apply the filter to it to extract item1.

  • Almost there: I get perfil.php?lang=&codmun=170025&search=Tocantins|Abreulandia ... to get Abreulandia, will I need a regex?

  • 1

    @Marcosvinicius You can use explode() and take the second element of the array.
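A minimal sketch of that explode() suggestion, using the href quoted in the comment above. The helper name extrairItem1 is hypothetical, and the null fallback for an href without "|" is an added assumption:

```php
<?php
// Hypothetical helper: returns whatever follows the first "|" in the href,
// or null when the separator is missing.
function extrairItem1(string $href): ?string
{
    // The limit of 2 keeps any later "|" characters inside the second element.
    $partes = explode('|', $href, 2);
    return $partes[1] ?? null;
}

$href = 'perfil.php?lang=&codmun=170025&search=Tocantins|Abreulandia';
echo extrairItem1($href), PHP_EOL; // Abreulandia
```

No regex is needed here because the separator is a single fixed character.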
