Scan all pages with cURL

I need to access a site through cURL and capture content from its pages, but the site does not show everything on a single page; it splits the content across several pages and, at the bottom, shows a menu to navigate to page 1, 2, 3, 4. I need to walk through ALL of these pages in order to capture the content I want. How can I accomplish this?

The HTML of this "menu" (I forgot what it is called) is this:

<center><div class='wp-pagenavi'>
<span class='pages'>1 de 8</span><span class='current'>1</span><a class="page larger" href="/page/2/">2</a><a class="page larger" href="/page/3/">3</a><span class='extend'>...</span><a class="nextpostslink" rel="next" href="/page/2/">></a><a class="last" href="/page/8/">»</a>
</div></center>

In this case I would need to browse all 8 pages to get what I want. How can I do that?

  • What have you tried?

  • @Andre Ribeiro only logical

  • take a look at this answer: http://answall.com/questions/43729/pega-um-valor-dentro-do-html-curl?answertab=votes#tab-top

1 answer

You need to make a request for each page and capture the content of each one.

Assuming the start URL is: http://site.com/page/1

So we can create a class that "crawls" all the pages and drive it with a simple loop (the class first, the loop right after it):

<?php

/**
 * A simple crawler
 * By Rodrigo Nascimento
 * 
 */
set_time_limit(0);
error_reporting(E_ALL);

class SimpleCrawler {

    private $url;
    private $userAgent;
    private $httpResponse;
    private $chocolateCookie; // cookie file used by cURL

    function __construct() {
        $this->userAgent       = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0";
        $this->chocolateCookie = "chocolateCookies.txt";
    }

    /**
     * Sets the target URL
     * @param string $url
     * @return SimpleCrawler
     */
    public function setUrl($url) {
        $this->url = $url;
        return $this;
    }

    /**
     * GET request
     * @return SimpleCrawler
     */
    private function get(){
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $this->chocolateCookie); // read cookies from this file
        curl_setopt($ch, CURLOPT_COOKIEJAR, $this->chocolateCookie);  // and save them back when the handle is closed
        $this->httpResponse = curl_exec($ch);
        curl_close($ch);
        return $this;
    }

    /**
     * Gets the content of the request
     * @return SimpleCrawler
     */
    public function getPageContent() {
        // Here you can parse the page content using regex or whatever
        // method you prefer.
        echo "Page Content:\n\n",
             "{$this->httpResponse}\n\n";

        return $this;
    }

    /**
     * Navigates to the page specified by self::setUrl
     * @return SimpleCrawler
     */
    public function navigate() {
        echo "Visiting: {$this->url}\n";
        $this->get();

        return $this;
    }
}

/* Instance of our object, which is used through the following methods:
 * 
 * Set a URL:                           $simpleCrawler->setUrl('site');
 * Navigate to the given URL:           $simpleCrawler->navigate();
 * And finally access the request body: $simpleCrawler->getPageContent();
 * 
 */
$simpleCrawler = new SimpleCrawler;

// From here on we can run as many requests as we want.
// Since it is the same site, a simple loop is enough to do the navigation.
$pageNum = 8;

for ($i=1;$i<=$pageNum;$i++):
    $simpleCrawler->setUrl("http://site/page/{$i}")
                  ->navigate()
                  ->getPageContent();
endfor;
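
If you would rather not hardcode the total of 8 pages, a variation (not part of the original answer) is to read it from the wp-pagenavi block of the first page. A minimal sketch, assuming the markup keeps the "<span class='pages'>1 de 8</span>" format shown in the question:

<?php
// Sketch only: discover the page total from the wp-pagenavi block instead of
// hardcoding it. Assumes the "<span class='pages'>1 de 8</span>" markup from
// the question and the hypothetical start URL http://site.com/page/1.
$ch = curl_init("http://site.com/page/1");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$pageNum = 1; // fall back to a single page if the block is not found
if ($html !== false && preg_match('/<span class=[\'"]pages[\'"]>\s*\d+\s+de\s+(\d+)/i', $html, $m)) {
    $pageNum = (int) $m[1]; // "1 de 8" -> 8
}

The resulting $pageNum can then feed the same for loop shown above.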

That should be enough to accomplish the mission (:
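
To actually extract specific pieces of each page rather than echoing the raw HTML in getPageContent(), one option is DOM parsing. A minimal sketch, assuming you want post titles and that they live in h2 > a elements (that XPath query is only a guess about the theme's markup):

<?php
// Sketch only: pull post titles out of a page's HTML with DOMDocument/DOMXPath.
// The //h2/a query is an assumption about the WordPress theme; adjust it to
// whatever element really holds the content you want.
function extractTitles($html) {
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid XML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($dom);
    $titles = array();
    foreach ($xpath->query('//h2/a') as $link) {
        $titles[] = trim($link->textContent);
    }
    return $titles;
}

Inside the class you could call it as extractTitles($this->httpResponse) and return the array instead of echoing the whole response.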
