Content tracker for external pages with PHP

Question


Asked 11 years, 5 months ago


I have been tasked with writing a script that captures the price, image and content of products from a set of sites chosen by the application's administrator. Each of these sites has a different structure, and the script needs to scan every product category page and sub-page under a given address (e.g. /shirts/, /shirts/black, /shirts/blue). My first thought was to use PHP's DOMXPath together with cURL to locate the product-related areas of each page, but that doesn't seem like the right way to do it.

Could you tell me where to start and what to use to build something like this?

That does seem to be the right way. A headless browser would be better, or better still, ask the sites you visit for permission and see whether they can provide an API.

– Gustavo Rodrigues

2014/03/06 at 19:07
In this case it would be done with permission from the sites visited. However, I can't count on APIs being available, so it really does have to be a tracker.

– Rafael Alexandre

2014/03/06 at 19:14
Not even something provided by the framework they use? An XML output? If their markup is optimized for JavaScript-free users, use an HTML parser; if not, use a headless browser.

– Gustavo Rodrigues

2014/03/06 at 19:17
Either way, you will have to understand the structure of each site and then write the libraries that read the data you need. With these limitations it will be really tricky. Search for "web content miner" and see if there's one you can use.

– Erlon Charles

2014/03/06 at 19:31

1 answer


by Leonel Sanches da Silva • **88,623** points · Answer 1 · 2014-03-06T20:03:18+00:00

What you actually want to build is a web crawler.

There is a PHP library for building web crawlers:

http://phpcrawl.cuab.de/

An example adapted from the main website:

<?php 

// A crawl can take a long time, so raise the script's time limit (in seconds)
set_time_limit(10000);

// Include the main class
include("libs/PHPCrawler.class.php");

// Extend the main class and override the handleDocumentInfo() method
class MyCrawler extends PHPCrawler  
{ 
  function handleDocumentInfo($DocInfo)  
  { 
    // Choose the line break for the output ("\n" in CLI mode, "<br />" otherwise)
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP status code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print whether the document's content was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // The page content is available in $DocInfo->source

    echo $lb; 

    flush(); 
  }  
} 

// Create an instance of your class, configure the crawler's behavior
// and start the process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive documents with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to images
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$#i");

// Store cookies
$crawler->enableCookieHandling(true);

// Download only about 1 megabyte of the site (no need to fetch everything)
$crawler->setTrafficLimit(1000 * 1024);

// Everything is set up, so just call the go() method
$crawler->go();

// To print a report about the process, use the method below
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n"; 
else $lb = "<br />"; 

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>
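
The crawler only fetches the pages; the price, image and name still have to be pulled out of `$DocInfo->source`. Since each site has its own structure, one possible approach is to keep a set of XPath expressions per site and apply them with PHP's DOMDocument and DOMXPath. The following is only a minimal sketch under that assumption: the `$siteRules` array, the `extractProducts()` helper, the XPath expressions and the sample markup are all hypothetical and are not part of PHPCrawl.

```php
<?php
// Hypothetical per-site rules: these XPath expressions are illustrative
// assumptions and must be written against each target site's real markup.
$siteRules = [
    'example-shop' => [
        'item'  => '//div[@class="product"]',   // one node per product
        'name'  => './/h2',                     // relative to the product node
        'price' => './/span[@class="price"]',
        'image' => './/img/@src',
    ],
];

/**
 * Extracts product data from an HTML string using one site's XPath rules.
 * Returns a list of arrays with the keys "name", "price" and "image".
 */
function extractProducts(string $html, array $rules): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them, then discard them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $products = [];

    foreach ($xpath->query($rules['item']) as $item) {
        $product = [];
        foreach (['name', 'price', 'image'] as $field) {
            // Evaluate the field's expression relative to the product node
            $node = $xpath->query($rules[$field], $item)->item(0);
            $product[$field] = $node ? trim($node->nodeValue) : null;
        }
        $products[] = $product;
    }

    return $products;
}

// Sample markup standing in for the contents of $DocInfo->source
$html = '<html><body>
  <div class="product"><h2>Black shirt</h2><span class="price">49.90</span><img src="/img/black.jpg"></div>
  <div class="product"><h2>Blue shirt</h2><span class="price">39.90</span><img src="/img/blue.jpg"></div>
</body></html>';

$products = extractProducts($html, $siteRules['example-shop']);
print_r($products);
```

Inside `handleDocumentInfo()` you would call something like `extractProducts($DocInfo->source, $rules)` with the rules for the site being crawled and store the result instead of printing it. Keeping the expressions in configuration rather than in code means a new site only needs a new set of rules.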