Content tracker for external pages with PHP

I have been tasked with creating a script that captures the price, image, and content of products from sites chosen by the application's administrator. The structure of each of these sites is different, and the script needs to scan every page related to a product category, including subpages of a given address (e.g.: /shirts/, /shirts/black, /shirts/blue). At first I thought I could do this with PHP's DOMXPath + cURL (sketched below), searching each page for the product-related areas, but that doesn't seem like the right way.

Could you tell me where to start and what to use to build something like this?
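For reference, the DOMXPath + cURL approach mentioned above would look roughly like this. A minimal sketch: the URL and every XPath selector here are hypothetical and would have to be adapted to each target site's markup.

<?php
// Fetch the category page with cURL (placeholder URL).
$ch = curl_init("https://example.com/shirts/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML; real-world pages are rarely valid XML,
// so suppress the parser warnings.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical selectors -- each site needs its own.
foreach ($xpath->query("//div[@class='product']") as $product) {
    $name  = $xpath->query(".//h2", $product)->item(0);
    $price = $xpath->query(".//span[@class='price']", $product)->item(0);
    $img   = $xpath->query(".//img/@src", $product)->item(0);

    echo ($name  ? trim($name->textContent)  : "?") . " | " .
         ($price ? trim($price->textContent) : "?") . " | " .
         ($img   ? $img->nodeValue           : "?") . "\n";
}
?>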

  • That is the right general approach. It would be better to use a headless browser, or better still, to ask the websites for permission and hope they offer an API.

  • It will be done with the websites' permission; I just can't count on them providing APIs, so it really does need to be a crawler.

  • Not even something from the framework they use? An XML feed? If they serve pages optimized for JavaScript-free users, use an HTML parser; if not, use a headless browser (see the sketch after these comments).

  • Either way, you will have to understand the structure of each site and then write the code that reads the data you need. With these constraints it will be genuinely tricky. Search for "web content miner" tools and see if there is one you can use.
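If the headless-browser route from the comments is taken, one option in the PHP ecosystem is symfony/panther, which drives a real Chrome instance so JavaScript-rendered prices are visible. A minimal sketch, assuming composer require symfony/panther has been run; the URL and CSS selector are hypothetical:

<?php
require 'vendor/autoload.php';

use Symfony\Component\Panther\Client;

// Starts a headless Chrome via chromedriver.
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com/shirts/');

// By this point any client-side JavaScript has already run,
// so the rendered DOM can be queried directly.
foreach ($crawler->filter('div.product') as $node) {
    echo trim($node->textContent) . "\n";
}

$client->quit();
?>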

1 answer

What you actually want to create is a web crawler.

There is a PHP library for building web crawlers:

http://phpcrawl.cuab.de/

Adapting the example from the project's main page:

<?php

// Maximum running time of the crawler
set_time_limit(10000);

// Include the main class
include("libs/PHPCrawler.class.php");

// Extend the main class and override the handleDocumentInfo() method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Detect the line break for the output ("\n" in CLI mode, "<br>" otherwise)
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP status code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print whether the document's content was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // The page content is available in $DocInfo->source

    echo $lb;

    flush();
  }
}

// Create an instance of your class, set the crawler's behavior
// and start the process.

$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only crawl documents with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore images
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store cookies
$crawler->enableCookieHandling(true);

// Download at most 1 megabyte from the site (no need to download everything)
$crawler->setTrafficLimit(1000 * 1024);

// If everything is OK, just call the go() method
$crawler->go();

// To print a report of the process, use the method below
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;
?>
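Since the question asks to scan only the pages under a category path (e.g. /shirts/ and its subpages), the crawl can be confined with a follow rule. A minimal sketch reusing the MyCrawler class above; the domain and path are placeholders, and it assumes PHPCrawl's addURLFollowRule() behaves as its documentation describes:

<?php
set_time_limit(10000);
include("libs/PHPCrawler.class.php");

// Assumes the MyCrawler class from the example above is defined.
$crawler = new MyCrawler();

// Start inside the category... (placeholder domain and path)
$crawler->setURL("www.example.com/shirts/");

// ...and only follow links that stay under the /shirts/ path.
$crawler->addURLFollowRule("#^http://www\.example\.com/shirts/# i");

$crawler->go();
?>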
  • Gypsy Morrison Mendez, can I grab images with this lib?

  • Yes. This crawler can download images; in the example I just made it ignore them, as for a crawler that deals only with text. See the sketch below.
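For completeness, a sketch of a crawler that stores images instead of ignoring them. It assumes $DocInfo->content_type holds the response's MIME type ($DocInfo->source is the body property the example above already uses), and the images/ directory is a placeholder that must exist:

<?php
set_time_limit(10000);
include("libs/PHPCrawler.class.php");

class ImageCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Save anything served with an image MIME type (e.g. "image/jpeg") to disk.
    if ($DocInfo->received && strpos($DocInfo->content_type, "image/") === 0)
    {
      file_put_contents("images/" . md5($DocInfo->url), $DocInfo->source);
    }
  }
}

$crawler = new ImageCrawler();
$crawler->setURL("www.example.com");

// Receive HTML (so links keep being followed) and images (to store them).
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#image/#");

$crawler->go();
?>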
