Crawler fails when http_status_code is different from 200




I'm building a small crawler in PHP, using a library called "PHPCrawl" to do the crawling and the "simple_html_dom_parser" library to parse the HTML. The problem: simple_html_dom cannot parse the document when http_status_code (a property provided by PHPCrawl) is anything other than 200, and the script stops with:

Fatal error: Call to a member function find() on boolean in C:\xampp\htdocs\PHP\Crawler\modules\admin\controllers\Crawler.php on line 14

PHP code:

/* Connection settings */

require_once '../../../library/PHPCrawl_083/libs/PHPCrawler.class.php';
require_once '../../../library/Simple_HTML_DOM/simple_html_dom.php';

//Extend the Class and Override the handleDocumentInfo() Method
class Crawler extends PHPCrawler{
    function handleDocumentInfo($DocInfo){
        echo '*******************************'.'<br />';
        //Print Page Title
        $html = str_get_html($DocInfo->content);
        $title = $html->find('title');
        echo $title[0]->plaintext.'<br />';

        //Print the URL and the HTTP-status-Code 
        echo 'Page requested: '.$DocInfo->url.' ('.$DocInfo->http_status_code.')'.'<br />';

        //Print the referring URL
        echo 'Referer-page: '.$DocInfo->referer_url.'<br />';
        echo '*******************************'.'<br />';

        //Print whether the content of the document was received or not
        if($DocInfo->received == true){
            echo "Content received: ".$DocInfo->bytes_received." bytes".'<br />';
        }else{
            echo "Content not received".'<br />';
        }
        echo '<br /><br />';
    }
}

$crawler = new Crawler();

//URL to crawl 

//Only receive content of files with content-type "text/html" 

//Ignore links to pictures, dont even request pictures 
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 

//Store and send cookie-data like a browser does 

//Set the traffic-limit to 1 MB (in bytes, 
//for testing we dont want to "suck" the whole site) 
$crawler->setTrafficLimit(1000 * 1024); 

//Set a depth limit 

//Crawler will search for links only in href attributes

//Crawler will search for links only inside <tags>

//Set timeout for establishing the connection

//Set timeout for the server to send data

//Start the Crawl process

// At the end, after the process is finished, we print a short 
// report (see method getProcessReport() for more information) 
$report = $crawler->getProcessReport();

echo "Summary:".'<br />'; 
echo "Links followed: ".$report->links_followed.'<br />'; 
echo "Documents received: ".$report->files_received.'<br />'; 
echo "Bytes received: ".$report->bytes_received." bytes".'<br />'; 
echo "Process runtime: ".$report->process_runtime." sec".'<br />';

Part of the output printed in the browser

PHP: Context options and parameters - Manual 
Page requested: (200)
Content received: 20056 bytes

Links followed: 27
Documents received: 23
Bytes received: 1034007 bytes
Process runtime: 69.525975942612 sec
  • I don't understand what the difficulty is. Just add an if/else conditional; the HTTP status value is in $DocInfo->http_status_code

  • As Daniel said, it is enough for handleDocumentInfo to check the value of $DocInfo->http_status_code, something like if ($DocInfo->http_status_code == 200 or $DocInfo->http_status_code == 302) {...}

  • If possible, post an answer with the solution to the problem. =)

  • @qmechanik, I paid no further attention to this question because I soon changed the architecture and the overall approach to the problem. Sorry for the inconvenience =D
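The check suggested in the comments could be pulled into a small helper so the list of accepted status codes lives in one place. This is only a sketch: `isParsableStatus` and its `$allowed` parameter are hypothetical names, not part of PHPCrawl, and the objects below are stand-ins for the `$DocInfo` that the crawler passes in. Only the `http_status_code` property comes from the question.

```php
<?php
// Hypothetical helper: decides whether a crawled document is worth parsing.
// Only $docInfo->http_status_code is taken from the question; the function
// name and the default allow-list are assumptions for illustration.
function isParsableStatus($docInfo, array $allowed = [200])
{
    // Cast to int so that "200" (string) and 200 (int) are treated alike.
    return in_array((int) $docInfo->http_status_code, $allowed, true);
}

// Stand-in objects instead of real PHPCrawl document-info objects.
$ok = new stdClass();
$ok->http_status_code = 200;

$missing = new stdClass();
$missing->http_status_code = 404;

var_dump(isParsableStatus($ok));                   // bool(true)
var_dump(isParsableStatus($missing));              // bool(false)
var_dump(isParsableStatus($missing, [200, 404]));  // bool(true)
```

Inside handleDocumentInfo() the call would presumably look like `if (isParsableStatus($DocInfo)) { ... }`, with the parsing code inside the braces.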

1 answer


As stated in the comments, it was only necessary to add a conditional on the desired status code; with that check in place it works.

function handleDocumentInfo($DocInfo){
    if ($DocInfo->http_status_code == 200){
        echo '*******************************'.'<br />';
        //Print Page Title
        $html = str_get_html($DocInfo->content);
        $title = $html->find('title');
        echo $title[0]->plaintext.'<br />';

        //Print the URL and the HTTP-status-Code 
        echo 'Page requested: '.$DocInfo->url.' ('.$DocInfo->http_status_code.')'.'<br />';

        //Print the referring URL
        echo 'Referer-page: '.$DocInfo->referer_url.'<br />';
        echo '*******************************'.'<br />';

        //Print whether the content of the document was received or not
        if($DocInfo->received == true){
            echo "Content received: ".$DocInfo->bytes_received." bytes".'<br />';
        }else{
            echo "Content not received".'<br />';
        }
        echo '<br /><br />';
    }
}
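The fatal error comes from calling find() on the `false` that str_get_html() returns when it cannot build a DOM (for an empty string, for example). A status check alone may therefore miss some cases, such as a 200 response with an empty body. A more defensive variant could also check the parser result. The sketch below is an assumption-laden illustration, not the asker's final code: `extractTitle` is a hypothetical helper, and it takes the parser as a callable so that it runs here without the library. `$fakeParser` is a stand-in that only understands `title`.

```php
<?php
// Hypothetical helper: returns the page title, or null when the content
// cannot be parsed. $parse is any callable that returns an object with a
// find() method, or false on failure (as str_get_html() does).
function extractTitle($content, callable $parse)
{
    if (!is_string($content) || $content === '') {
        return null; // nothing to parse
    }

    $html = $parse($content);
    if ($html === false || $html === null) {
        return null; // parser gave up; do not call find() on it
    }

    $titles = $html->find('title');
    if (empty($titles)) {
        return null; // document has no <title>
    }

    return trim($titles[0]->plaintext);
}

// Stand-in parser (an assumption, not simple_html_dom): only finds <title>.
$fakeParser = function ($content) {
    return new class($content) {
        private $content;
        public function __construct($content) { $this->content = $content; }
        public function find($selector) {
            if (preg_match('#<title>(.*?)</title>#is', $this->content, $m)) {
                $node = new stdClass();
                $node->plaintext = $m[1];
                return [$node];
            }
            return [];
        }
    };
};

// Stand-in parser that always fails, mimicking a false return value.
$failingParser = function ($content) {
    return false;
};

var_dump(extractTitle('<title> Hi </title>', $fakeParser)); // string(2) "Hi"
var_dump(extractTitle('<html></html>', $failingParser));    // NULL
var_dump(extractTitle('', $fakeParser));                    // NULL
```

In the crawler itself, the call would presumably be `extractTitle($DocInfo->content, 'str_get_html')`, printing the title only when the result is not null.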
