I’m building a small crawler in PHP, using the "PHPCrawl" library for the crawling itself and the "simple_html_dom_parser" library to parse the HTML.
The problem is: simple_html_dom cannot parse the page when http_status_code (a value provided by PHPCrawl) is different from 200, which results in:
Fatal error: Call to a member function find() on boolean in C:\xampp\htdocs\PHP\Crawler\modules\admin\controllers\Crawler.php on line 14
PHP code:
<?php
/* Connection settings */
set_time_limit(10000);
require_once '../../../library/PHPCrawl_083/libs/PHPCrawler.class.php';
require_once '../../../library/Simple_HTML_DOM/simple_html_dom.php';

// Extend the class and override the handleDocumentInfo() method
class Crawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        echo '*******************************'.'<br />';

        // Print the page title
        $html = str_get_html($DocInfo->content);
        $title = $html->find('title');
        echo $title[0]->plaintext.'<br />';

        // Print the URL and the HTTP status code
        echo 'Page requested: '.$DocInfo->url.' ('.$DocInfo->http_status_code.')'.'<br />';

        // Print the referring URL
        echo 'Referer-page: '.$DocInfo->referer_url.'<br />';
        echo '*******************************'.'<br />';

        // Print whether the content of the document was received or not
        if ($DocInfo->received == true) {
            echo "Content received: ".$DocInfo->bytes_received." bytes".'<br />';
        } else {
            echo "Content not received".'<br />';
        }
        echo '<br /><br />';
    }
}

$crawler = new Crawler();

// URL to crawl
$crawler->setURL("http://php.net/docs.php");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");

// Ignore links to pictures, don't even request them
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Store and send cookie data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic limit to 1 MB (in bytes;
// for testing we don't want to "suck" down the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// Set a depth limit
$crawler->setCrawlingDepthLimit(2);

// Search for links only in href attributes
$crawler->setLinkExtractionTags(array("href"));

// Search for links only inside <tags>
$crawler->enableAggressiveLinkSearch(false);

// Timeout for establishing a connection
$crawler->setConnectionTimeout(60);

// Timeout for the server to send data
$crawler->setStreamTimeout(60);

// Start the crawl process
$crawler->go();

// At the end, after the process has finished, print a short
// report (see getProcessReport() for more information)
$report = $crawler->getProcessReport();
echo "Summary:".'<br />';
echo "Links followed: ".$report->links_followed.'<br />';
echo "Documents received: ".$report->files_received.'<br />';
echo "Bytes received: ".$report->bytes_received." bytes".'<br />';
echo "Process runtime: ".$report->process_runtime." sec".'<br />';
?>
Part of the output printed in the browser:
*******************************
PHP: Context options and parameters - Manual
Page requested: http://php.net/manual/en/context.php (200)
Referer-page: http://php.net
*******************************
Content received: 20056 bytes

Summary:
Links followed: 27
Documents received: 23
Bytes received: 1034007 bytes
Process runtime: 69.525975942612 sec
I don’t see where the difficulty is. Just add a conditional (if/else) — the status value is in $DocInfo->http_status_code.
– Daniel Omine
As Daniel said, it is enough to check the value of $DocInfo->http_status_code inside handleDocumentInfo, with something like if ($DocInfo->http_status_code == 200 or $DocInfo->http_status_code == 302) {...}
– stderr
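The check described above can be sketched as a small guard before parsing. This is illustrative only: the helper name shouldParse and the accepted status codes (200, 302) are assumptions for this sketch, not part of either library. The underlying cause of the fatal error is that str_get_html() returns false (not an object) for empty input, so an empty body has to be rejected just like a bad status code:

```php
<?php
// Illustrative guard: decide whether a crawled document should be parsed.
// shouldParse() and the accepted status codes are assumptions, not library API.
function shouldParse($statusCode, $content)
{
    // Only parse successful (or redirected) responses
    if ($statusCode != 200 && $statusCode != 302) {
        return false;
    }
    // str_get_html() returns false for empty input, which is what later
    // causes "Call to a member function find() on boolean"
    return trim($content) !== '';
}

// Inside handleDocumentInfo() it could be used like this:
//   if (!shouldParse($DocInfo->http_status_code, $DocInfo->content)) {
//       return;
//   }
//   $html = str_get_html($DocInfo->content);
```

Checking $html !== false after str_get_html() would catch the same failure even for responses that pass the status-code test.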
If possible, post an answer with the resolution of the problem. =)
– stderr
@qmechanik, I didn't give this question any more attention because I soon changed the architecture and the way the problem was solved as a whole; sorry for the inconvenience =D
– Ricardo