But this is basically what the crawler should do. It will be up to you to keep a database with the list of sites you want to scan and use cron to schedule the scans, preferably one cron entry per site. In that script you would pass the site you want to scan as an argument, for example: $crawler->setURL($argv[1]).
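A minimal sketch of such a script, assuming the PHPCrawl library (which provides setURL(), go() and the handleDocumentInfo() callback); the include path and the output logic are placeholders you would adapt to your setup:

```php
<?php
// crawl.php - minimal sketch; assumes the PHPCrawl library is installed.
// The include path below is an assumption; adjust it to your installation.
require_once 'libs/PHPCrawler.class.php';

if (!isset($argv[1])) {
    fwrite(STDERR, "Usage: php crawl.php <site-url>\n");
    exit(1);
}

class MyCrawler extends PHPCrawler
{
    // Called by PHPCrawl for every document it receives.
    public function handleDocumentInfo($DocInfo)
    {
        // Store or index the page here (URL, status code, content, links, ...).
        echo $DocInfo->url . ' (' . $DocInfo->http_status_code . ")\n";
    }
}

$crawler = new MyCrawler();
$crawler->setURL($argv[1]);                         // the site passed by cron
$crawler->addContentTypeReceiveRule('#text/html#'); // only fetch HTML documents
$crawler->go();
```

Each site in your database would then get its own cron entry (the paths and times below are just examples):

```
0 1 * * * php /path/to/crawl.php https://example.com
0 3 * * * php /path/to/crawl.php https://example.org
```

That way every scan runs as its own short-lived PHP process instead of one giant request.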
Don’t expect a single PHP request to process numerous websites; that will be bad for your server. Google, Yahoo and Bing scan different sites periodically, in separate routines, and they probably limit themselves to something like one site per hour, continuing only afterwards.
If a single request and a single PHP script tried to access multiple URLs, the application would become a long-running process that could take hours, and depending on how PHP’s garbage collection (GC) copes, it might not be able to free the resources in use, so CPU or memory consumption would keep growing until your server starts crashing.
The most appropriate way (not necessarily the only right way) is to scan one site at a time, set a page limit, and pick up where you left off on the next run if you use that limit. Remember that some sites have more than 50,000 pages.
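One way to implement that limit-and-resume idea, assuming PHPCrawl’s page limit and resumption features (setPageLimit(), enableResumption(), getCrawlerId(), resume()) are available in your version; the file used to remember the crawler id is a hypothetical choice:

```php
<?php
// resume_crawl.php - sketch of a limited, resumable crawl (assumes PHPCrawl).
require_once 'libs/PHPCrawler.class.php'; // path is an assumption

class ResumableCrawler extends PHPCrawler
{
    public function handleDocumentInfo($DocInfo)
    {
        echo $DocInfo->url . "\n"; // record/index the page here
    }
}

$site   = $argv[1];
$idFile = '/tmp/crawler_id_' . md5($site) . '.txt'; // hypothetical place to keep the crawler id

$crawler = new ResumableCrawler();
$crawler->setURL($site);
$crawler->setPageLimit(5000);  // stop this run after 5000 pages
$crawler->enableResumption();  // keep internal state so the crawl can continue later

if (file_exists($idFile)) {
    // A previous run hit the limit (or was interrupted): continue where it stopped.
    $crawler->resume(file_get_contents($idFile));
} else {
    // First run for this site: remember the crawler id so later runs can resume.
    file_put_contents($idFile, $crawler->getCrawlerId());
}

$crawler->go();
// Once the site is fully crawled, delete $idFile so the next scan starts fresh.
```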
Look, thank you very much, that cleared up quite a bit of the idea of what I should do. Thank you.
– João Pacheco