You can use the AutoThrottle extension, which tries to optimize crawling speed automatically based on estimates of both the server load and Scrapy's own processing load.
Using this extension (code here), you define a `CONCURRENT_REQUESTS_PER_IP` (or `CONCURRENT_REQUESTS_PER_DOMAIN`) maximum, and the actual limits are adjusted dynamically according to performance measured at runtime. The throttling algorithm takes download latency into account.
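As a minimal sketch, this is roughly what enabling it in `settings.py` looks like; the specific values here are illustrative, not recommendations:

```python
# settings.py -- minimal AutoThrottle setup (values are illustrative).

# Enable the AutoThrottle extension (it is disabled by default).
AUTOTHROTTLE_ENABLED = True

# Initial download delay, used before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 5.0

# Upper bound on the delay AutoThrottle may impose under high latency.
AUTOTHROTTLE_MAX_DELAY = 60.0

# Average number of parallel requests Scrapy should aim to send to each
# remote server; the dynamic delay is adjusted to approach this target.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Hard caps that the dynamic limits will never exceed.
CONCURRENT_REQUESTS_PER_IP = 8
# CONCURRENT_REQUESTS_PER_DOMAIN = 8  # used only when the per-IP cap is 0

# Set to True to log throttling stats for every response received.
AUTOTHROTTLE_DEBUG = False
```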
Otherwise, to find a better configuration you will have to test different combinations of concurrent request limits per IP/domain, download delay, and CPU load.
It’s hard to give a recipe for doing this manually, because it depends heavily on the type of crawling you’re doing. For example, if you are crawling several different websites, you may want to use different settings for each (see the sketch below). If you are only crawling a single website, you will need to take that site’s request limits into account. And so on: each situation has to be analyzed separately.
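One way to keep per-site settings separate is Scrapy's `custom_settings` class attribute, which overrides the project settings for a single spider. A sketch, where both spiders, their target URLs, and all values are hypothetical:

```python
import scrapy


class FastSiteSpider(scrapy.Spider):
    # Hypothetical site known to tolerate heavier crawling.
    name = "fast_site"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
        "DOWNLOAD_DELAY": 0.25,
    }

    def parse(self, response):
        pass  # extraction logic omitted


class SlowSiteSpider(scrapy.Spider):
    # Hypothetical rate-limited site: fewer concurrent requests,
    # a longer delay, and AutoThrottle as a safety net.
    name = "slow_site"
    start_urls = ["https://example.org/"]
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "DOWNLOAD_DELAY": 2.0,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        pass  # extraction logic omitted
```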
Many websites impose a maximum number of requests per IP per time interval, so it usually makes sense to set `CONCURRENT_REQUESTS_PER_IP` and `DOWNLOAD_DELAY` together, respecting the site's limits.
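For example, suppose a site documents a limit of 60 requests per minute per IP (a hypothetical figure): that averages one request per second, so the settings below stay just under it.

```python
# settings.py -- sketch for a hypothetical limit of 60 requests/min/IP.

# One request at a time per IP keeps the effective rate directly
# tied to the delay between requests.
CONCURRENT_REQUESTS_PER_IP = 1

# 60 req/min == 1 req/s, so wait slightly more than 1 s between
# requests to stay safely under the limit.
DOWNLOAD_DELAY = 1.2

# Optional: randomize the delay (0.5x to 1.5x DOWNLOAD_DELAY) so the
# crawl looks less mechanical; note it can briefly dip below 1 s.
RANDOMIZE_DOWNLOAD_DELAY = True
```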