You can use the AutoThrottle extension, which tries to optimize crawling speed automatically based on estimates of both the server load and Scrapy's own processing load.
Using this extension (code here), you define a `CONCURRENT_REQUESTS_PER_IP` (or `CONCURRENT_REQUESTS_PER_DOMAIN`) maximum, and the actual limits are adjusted dynamically according to performance measured at runtime. The throttling algorithm takes download latency into account.
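As a minimal sketch, this is roughly what enabling it in `settings.py` looks like; the specific values here are illustrative, not recommendations:

```python
# settings.py -- minimal AutoThrottle setup (values are illustrative).

# Enable the AutoThrottle extension (it is disabled by default).
AUTOTHROTTLE_ENABLED = True

# Initial download delay, used before any latency has been measured.
AUTOTHROTTLE_START_DELAY = 5.0

# Upper bound on the delay AutoThrottle may impose under high latency.
AUTOTHROTTLE_MAX_DELAY = 60.0

# Average number of parallel requests Scrapy should aim to send to each
# remote server; the dynamic delay is adjusted to approach this target.
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Hard caps that the dynamic limits will never exceed.
CONCURRENT_REQUESTS_PER_IP = 8
# CONCURRENT_REQUESTS_PER_DOMAIN = 8  # used only when the per-IP cap is 0

# Set to True to log throttling stats for every response received.
AUTOTHROTTLE_DEBUG = False
```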
Otherwise, to find a better configuration you will have to test different combinations of concurrent request limits per IP/domain, download delay, and CPU load.
It’s hard to give a recipe for doing this manually, because it depends heavily on the type of crawling you’re doing. For example, if you are crawling several different websites, you may want to use different settings for each (see the sketch below). If you are only crawling a single website, you will need to take that site’s request limits into account. And so on: each situation has to be analyzed separately.
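One way to keep per-site settings separate is Scrapy's `custom_settings` class attribute, which overrides the project settings for a single spider. A sketch, where both spiders, their target URLs, and all values are hypothetical:

```python
import scrapy


class FastSiteSpider(scrapy.Spider):
    # Hypothetical site known to tolerate heavier crawling.
    name = "fast_site"
    start_urls = ["https://example.com/"]
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 16,
        "DOWNLOAD_DELAY": 0.25,
    }

    def parse(self, response):
        pass  # extraction logic omitted


class SlowSiteSpider(scrapy.Spider):
    # Hypothetical rate-limited site: fewer concurrent requests,
    # a longer delay, and AutoThrottle as a safety net.
    name = "slow_site"
    start_urls = ["https://example.org/"]
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "DOWNLOAD_DELAY": 2.0,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def parse(self, response):
        pass  # extraction logic omitted
```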
Many websites impose a maximum number of requests per IP per time interval, so it usually makes sense to set `CONCURRENT_REQUESTS_PER_IP` and `DOWNLOAD_DELAY` together, respecting the site's limits.
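For example, suppose a site documents a limit of 60 requests per minute per IP (a hypothetical figure): that averages one request per second, so the settings below stay just under it.

```python
# settings.py -- sketch for a hypothetical limit of 60 requests/min/IP.

# One request at a time per IP keeps the effective rate directly
# tied to the delay between requests.
CONCURRENT_REQUESTS_PER_IP = 1

# 60 req/min == 1 req/s, so wait slightly more than 1 s between
# requests to stay safely under the limit.
DOWNLOAD_DELAY = 1.2

# Optional: randomize the delay (0.5x to 1.5x DOWNLOAD_DELAY) so the
# crawl looks less mechanical; note it can briefly dip below 1 s.
RANDOMIZE_DOWNLOAD_DELAY = True
```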