The Auto Throttle addon makes spiders crawl target sites more cautiously by dynamically adjusting request concurrency and delay according to the site's response latency and user-defined control parameters. For more details see the Scrapy Autothrottle documentation.
This addon is enabled by default in every Scrapy Cloud project. The basic settings controlling its behaviour are:
CONCURRENT_REQUESTS_PER_DOMAIN - limits the maximum number of concurrent requests sent to the same host domain (default value is 8)
DOWNLOAD_DELAY - limits the minimum download delay (in seconds) between each request to a given domain (default value is 0)
AUTOTHROTTLE_ENABLED - enables or disables the Auto Throttle addon (default value is True, i.e. enabled)
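For reference, the defaults listed above can be written out as plain Scrapy-style settings. This is only a sketch: it assumes the settings live in a Python settings module such as a project's settings.py, and the values simply restate the defaults quoted in this article.

```python
# Defaults quoted in this article, written as Scrapy-style settings.
# Assumption: these live in a Python settings module (e.g. settings.py).
AUTOTHROTTLE_ENABLED = True          # Auto Throttle is on by default
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # upper limit on concurrent requests per domain
DOWNLOAD_DELAY = 0                   # lower limit on the delay between requests, in seconds
```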
Adjusting Auto Throttle settings
The right settings depend on your needs; no values work for every website. The default values are generally a good starting point, and most servers tolerate them. Still, a site may block the crawler, in which case the crawling rate has to be slowed down. Conversely, you may want the bot to crawl faster; in that case, be aware that the risk of being blocked increases.
The crawling rate may be slowed down by setting the maximum concurrency CONCURRENT_REQUESTS_PER_DOMAIN to 1 and increasing the minimum download delay DOWNLOAD_DELAY as needed. The maximum effective crawling rate is limited in practice by the target server's response rate. You may try to speed crawling up by raising the maximum concurrency, although this usually has no significant effect, because the effective concurrency will hardly exceed 2 for most sites.
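A slowed-down configuration of the kind described above might look like the following. The delay of 5 seconds is an illustrative value chosen for this example, not a recommendation from this article; pick whatever the target site tolerates.

```python
# Sketch of a more cautious configuration (Scrapy-style settings module).
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain
DOWNLOAD_DELAY = 5                   # illustrative value: wait at least 5 seconds between requests
```

Because Auto Throttle stays enabled here, these values act as limits: the effective delay may still rise above DOWNLOAD_DELAY when the site responds slowly.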
Because Auto Throttle dynamically adjusts delay and concurrency according to the website's response delay, these parameters only define limits; they do not force values. During crawling, the effective download delay will never drop below the minimum download delay, and the effective concurrency will never exceed the maximum concurrency. If fixed values are needed, Auto Throttle and its dynamic adjustment of the effective parameters have to be disabled by setting AUTOTHROTTLE_ENABLED to False. With Auto Throttle disabled, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY may be set to the required values.
But be warned that you do so at your own risk: as stated before, increasing the crawling rate considerably increases the probability of being blocked by the target site or of having your Zyte account suspended.
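The difference between limits and fixed values can be illustrated with a small model. This is a hypothetical sketch and not the addon's actual code: the function effective_parameters and its suggested_delay and suggested_concurrency arguments are invented for this example and stand in for whatever Auto Throttle would pick from the observed response delay.

```python
# Hypothetical model of the behaviour described in this article;
# it is NOT the Auto Throttle implementation.

def effective_parameters(settings, suggested_delay, suggested_concurrency):
    """Return the (delay, concurrency) pair a crawl would use under this model."""
    min_delay = settings["DOWNLOAD_DELAY"]
    max_concurrency = settings["CONCURRENT_REQUESTS_PER_DOMAIN"]

    if not settings["AUTOTHROTTLE_ENABLED"]:
        # Auto Throttle is off: the configured values are used as-is.
        return min_delay, max_concurrency

    # Auto Throttle is on: its suggestions are kept within the configured limits.
    delay = max(min_delay, suggested_delay)
    concurrency = min(max_concurrency, suggested_concurrency)
    return delay, concurrency


throttled = {
    "AUTOTHROTTLE_ENABLED": True,
    "DOWNLOAD_DELAY": 2,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
}
fixed = dict(throttled, AUTOTHROTTLE_ENABLED=False)

# A fast site: the suggestions are clipped to the limits.
print(effective_parameters(throttled, suggested_delay=0.5, suggested_concurrency=6))  # (2, 4)
# A slow site: the suggestions already respect the limits, so they are kept.
print(effective_parameters(throttled, suggested_delay=3.5, suggested_concurrency=2))  # (3.5, 2)
# Auto Throttle disabled: the configured values apply regardless of site behaviour.
print(effective_parameters(fixed, suggested_delay=3.5, suggested_concurrency=2))      # (2, 4)
```

In this model, the same two settings produce different crawling behaviour depending only on AUTOTHROTTLE_ENABLED, which is why disabling it is the step that turns the limits into fixed values.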