Here are a few things that work differently on Scrapy Cloud, compared to a default Scrapy configuration:
- AutoThrottle extension is enabled, to crawl websites politely
- JOBDIR is set, so the scheduler's request queue is persisted on disk, which saves memory
- LOG_LEVEL is set to
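
To approximate the first two of these defaults when running a spider locally, a project's settings module could look roughly like the sketch below. `AUTOTHROTTLE_ENABLED` and `JOBDIR` are standard Scrapy settings; the directory name is a hypothetical placeholder, and the exact values Scrapy Cloud uses internally are not shown here.

```python
# settings.py (sketch): approximates two of the Scrapy Cloud defaults locally.

# Enable the AutoThrottle extension so the crawl rate adapts to server load.
AUTOTHROTTLE_ENABLED = True

# Persist the scheduler's request queue on disk instead of keeping it in memory.
# The directory name is a hypothetical example; use one directory per job.
JOBDIR = "crawls/example-spider-1"
```
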
Note that Scrapy Cloud servers are located in Germany, which may pose an obstacle when targeting websites that restrict access based on geolocation. In such cases, using a proxy service is recommended, e.g. Zyte Smart Proxy Manager (formerly Crawlera).
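
As a rough illustration of routing requests through a generic HTTP proxy, Scrapy's built-in `HttpProxyMiddleware` reads the `proxy` key of a request's `meta` dict. The sketch below only builds that dict; the proxy URL and the `with_proxy` helper are hypothetical, and a specific proxy service may need its own authentication or setup beyond this.

```python
# Hypothetical proxy endpoint; replace with the one your provider gives you.
PROXY_URL = "http://proxy.example.com:8010"


def with_proxy(meta=None, proxy_url=PROXY_URL):
    """Return a copy of a request's meta dict with the "proxy" key set."""
    new_meta = dict(meta or {})  # copy so the caller's dict is not mutated
    new_meta["proxy"] = proxy_url
    return new_meta


# In a spider, this would be passed as scrapy.Request(url, meta=with_proxy()).
print(with_proxy({"depth": 1}))
```
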