Here are a few things that work differently on Scrapy Cloud, compared to a default Scrapy configuration:
- AutoThrottle extension is enabled, to crawl websites politely
- JOBDIR is set, so the scheduler's request queue is persisted on disk rather than held in memory, which reduces memory usage
- LOG_LEVEL is set to INFO
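To approximate this behaviour when running a spider locally, the same three settings can be placed in a project's settings.py. This is only a sketch: the JOBDIR path below is a hypothetical example, and the exact values Scrapy Cloud uses internally are not stated in this article.

```python
# settings.py (local project) - a sketch that mirrors the list above.

# AutoThrottle is disabled in a default Scrapy configuration; enabling it
# makes Scrapy adjust the crawl delay based on how the site responds.
AUTOTHROTTLE_ENABLED = True

# Setting JOBDIR makes the scheduler keep its request queue on disk.
# "crawls/example-run" is an assumed path chosen for illustration only.
JOBDIR = "crawls/example-run"

# A default Scrapy configuration logs at DEBUG level; INFO is less verbose.
LOG_LEVEL = "INFO"
```

The same values can also be passed for a single run on the command line with the -s option, for example `scrapy crawl myspider -s LOG_LEVEL=INFO`, where `myspider` is a placeholder spider name.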
Note that Scrapy Cloud servers are located in Germany, which may pose an obstacle when targeting websites that restrict access based on geolocation. In such cases, using a proxy service is recommended, e.g. Zyte Smart Proxy Manager (formerly Crawlera).
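As a generic illustration of routing requests through a proxy, Scrapy's built-in HttpProxyMiddleware honours a `proxy` key in `request.meta`. The sketch below is a minimal downloader middleware that sets that key. The class name and the proxy URL are assumptions made for this example, and this is not the Zyte Smart Proxy Manager integration, which has its own setup.

```python
# Hypothetical proxy endpoint; replace with the address of your proxy service.
PROXY_URL = "http://proxy.example.com:8000"


class ProxyMiddleware:
    """Downloader middleware sketch that assigns a proxy to each request."""

    def process_request(self, request, spider):
        # Keep any proxy already chosen for this request; otherwise use ours.
        request.meta.setdefault("proxy", PROXY_URL)
        # Returning None lets Scrapy continue processing the request normally.
        return None
```

To take effect, a middleware like this would have to be registered in the project's DOWNLOADER_MIDDLEWARES setting under its import path.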