Can I use an HTTP cache on Scrapy Cloud?

Modified on Wed, 3 Feb, 2021 at 8:24 AM


Yes, you can.


To do that, you have to enable Scrapy's HTTP cache extension by setting HTTPCACHE_ENABLED to True in your project settings.
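As a minimal sketch, this is what that looks like in your project's settings.py (the HTTPCACHE_EXPIRATION_SECS line is optional and shown with Scrapy's default value, purely for illustration):

```python
# settings.py -- enable Scrapy's built-in HTTP cache extension
HTTPCACHE_ENABLED = True

# Optional: 0 means cached responses never expire (Scrapy's default);
# a positive value is the cache lifetime in seconds.
HTTPCACHE_EXPIRATION_SECS = 0
```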


By default, this extension saves cached pages to the filesystem. When you run a spider locally with the HTTP cache enabled, the extension creates a .scrapy/httpcache folder inside your local project and stores your requests and their responses in local files. To make this work on Scrapy Cloud, you'll have to enable the DotScrapy Persistence addon, which gives your spiders access to persistent storage on Scrapy Cloud.


If you need to cache a large volume of requests and responses, however, you should switch your HTTP cache backend to DBM, as described in the HTTP Cache documentation. Otherwise, the large number of small files that the HTTP cache extension creates can easily exhaust your quota of disk inodes.
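A minimal sketch of switching the backend in settings.py (DbmCacheStorage is the DBM storage class bundled with Scrapy; check the HTTP cache documentation for the exact path in your Scrapy version):

```python
# settings.py -- store the HTTP cache in DBM databases
# instead of one file per cached request/response pair
HTTPCACHE_ENABLED = True
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.DbmCacheStorage"
```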


Check out the official documentation to learn more about the HTTP Cache middleware.
