Syncing your .scrapy folder to an S3 bucket using DotScrapy Persistence

Modified on Wed, 3 Feb, 2021 at 7:56 AM

NOT TO BE CONFUSED WITH THE DOTSCRAPY PERSISTENCE ADDON

This guide shows how to keep the contents of the .scrapy directory in a persistent store: the directory is loaded when the spider starts and saved when the spider finishes. This lets spiders share data between runs, keeping state or any other data that needs to persist. For more details on the extension, see the GitHub repository: scrapy-dotpersistence.

The .scrapy directory is a well-known location in Scrapy, and a few extensions use it to keep state between runs. The canonical way to work with the .scrapy directory is to call the scrapy.utils.project.data_path function, as illustrated in the following example:

import ast
import os

import scrapy
from scrapy.utils.project import data_path


class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder spider name

    def start_requests(self):
        filename = 'data.txt'
        mydata_path = data_path(filename)

        # in a local project mydata_path will be /<SCRAPY_PROJECT>/.scrapy/data.txt
        # on Scrapy Cloud mydata_path will be /Zyte/.scrapy/data.txt
        # use mydata_path to store or read data which will be persisted among runs
        # for instance:

        cookies_to_send = {}  # fall back to no cookies if nothing was persisted yet
        if os.path.exists(mydata_path) and os.path.getsize(mydata_path) > 0:
            with open(mydata_path, 'r') as f:
                canned_cookie_jar = f.read()
            cookies_to_send = ast.literal_eval(canned_cookie_jar)

        yield scrapy.Request(url='<SOME_URL>', callback=self.parse, cookies=cookies_to_send)
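
The example above only reads persisted data. As a minimal sketch of the full round trip, the two helpers below load and save a Python literal at a given path. The names load_state and save_state are hypothetical (they are not part of Scrapy or scrapy-dotpersistence), and the path is passed in explicitly so the helpers can be tried outside a Scrapy project.

```python
import ast
import os


def load_state(path, default=None):
    """Return the Python literal stored at `path`, or `default` if the file is absent or empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        with open(path, "r") as f:
            return ast.literal_eval(f.read())
    return default


def save_state(path, state):
    """Write `state` as a Python literal so that ast.literal_eval can read it back."""
    with open(path, "w") as f:
        f.write(repr(state))
```

In a spider, the path would typically come from data_path('data.txt'), and save_state could be called before the run finishes, for example from the spider's closed() method. Only plain literals (dicts, lists, strings, numbers and similar) survive this round trip.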

Enabling DotScrapy Persistence

Enable the extension by adding the following settings to your settings.py:

EXTENSIONS = {
    ...
    'scrapy_dotpersistence.DotScrapyPersistence': 0
}

and

DOTSCRAPY_ENABLED = True

Configuring DotScrapy Persistence

Configure the extension through the following settings:

ADDONS_AWS_ACCESS_KEY_ID = 'ABC'
ADDONS_AWS_SECRET_ACCESS_KEY = 'DEF'
ADDONS_AWS_USERNAME = 'username'  # This is the folder path (optional)
ADDONS_S3_BUCKET = 'my_bucket'

NOTE 1: the settings have the prefix "ADDONS_"; do not confuse them with Scrapy's AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings.

NOTE 2: When adding the settings through Zyte, set them at the Spider level. Setting them at the Project level or in settings.py won't work: Zyte's default values are propagated at the Organization level, and only Spider-level settings have higher priority than those defaults.
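
The prefix mix-up described in NOTE 1 is easy to miss, so a small sanity check can help. The sketch below is a hypothetical helper (find_setting_problems is not part of the extension) that reports which of the settings are missing and whether an unprefixed variant was set instead. Treating the two keys and the bucket as required is an assumption based on the list above.

```python
# Assumption: these three settings are required; ADDONS_AWS_USERNAME is optional.
REQUIRED_SETTINGS = (
    "ADDONS_AWS_ACCESS_KEY_ID",
    "ADDONS_AWS_SECRET_ACCESS_KEY",
    "ADDONS_S3_BUCKET",
)


def find_setting_problems(settings):
    """Return human-readable problems found in `settings` (any object with .get())."""
    problems = []
    for name in REQUIRED_SETTINGS:
        if settings.get(name):
            continue  # setting is present and non-empty
        unprefixed = name[len("ADDONS_"):]
        if settings.get(unprefixed):
            problems.append(f"{name} is missing, but {unprefixed} is set (wrong prefix?)")
        else:
            problems.append(f"{name} is missing")
    return problems
```

The function accepts a plain dict or any object with a dict-like get() method, so it could be called with self.settings inside a spider and the result logged at startup.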

With these settings in place, DotScrapy Persistence syncs your .scrapy folder to an S3 bucket using the following path format:

s3://my_bucket/username/org-<orgid>/<projectid>/dot-scrapy/<spidername>/
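
To locate the synced folder for a given spider, the path format above can be assembled as in the following sketch. The function name dotscrapy_s3_uri and the sample IDs are hypothetical, and the case where ADDONS_AWS_USERNAME is left unset is deliberately not covered, since the format above does not describe it.

```python
def dotscrapy_s3_uri(bucket, username, org_id, project_id, spider_name):
    """Build the S3 location of a spider's synced .scrapy folder, following the format above."""
    return f"s3://{bucket}/{username}/org-{org_id}/{project_id}/dot-scrapy/{spider_name}/"


# Illustrative values only; substitute your own organization and project IDs.
print(dotscrapy_s3_uri("my_bucket", "username", 1, 2, "myspider"))
```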