Incremental crawls with Scrapy and DeltaFetch in Scrapy Cloud

Modified on Wed, 3 Feb, 2021 at 7:54 AM

NOT TO BE CONFUSED WITH THE DELTAFETCH AND DOTSCRAPY PERSISTENCE ADDONS

DeltaFetch is a spider middleware that avoids requesting pages from which items were already scraped in previous crawls of the same spider, thus producing a delta crawl that contains only new items. For more details on the middleware, see the GitHub repository: scrapy-deltafetch.

NOTE 1: DeltaFetch only skips requests to pages that produced scraped items in a previous crawl, and only if those requests did not come from the spider's start_urls or start_requests. Pages from which no items were directly scraped are still crawled every time you run your spiders, which makes DeltaFetch well suited to detecting new records in directories: the listing pages are revisited on every run, but item pages that were already scraped are not.
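The rule in NOTE 1 can be illustrated with a small simulation. This is a simplified model written for this article, not the scrapy-deltafetch source; the function names, the page layout and the URLs are all hypothetical.

```python
# Simplified model of the skip rule from NOTE 1 (not the real middleware).

def should_skip(url, seen, is_start_request):
    """Return True when a request would be dropped under the NOTE 1 rule."""
    # Start requests are never filtered, even if their page yielded items before.
    if is_start_request:
        return False
    # Other requests are skipped only if a previous fetch scraped items from them.
    return url in seen


def crawl(pages, start_urls, seen):
    """Simulate one crawl; `pages` maps url -> (links, yields_item)."""
    fetched = []
    visited = set()
    queue = [(url, True) for url in start_urls]
    while queue:
        url, is_start = queue.pop(0)
        if url in visited or should_skip(url, seen, is_start):
            continue
        visited.add(url)
        fetched.append(url)
        links, yields_item = pages[url]
        if yields_item:
            seen.add(url)  # remember pages that produced items
        queue.extend((link, False) for link in links)
    return fetched


pages = {
    "/list": (["/item/1", "/item/2"], False),  # directory page, yields no items
    "/item/1": ([], True),
    "/item/2": ([], True),
}
seen = set()  # stands in for the persisted DeltaFetch state
first = crawl(pages, ["/list"], seen)

# A new record appears in the directory before the second run.
pages["/list"] = (["/item/1", "/item/2", "/item/3"], False)
pages["/item/3"] = ([], True)
second = crawl(pages, ["/list"], seen)

print(first)   # ['/list', '/item/1', '/item/2']
print(second)  # ['/list', '/item/3']
```

In the second run the directory page is fetched again because it never produced items itself, while the two item pages scraped earlier are skipped.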

Getting Started with DeltaFetch

Enable it in your project’s settings.py file:

SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True

To use DeltaFetch in Scrapy Cloud, you also need to enable the scrapy-dotpersistence extension in your project's settings.py, so that the DeltaFetch state is kept between jobs:

EXTENSIONS = {
    ...
    'scrapy_dotpersistence.DotScrapyPersistence': 0
}

and

DOTSCRAPY_ENABLED = True

With both enabled, DotScrapy Persistence stores a .db file in your bucket at a path with this format:

s3://my_bucket/username/org-<orgid>/<projectid>/dot-scrapy/<spidername>/deltafetch/<spidername>.db

Where:

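my_bucket is configured via the ADDONS_S3_BUCKET setting
username is the optional folder path configured via the ADDONS_AWS_USERNAME setting
<orgid> is your Zyte organization id
<projectid> is the id of your Scrapy Cloud project
<spidername> is the name of the spider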
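As a quick way to work out where a given spider's file should end up, the format above can be expressed as a small helper. The function name and the example ids below are made up for illustration; only the path layout comes from this article.

```python
def deltafetch_db_path(bucket, username, org_id, project_id, spider_name):
    """Build the S3 location following the path format shown above.

    Hypothetical helper; it only formats a string and does not contact S3.
    """
    return (
        f"s3://{bucket}/{username}/org-{org_id}/{project_id}"
        f"/dot-scrapy/{spider_name}/deltafetch/{spider_name}.db"
    )


# Example values are placeholders, not real ids.
example_path = deltafetch_db_path("my_bucket", "username", 1234, 5678, "myspider")
print(example_path)
# s3://my_bucket/username/org-1234/5678/dot-scrapy/myspider/deltafetch/myspider.db
```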

Configuring DotScrapy Persistence

You can use your own S3 bucket by adding these settings:

ADDONS_AWS_ACCESS_KEY_ID = 'ABC'
ADDONS_AWS_SECRET_ACCESS_KEY = 'DEF'
ADDONS_AWS_USERNAME = 'username'  # This is the folder path (optional)
ADDONS_S3_BUCKET = 'my_bucket'

NOTE 2: These settings carry the "ADDONS_" prefix. Do not confuse them with Scrapy's own AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings.

NOTE 3: When adding these settings through Zyte, set them at Spider level. Setting them at Project level or in settings.py won't work, because Zyte's default values are applied at Organization level, which has higher priority than both. Only Spider-level settings take priority over the Organization-level defaults.
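The precedence described in NOTE 3 can be pictured with a short sketch. It only models the ordering stated in the note; it is not Zyte's implementation. The relative order of settings.py and Project level, the level names, and the placeholder default bucket name are all assumptions.

```python
# Levels from lowest to highest priority, following NOTE 3.
# Assumption: settings.py ranks below Project level; the note does not say.
LEVELS = ["settings.py", "project", "organization", "spider"]


def resolve(setting, values_by_level):
    """Return (value, level) from the highest-priority level defining `setting`."""
    for level in reversed(LEVELS):
        if setting in values_by_level.get(level, {}):
            return values_by_level[level][setting], level
    return None, None


configured = {
    "settings.py": {"ADDONS_S3_BUCKET": "my_bucket"},
    # Placeholder for a default value propagated at Organization level.
    "organization": {"ADDONS_S3_BUCKET": "default_bucket"},
}
shadowed = resolve("ADDONS_S3_BUCKET", configured)

# Setting the same value at Spider level makes it take effect.
configured["spider"] = {"ADDONS_S3_BUCKET": "my_bucket"}
effective = resolve("ADDONS_S3_BUCKET", configured)

print(shadowed)   # ('default_bucket', 'organization')
print(effective)  # ('my_bucket', 'spider')
```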

Resetting DeltaFetch

If you want to re-scrape pages, you can reset the DeltaFetch cache by adding the following setting when running a job:

DELTAFETCH_RESET = 1 (or True)

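Make sure to remove or disable this setting for subsequent runs; otherwise the cache is cleared at the start of every job and each crawl re-scrapes everything.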
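The effect of a reset can be sketched with Python's standard dbm module. This assumes the cache behaves like a dbm-style key-value file that is reopened empty on reset; the storage backend actually used by the middleware may differ between versions, and open_cache is a hypothetical helper.

```python
import dbm
import os
import tempfile


def open_cache(path, reset=False):
    """Open a key-value cache; with reset=True, start from an empty file."""
    # "n" always creates a new, empty database; "c" reuses an existing one.
    return dbm.open(path, "n" if reset else "c")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myspider.db")

    # First run: record a page that produced an item.
    with open_cache(path) as db:
        db[b"/item/1"] = b"seen"

    # Normal run: the earlier entry is still there, so the page would be skipped.
    with open_cache(path) as db:
        kept_after_normal = b"/item/1" in db

    # Run with reset: the cache starts empty, so the page is scraped again.
    with open_cache(path, reset=True) as db:
        kept_after_reset = b"/item/1" in db

print(kept_after_normal)  # True
print(kept_after_reset)   # False
```

Leaving the reset flag on would repeat the third step on every run, which is why it should be a one-off.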

For more details on the persistence extension, see the GitHub repository: scrapy-dotpersistence.