Downloading and processing images

Modified on Wed, 3 Feb, 2021 at 7:55 AM

NOT TO BE CONFUSED WITH THE IMAGES ADDON

Scrapy provides reusable item pipelines for downloading images attached to a particular item (for example, when you scrape products and also want to download their images).

The Images Pipeline provides the following features for processing images:

Avoiding re-downloading media that was downloaded recently
Specifying where to store the media (for example, an Amazon S3 bucket or a Google Cloud Storage bucket)
Converting all downloaded images to a common format (JPG) and mode (RGB)
Generating thumbnails
Checking image width and height to make sure they meet a minimum size constraint

Enabling Images Pipeline

To enable the Images Pipeline, you must first add it to your project's ITEM_PIPELINES setting:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

Then set IMAGES_STORE to a valid location for storing the downloaded images. Otherwise, the pipeline remains disabled, even if you include it in the ITEM_PIPELINES setting. Each image is downloaded and stored at a path of the following form:

<IMAGES_STORE>/full/<image_id>.jpg

Where:

<image_id> is the SHA1 hash of the image URL
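As a rough illustration of the naming scheme above, the following sketch derives the storage path from an image URL using only the Python standard library. The function name image_path and the example URL are hypothetical; Scrapy computes this path internally, so treat this as a way to predict where a file will land rather than as the pipeline's own code.

```python
import hashlib


def image_path(url: str) -> str:
    """Return the path, relative to IMAGES_STORE, for a downloaded image."""
    # <image_id> is the hex-encoded SHA1 hash of the image URL.
    image_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_id}.jpg"


# Hypothetical URL, used only for illustration.
path = image_path("https://example.com/product.jpg")
print(path)
```

Because the path depends only on the URL, the same URL always maps to the same file name.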

Supported Storages in Scrapy Cloud

Amazon S3 Storage:

IMAGES_STORE = 's3://bucket/images'

You will also need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your settings.py file.

You can modify the Access Control List (ACL) policy used for the stored files, which is defined by the IMAGES_STORE_S3_ACL setting. By default, the ACL is set to private. To make the files publicly available use the public-read policy:

IMAGES_STORE_S3_ACL = 'public-read'

Google Cloud Storage (requires google-cloud-storage):

IMAGES_STORE = 'gs://bucket/images/'
GCS_PROJECT_ID = 'project_id'

NOTE: Support is available only on Scrapy 1.5.0+

For information about authentication, see this documentation.

Using the Images Pipeline

The Images Pipeline downloads images from the extracted image URLs and stores them in the selected storage. To use the Images Pipeline, you need to define two item fields:

image_urls - the source field from which the Images Pipeline gets the URLs of the images to download. It is also the field used for annotating image URLs in the template.
images - the field where the pipeline saves information about each stored image, including its storage path relative to the IMAGES_STORE setting and the original image URL.
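A minimal sketch of populating the source field is shown below. It assumes a plain dict is used as the item and that relative image links must be resolved against the page URL before the pipeline can download them; the helper name build_item, the name field, and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def build_item(page_url, image_srcs):
    """Build a dict item whose image_urls field holds absolute image URLs."""
    return {
        "name": "example product",  # hypothetical extra field
        # The pipeline reads the URLs to download from this list.
        "image_urls": [urljoin(page_url, src) for src in image_srcs],
    }


item = build_item("https://example.com/products/1", ["/img/a.jpg", "b.png"])
print(item["image_urls"])
```

The images field is not set here; the pipeline fills it in once the downloads finish.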

These field names are the defaults, but they can be overridden with the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings. The source and target fields defined by these two settings do not need to be different; they can have the same name, which saves you from defining an additional field in the item. In that case, the Images Pipeline simply overwrites the previously extracted data with the data it generates (a dict that already includes the original URL).
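For instance, a settings.py fragment along these lines would point both settings at a single field. The field name 'images' is only an illustrative choice; any item field name should work the same way.

```python
# settings.py (sketch): read image URLs from, and write results to, one field.
IMAGES_URLS_FIELD = 'images'
IMAGES_RESULT_FIELD = 'images'
```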

Configuring the Images Pipeline

File expiration

The Images Pipeline avoids downloading files that were downloaded recently. To adjust this retention period, use the IMAGES_EXPIRES setting, which specifies the period in days:

IMAGES_EXPIRES = 30

The default value is 90 days.
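The expiration check can be pictured roughly as follows. This is a simplified sketch of the idea, not the pipeline's actual code: the function name is_expired and its parameters are hypothetical, and it assumes the age of a stored file is measured from its last-modified timestamp.

```python
import time

SECONDS_PER_DAY = 60 * 60 * 24


def is_expired(last_modified, expires_days=90, now=None):
    """Return True if a stored file is older than the retention period."""
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    # Files older than the retention period are downloaded again.
    return age_days > expires_days


now = 1_700_000_000  # fixed timestamp so the example is reproducible
fresh = is_expired(now - 10 * SECONDS_PER_DAY, expires_days=30, now=now)
stale = is_expired(now - 45 * SECONDS_PER_DAY, expires_days=30, now=now)
print(fresh, stale)  # False True
```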

Thumbnail generation for images

The Images Pipeline can automatically create thumbnails of the downloaded images.

In order to use this feature, you must set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their dimensions.

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

When you use this feature, the Images Pipeline creates a thumbnail of each specified size and stores it at a path of the following form:

<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg

Where:

<size_name> is one of the keys of the IMAGES_THUMBS dictionary (small, big, etc.)
<image_id> is the SHA1 hash of the image URL
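To make the layout concrete, the sketch below lists the thumbnail paths that would be expected for one image URL under the IMAGES_THUMBS value shown earlier. The helper name thumb_paths and the example URL are hypothetical, and the code only mirrors the path format described above rather than calling Scrapy.

```python
import hashlib

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}


def thumb_paths(url, thumbs):
    """Map each thumbnail name to its path relative to IMAGES_STORE."""
    image_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return {name: f"thumbs/{name}/{image_id}.jpg" for name in thumbs}


paths = thumb_paths("https://example.com/product.jpg", IMAGES_THUMBS)
for name, path in paths.items():
    print(name, path)
```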

Filtering out small images

The Images Pipeline can drop images that are too small. To enable this, specify the minimum allowed size in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings.

IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum sizes will be saved. By default, there are no size constraints, so all images are processed.
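The filtering rule can be sketched as a simple predicate. The function name meets_min_size is hypothetical, and a default of 0 is assumed to stand for "no constraint"; the sketch only restates the rule above and is not the pipeline's own implementation.

```python
def meets_min_size(width, height, min_width=0, min_height=0):
    """Return True if an image satisfies every configured minimum."""
    # With both minimums left at 0, every image passes.
    return width >= min_width and height >= min_height


print(meets_min_size(120, 120, min_width=110, min_height=110))  # True
print(meets_min_size(200, 100, min_width=110, min_height=110))  # False
print(meets_min_size(50, 50))  # True
```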

Allowing redirections

By default, media pipelines ignore redirects: an HTTP redirect in response to a media file URL request means the media download is considered failed.

To handle media redirections, set MEDIA_ALLOW_REDIRECTS to True:

MEDIA_ALLOW_REDIRECTS = True