NOT TO BE CONFUSED WITH THE IMAGES ADDON
Scrapy provides reusable item pipelines for downloading images attached to a particular item (for example, when you scrape products and also want to download their images).
The Images Pipeline has the following functions for processing images:
- Avoiding re-downloading media that was downloaded recently
- Specifying where to store the media (Amazon S3 bucket, Google Cloud Storage bucket)
- Converting all downloaded images to a common format (JPG) and mode (RGB)
- Generating thumbnails
- Checking image width/height to make sure they meet a minimum constraint
Enabling Images Pipeline
To enable the Images Pipeline, you must first add it to your project's ITEM_PIPELINES setting:
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
Then, set IMAGES_STORE to a valid location for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting. Each image is downloaded and stored under the following path:
<IMAGES_STORE>/full/<image_id>.jpg
Where:
<image_id> is the SHA1 hash of the image URL
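As a rough illustration of the naming scheme above, the stored path can be derived from the image URL with Python's standard hashlib module. The helper name image_store_path and the example URL are hypothetical; this sketch only mirrors the format described here and is not the pipeline's internal code.

```python
import hashlib


def image_store_path(image_url: str) -> str:
    """Return the path, relative to IMAGES_STORE, for a full-size image."""
    # <image_id> is the SHA1 hex digest of the image URL
    image_id = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return f"full/{image_id}.jpg"


# Hypothetical URL, used only for illustration
print(image_store_path("https://example.com/product.jpg"))
```

Because the name depends only on the URL, the same URL always maps to the same stored file.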
Supported Storages in Scrapy Cloud
Amazon S3 Storage:
IMAGES_STORE = 's3://bucket/images'
You will also need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your settings.py.
You can modify the Access Control List (ACL) policy used for the stored files, which is defined by the IMAGES_STORE_S3_ACL setting. By default, the ACL is set to private. To make the files publicly available, use the public-read policy:
IMAGES_STORE_S3_ACL = 'public-read'
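Putting the S3 settings together, a settings.py fragment might look like the sketch below. The bucket name is the placeholder used above. Reading the credentials from environment variables is an assumption made here to keep secrets out of source control; whether such variables are available depends on how the project is deployed, and the values can equally be set directly.

```python
import os

# Enable the pipeline and point it at an S3 bucket (placeholder bucket name)
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "s3://bucket/images"

# Assumption: credentials are supplied through environment variables
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

# Optional: make stored files publicly readable (the default is private)
IMAGES_STORE_S3_ACL = "public-read"
```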
Google Cloud Storage (requires google-cloud-storage):
IMAGES_STORE = 'gs://bucket/images/'
GCS_PROJECT_ID = 'project_id'
NOTE: Support is available only on Scrapy 1.5.0+
For information about authentication, see this documentation.
Using the Images Pipeline
The Images Pipeline will download images from extracted image URLs and store them in the selected storage. For the Images Pipeline, you will need to define two item fields:
- image_urls, which is used for annotating image URLs in the template. This is the source field from which the Images Pipeline gets the URLs of the images to download.
- images, which stores important information about each stored image, including the storage path relative to the IMAGES_STORE setting and the original image URL.
These field names are the defaults, but they can be overridden with the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings. The source and target fields defined by these two settings do not need to be different; they can have the same name, which saves you from defining an additional field in the item. In that case, the Images Pipeline simply overwrites the previously extracted data with the data it generates (a dict that already includes the original URL).
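To make the two fields concrete, the sketch below builds a plain dict item with the default field names and shows the general shape of the data written to the result field. The fill_images_field helper is hypothetical: it only imitates the outcome described above (origin URL plus storage path), performs no download, and the real pipeline may record additional details for each image.

```python
import hashlib


def build_item(name, image_urls):
    """Create a dict item with the default source and result fields."""
    return {"name": name, "image_urls": list(image_urls), "images": []}


def fill_images_field(item, urls_field="image_urls", result_field="images"):
    """Hypothetical stand-in for the pipeline: records URL and storage path."""
    results = []
    for url in item[urls_field]:
        image_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
        # Path is relative to IMAGES_STORE; the origin URL is kept alongside it
        results.append({"url": url, "path": f"full/{image_id}.jpg"})
    item[result_field] = results
    return item


item = build_item("Example product", ["https://example.com/a.jpg"])
fill_images_field(item)

# Same field as source and target: the URL list is overwritten by the results
shared = {"image_urls": ["https://example.com/a.jpg"]}
fill_images_field(shared, urls_field="image_urls", result_field="image_urls")
```

In the second case the original URL is not lost, because each generated entry carries it in its url key.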
Configuring the Images Pipeline
File expiration
The Images Pipeline avoids downloading files that were downloaded recently. To adjust this retention period, use the IMAGES_EXPIRES setting, which specifies the delay in days:
IMAGES_EXPIRES = 30
The default value is 90 days.
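The expiration rule can be pictured as a simple age comparison. The sketch below is an illustration under that assumption, not the pipeline's actual code; needs_download and its parameters are hypothetical names.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def needs_download(last_modified, expires_days=90, now=None):
    """Return True if a stored file is older than the retention window.

    last_modified: POSIX timestamp of the stored file, or None if not stored.
    expires_days: retention delay in days (mirrors IMAGES_EXPIRES).
    """
    if last_modified is None:
        return True  # nothing stored yet
    now = time.time() if now is None else now
    age_days = (now - last_modified) / SECONDS_PER_DAY
    return age_days > expires_days


now = 1_700_000_000  # fixed timestamp so the example is reproducible
print(needs_download(now - 10 * SECONDS_PER_DAY, expires_days=30, now=now))  # False
print(needs_download(now - 45 * SECONDS_PER_DAY, expires_days=30, now=now))  # True
```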
Thumbnail generation for images
The Images Pipeline can automatically create thumbnails of the downloaded images.
To use this feature, set IMAGES_THUMBS to a dictionary where the keys are the thumbnail names and the values are their dimensions:
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
When you use this feature, the Images Pipeline will create a thumbnail of each specified size using this format:
<IMAGES_STORE>/thumbs/<size_name>/<image_id>.jpg
Where:
- <size_name> is one of the keys of the IMAGES_THUMBS dictionary (small, big, etc.)
- <image_id> is the SHA1 hash of the image URL
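For illustration, the thumbnail locations for one image can be computed from the IMAGES_THUMBS keys and the URL hash. The thumb_paths helper and the example URL are hypothetical; the sketch only reproduces the path format given above and does not resize anything.

```python
import hashlib

IMAGES_THUMBS = {"small": (50, 50), "big": (270, 270)}


def thumb_paths(image_url, thumbs=IMAGES_THUMBS):
    """Map each thumbnail name to its path relative to IMAGES_STORE."""
    image_id = hashlib.sha1(image_url.encode("utf-8")).hexdigest()
    return {name: f"thumbs/{name}/{image_id}.jpg" for name in thumbs}


# Hypothetical URL, used only for illustration
for name, path in thumb_paths("https://example.com/product.jpg").items():
    print(name, path)
```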
Filtering out small images
The Images Pipeline can drop images that are too small. Specify the minimum allowed size in the IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH settings:
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
It is possible to set just one size constraint or both. When setting both of them, only images that satisfy both minimum sizes will be saved. By default, there are no size constraints, so all images are processed.
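The way the two constraints combine can be sketched as a small predicate. The function name passes_min_size is hypothetical, and the sketch assumes that an image exactly at the minimum is kept and that a minimum of 0 means no constraint.

```python
def passes_min_size(width, height, min_width=0, min_height=0):
    """Return True if the image satisfies every configured minimum."""
    # A minimum of 0 leaves that dimension unconstrained
    return width >= min_width and height >= min_height


print(passes_min_size(200, 100, min_width=110, min_height=110))  # False: too short
print(passes_min_size(200, 150, min_width=110, min_height=110))  # True
print(passes_min_size(200, 100, min_width=110))                  # True: width only
```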
Allowing redirections
By default, media pipelines ignore redirects: if a request for a media file URL is answered with an HTTP redirect, the media download is considered failed.
To handle media redirections, set the following setting to True:
MEDIA_ALLOW_REDIRECTS = True