
Storing Images - Only option is S3?

Hi All,


Images are stored in the local directory for my project.


Can these files be downloaded along with the items I scraped?


Thanks




Hey MxLei,


I'm not sure I understand correctly what you expect; could you explain a bit more? Thanks.


Pablo

Best Answer

Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API. 
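
For example, a job's items can be pulled with the python-scrapinghub client. A minimal sketch, where the API key and job ID are placeholders:

from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')  # your Scrapy Cloud API key
job = client.get_job('230228/2/1')          # <project>/<spider>/<job> key
for item in job.items.iter():               # each item comes back as a plain dict
    print(item.get('images'))               # the url/path/checksum entries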

I'm not sure if you can see this, but https://app.scrapinghub.com/p/230228/2/1/items

I see the images field being populated with image information such as the hash, but I don't see the images themselves in the dashboard or the API?

Hi everyone,

I have the same question too. 

So, images are scraped as three fields (url, path, checksum), but the question is how to download the image binary itself.

I suspect that "path" needs to be appended to something, but I haven't managed to construct the full URL myself.


Thanks!


@frasl,


The "path" returned in your items needs to be appended to your S3 base path. So something like:

 

url = "https://s3.amazonaws.com/your-bucket/your-folder/" + path

What I would like to know is whether there is a way to construct this programmatically in the scraper code itself. It seems like the "images" object is not populated until after I've uploaded my code to ScrapingHub?

If you want the actual image then you would need to add an images pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
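
Enabling it is essentially two settings plus an image_urls field on your item. A rough sketch, with the bucket path as a placeholder:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

# a local directory works too; the bucket below is a placeholder
IMAGES_STORE = 's3://your-bucket/your-folder/'

The pipeline reads URLs from the item's image_urls field and fills in images with the url/path/checksum entries you are seeing.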

Tom,


Thanks for the quick reply.  I did see this doc, but I don't think it covers what I'm looking for.  I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape.  For example, my current code returns a scrapy.Field that contains this dictionary:

{"url":"http://www.locatoronline.com/photos/fullsize/386978_1.jpg","path":"full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg","checksum":"17341778f9acfb7b5b384fe1bc532e49"}

I'd like to create another scrapy.Field that appends my base S3 URL to "path". Currently I have to do this in Python after the scrape is complete. So, for this example, I'd have another field, say imagePath, that contains:

https://s3.amazonaws.com/my-bucket/my-folder/full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg
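
What I'm imagining is a small subclass of the images pipeline. An untested sketch of the idea, with S3_BASE_URL as a made-up constant and imagePath declared as an extra field on my item:

from scrapy.pipelines.images import ImagesPipeline

# made-up constant; in practice this would match my bucket and folder
S3_BASE_URL = 'https://s3.amazonaws.com/my-bucket/my-folder/'

class ImagePathPipeline(ImagesPipeline):
    def item_completed(self, results, item, info):
        # let the stock pipeline fill in the usual "images" field first
        item = super(ImagePathPipeline, self).item_completed(results, item, info)
        # results is a list of (success, dict) pairs holding url/path/checksum
        item['imagePath'] = [S3_BASE_URL + img['path'] for ok, img in results if ok]
        return item

Is something like that the right direction?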

 

OK, I see. Your best bet is to use Feed Exports: https://doc.scrapy.org/en/latest/topics/feed-exports.html. I think that is what you are looking for.


Another option is to use the boto library to upload these items directly to S3. You can use an item pipeline to do that. The boto3 docs are at http://boto3.readthedocs.io/en/latest/
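
A rough sketch of such a pipeline with boto3, where the bucket name and key layout are placeholders:

import hashlib
import json

import boto3

class S3ItemPipeline(object):
    def open_spider(self, spider):
        # boto3 picks up credentials from the environment or your AWS config
        self.s3 = boto3.client('s3')

    def process_item(self, item, spider):
        body = json.dumps(dict(item))
        # one JSON object per item, keyed by a hash of its contents (just an example)
        key = 'items/%s.json' % hashlib.sha1(body.encode('utf-8')).hexdigest()
        self.s3.put_object(Bucket='your-bucket', Key=key, Body=body)
        return item

Don't forget to add it to ITEM_PIPELINES in your settings.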



Yep, this looks about right. Cool, thanks!

Hello again,


OK, sorry guys for cluttering the topic with two slightly different problems; I thought akochanowicz had the same issue :-)

So, what if there's no Amazon S3, just plain file storage?

- What base path should I set for the image pipeline if I use Scrapinghub? Is it possible to use the Scrapinghub filesystem for the images? Are there restrictions for free accounts?

- How do I download the images afterwards? I have set the path as IMAGES_STORE = 'images', and it looks like images get downloaded successfully; I receive entries like:


url: https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg
path: full/0560893e824ad2875ce18b53c4c09003831e51be.jpg
checksum: 07e0507aeaeddb9360b96ffc0234424c


So how do I fetch such an image? 


Thanks!


Well, you would use feed exports for that. You can also publish this as a dataset to get the images. S3 is pretty cheap; you can use 5 GB for free, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info; just make sure it is secure.
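
For reference, the feed export side is just a few settings. A sketch with placeholder credentials and bucket (older Scrapy versions use FEED_URI/FEED_FORMAT; newer ones use the FEEDS dict instead):

# settings.py
AWS_ACCESS_KEY_ID = 'YOUR_KEY'
AWS_SECRET_ACCESS_KEY = 'YOUR_SECRET'

FEED_URI = 's3://your-bucket/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'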

