Storing Images - Only option is S3?

Posted about 7 years ago by Max

Answered
Hi All,


Images are stored in the local directory for my project.


Can these files be downloaded along with the items I scraped?


Thanks



grajagopalan posted about 7 years ago Admin Best Answer

Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API. 
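To pull those items down programmatically, a minimal stdlib-only sketch against the Scrapy Cloud items API (the endpoint shape and the blank-password basic-auth convention are my reading of the service; the project/spider/job IDs mirror the `/p/<project>/<spider>/<job>` pattern from the dashboard URL):

```python
import base64
import json
from urllib.request import Request, urlopen

API_ROOT = "https://storage.scrapinghub.com/items"  # assumed Scrapy Cloud items endpoint

def items_url(project_id, spider_id, job_id):
    """Build the items-API URL for one job."""
    return f"{API_ROOT}/{project_id}/{spider_id}/{job_id}"

def fetch_items(project_id, spider_id, job_id, api_key):
    """Download a job's items (including the image url/path/checksum fields) as JSON."""
    # The API key is sent as the basic-auth username with a blank password.
    token = base64.b64encode(f"{api_key}:".encode()).decode()
    req = Request(items_url(project_id, spider_id, job_id) + "?format=json",
                  headers={"Authorization": "Basic " + token})
    with urlopen(req) as resp:
        return json.load(resp)
```

Swap in your own IDs and API key before calling `fetch_items`.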



12 Comments


vaz posted about 7 years ago

Hey MxLei,


I'm not sure I understand correctly what you expect; could you explain a bit more? Thanks.


Pablo




Max posted about 7 years ago

I'm not sure if you can see this, but https://app.scrapinghub.com/p/230228/2/1/items

I see the images field being populated with image information such as the hash, but I don't see the images themselves in the dashboard or API?



Sergey Glebov posted almost 7 years ago

Hi everyone,

I have the same question too. 

So, images are scraped as 3 fields - url, path, checksum - but the question is how to download the image binary?

I suspect that "path" needs to be appended to something, but I failed to construct it myself.


Thanks!




akochanowicz posted almost 7 years ago

@frasl,


The "path" returned in your items needs to be appended to your full S3 base path.  So something like:

 

url = "https://s3.amazonaws.com/your-bucket/your-folder/" + path

What I would like to know is whether there is a way to construct this programmatically in the scraper code itself.  It seems like the "images" object is not available until after I've uploaded my code to ScrapingHub?



tom posted almost 7 years ago Admin

If you want the actual image then you would need to add an images pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
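For anyone landing here, wiring that pipeline up is a settings change. A minimal sketch (the store location is a placeholder you would adjust to your own bucket or directory):

```python
# settings.py sketch: enable Scrapy's built-in images pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
# Where downloaded images go: a local directory, or an S3 URI such as
# "s3://my-bucket/my-folder/" (both placeholders).
IMAGES_STORE = "images"
```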



akochanowicz posted almost 7 years ago

Tom,


Thanks for the quick reply.  I did see this doc, but I don't think it covers what I'm looking for.  I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape.  For example, my current code returns a scrapy.Field that contains this dictionary:

{"url":"http://www.locatoronline.com/photos/fullsize/386978_1.jpg","path":"full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg","checksum":"17341778f9acfb7b5b384fe1bc532e49"}

I'd like to create another scrapy.Field that appends my base S3 URL to "path". Currently I have to do this in python after the scrape is complete.  So, for this example I'd have another field, say imagePath, that contains:

https://s3.amazonaws.com/my-bucket/my-folder/full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg
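One way to do that at scrape time is a small extra item pipeline ordered after the images pipeline. A sketch, where the class name, S3_BASE_URL, and the imagePath field are my own hypothetical names, not anything Scrapy defines:

```python
# Hypothetical base URL; would match wherever IMAGES_STORE uploads to.
S3_BASE_URL = "https://s3.amazonaws.com/my-bucket/my-folder/"

class ImagePathPipeline:
    """Runs after the images pipeline and prepends the S3 base URL
    to each relative "path" the images pipeline recorded."""

    def process_item(self, item, spider):
        item["imagePath"] = [S3_BASE_URL + img["path"]
                             for img in item.get("images", [])]
        return item
```

In settings, it would be registered in ITEM_PIPELINES with a priority number higher than the images pipeline's so it sees the populated "images" field.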

 



tom posted almost 7 years ago Admin

OK, I see. Your best bet is to use Feed Exports: https://doc.scrapy.org/en/latest/topics/feed-exports.html. I think that is what you are looking for.


Another option is to use the boto library to upload these items directly to S3. You can use an item pipeline to do that. The boto docs are at http://boto3.readthedocs.io/en/latest/
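A sketch of that boto option, with the bucket name, the local IMAGES_STORE directory, and the "images" field layout all assumed rather than taken from any real project:

```python
class S3UploadPipeline:
    """Item pipeline sketch: push each downloaded image file to S3 with boto3."""

    BUCKET = "my-bucket"    # placeholder bucket name
    IMAGES_DIR = "images/"  # assumed to match IMAGES_STORE

    def open_spider(self, spider):
        import boto3  # imported lazily so the class loads without boto3 installed
        self.s3 = boto3.client("s3")

    def process_item(self, item, spider):
        for img in item.get("images", []):
            # Local file written by the images pipeline, keyed by its relative path.
            local_path = self.IMAGES_DIR + img["path"]
            self.s3.upload_file(local_path, self.BUCKET, img["path"])
        return item
```

Credentials would come from the usual boto3 configuration (environment variables or AWS config files).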



akochanowicz posted almost 7 years ago

Yep, this looks about right. Cool, thanks!




Sergey Glebov posted almost 7 years ago

Hello again,


Ok, sorry guys for muddling the topic with two slightly different problems; I thought akochanowicz had the same issue :-)

So, what if there's no Amazon S3, just file storage:

- What base path do I set for the images pipeline if I use ScrapingHub? Is it possible to use the ScrapingHub filesystem for the images? Maybe there are restrictions for free accounts?

- How do I download the images afterwards? I have set IMAGES_STORE = 'images', and it looks like the images get downloaded successfully; I receive entries like:


url: https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg
path: full/0560893e824ad2875ce18b53c4c09003831e51be.jpg
checksum: 07e0507aeaeddb9360b96ffc0234424c


So how do I fetch such an image? 


Thanks!




tom posted almost 7 years ago Admin

Well, you would use feed exports for that. You can also publish this as a dataset to get the images as well. S3 is pretty cheap; you can use 5 GB for free, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info. Just make sure it is secure.
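For the feed-export route, a settings sketch; the bucket, folder, and credentials are placeholders, and FEED_URI / FEED_FORMAT are the feed-export settings from Scrapy versions of that era:

```python
# settings.py sketch: export the scraped items feed straight to S3.
FEED_FORMAT = "json"
# %(name)s and %(time)s are expanded by Scrapy per spider run.
FEED_URI = "s3://my-bucket/my-folder/%(name)s/%(time)s.json"
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY"       # placeholder
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_KEY"   # placeholder
```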

