Storing Images - Only option is S3?

Posted over 7 years ago by Max

Answered

Max

Hi All,

Images are stored in the local directory for my project.

Can these files be downloaded along with the items I scraped?

Thanks

0 Votes

grajagopalan posted over 7 years ago Admin Best Answer

Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API.

0 Votes

12 Comments

tom posted about 7 years ago Admin

Well, you would use feed exports for that. You can also publish this as a dataset to get the images as well. S3 is pretty cheap: you can use 5 GB for free, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info. Just make sure it is secure.

1 Votes

Sergey Glebov posted about 7 years ago

Hello again,

OK, sorry guys for muddling the topic with two slightly different problems; I thought akochanowicz had the same issue :-)

So, what if there's no Amazon S3, just file storage:

- What base path should I set for the image pipeline if I use Scrapinghub? Is it possible to use the Scrapinghub filesystem for the images? Are there restrictions for free accounts?

- How do I download the images afterwards? I have set the path as

IMAGES_STORE = 'images', and it looks like the images get downloaded successfully. I receive entries like:

url	https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg
path	full/0560893e824ad2875ce18b53c4c09003831e51be.jpg
checksum	07e0507aeaeddb9360b96ffc0234424c

So how do I fetch such an image?

Thanks!

0 Votes

tom posted about 7 years ago Admin

Great. For some examples of doing this, see http://www.scrapingauthority.com/2016/09/19/scrapy-exporting-json-and-csv/

Another relevant article is https://support.scrapinghub.com/solution/articles/22000200447-exporting-scraped-items-to-an-aws-s3-account-ui-mode-

0 Votes

akochanowicz posted about 7 years ago

Yep, this looks about right. Cool, thanks!

0 Votes

tom posted about 7 years ago Admin

OK, I see. Your best bet is to use Feed Exports: https://doc.scrapy.org/en/latest/topics/feed-exports.html. I think that is what you are looking for.

Another option is to use the boto3 library to upload these items directly to S3. You can use an item pipeline to do that. The Boto docs are at http://boto3.readthedocs.io/en/latest/
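
As a rough illustration of the item pipeline idea above, here is a minimal sketch. It assumes the images were already saved to a local IMAGES_STORE directory by the images pipeline and that each item carries the default "images" list. The class name and the S3_UPLOAD_BUCKET / S3_UPLOAD_PREFIX settings are made up for this example; they are not Scrapy settings. Credentials are left to boto3's normal lookup.

```python
import os


class S3ImageUploadPipeline:
    """Uploads files already downloaded by the images pipeline to S3."""

    def __init__(self, bucket, prefix, local_store, client=None):
        self.bucket = bucket
        self.prefix = prefix.strip("/")
        self.local_store = local_store
        # A client can be injected (e.g. for testing); otherwise one is
        # created when the spider opens.
        self.client = client

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            bucket=settings.get("S3_UPLOAD_BUCKET"),      # hypothetical setting
            prefix=settings.get("S3_UPLOAD_PREFIX", ""),  # hypothetical setting
            local_store=settings.get("IMAGES_STORE"),
        )

    def open_spider(self, spider):
        if self.client is None:
            import boto3  # imported lazily so the module loads without boto3
            self.client = boto3.client("s3")

    def process_item(self, item, spider):
        for image in item.get("images") or []:
            local_path = os.path.join(self.local_store, image["path"])
            if self.prefix:
                key = f"{self.prefix}/{image['path']}"
            else:
                key = image["path"]
            # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
            self.client.upload_file(local_path, self.bucket, key)
        return item
```

To try it, the class would be listed in ITEM_PIPELINES with a higher number than the images pipeline so that it runs after the files exist on disk.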

1 Votes

akochanowicz posted about 7 years ago

Tom,

Thanks for the quick reply. I did see this doc, but I don't think it covers what I'm looking for. I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape. For example, my current code returns a scrapy.Field that contains this dictionary:

{"url":"http://www.locatoronline.com/photos/fullsize/386978_1.jpg","path":"full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg","checksum":"17341778f9acfb7b5b384fe1bc532e49"}

I'd like to create another scrapy.Field that prepends my base S3 URL to "path". Currently I have to do this in Python after the scrape is complete. So, for this example I'd have another field, say imagePath, that contains:

https://s3.amazonaws.com/my-bucket/my-folder/full/1c11194780b4d4ef06cf5381a285f4045944f1b1.jpg
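
One way this might be done inside the scraper is a small item pipeline that runs after the images pipeline and builds the field from each "path". This is only a sketch: the class name and S3_BASE_URL constant are invented for illustration, and it assumes the images pipeline writes its results to the default "images" field.

```python
# Hypothetical base URL; replace with your own bucket and folder.
S3_BASE_URL = "https://s3.amazonaws.com/my-bucket/my-folder/"


class ImagePathPipeline:
    """Adds an imagePath field with one full URL per downloaded image."""

    def process_item(self, item, spider):
        base = S3_BASE_URL.rstrip("/")
        images = item.get("images") or []
        # Join base and relative path with exactly one slash between them.
        item["imagePath"] = [
            base + "/" + image["path"].lstrip("/") for image in images
        ]
        return item
```

The pipeline needs a higher number in ITEM_PIPELINES than the images pipeline, since "images" is only filled in once that pipeline has finished. If the item is a scrapy.Item subclass, imagePath also has to be declared as a field on it.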

0 Votes

tom posted about 7 years ago Admin

If you want the actual image, then you would need to add an image pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
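
For reference, a minimal settings sketch for enabling the built-in images pipeline with an S3 destination could look like the following. The bucket and folder names are placeholders, and reading the credentials from environment variables is just one possible choice. The linked docs describe the extra dependencies (Pillow, plus an AWS client library for S3 storage).

```python
# settings.py (sketch; bucket and folder names are placeholders)
import os

ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# A local directory such as "images" also works here; an s3:// URI makes
# the pipeline store the files in the bucket instead.
IMAGES_STORE = "s3://my-bucket/my-folder/"

# Credentials used by Scrapy for S3 storage, taken from the environment
# in this sketch.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# Items are expected to provide an "image_urls" field; the pipeline then
# fills "images" with url, path and checksum entries like the ones
# quoted in this thread.
```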

0 Votes

akochanowicz posted about 7 years ago

@frasl,

Your full S3 path would need to be prepended to the "path" returned in your items. So something like:

url = "https://s3.amazonaws.com/your-bucket/your-folder/" + path

What I would like to know is whether there is a way to construct this programmatically in the scraper code itself. It seems like the "images" object is not available until after I've uploaded my code to Scrapinghub?

0 Votes

Sergey Glebov posted about 7 years ago

Hi everyone,

I have the same question too.

So, images are scraped as 3 fields (url, path, checksum), but the question is how to download the image binary.

I suspect that "path" needs to be appended to something, but I failed to construct it myself.

Thanks!

0 Votes

Max posted about 7 years ago

I'm not sure if you can see this, but https://app.scrapinghub.com/p/230228/2/1/items

I can see the images field being populated with image information such as the hash, but I don't see the images themselves in the Dashboard or API?

0 Votes

vaz posted over 7 years ago

Hey MxLei,

I'm not sure I understand correctly what you expect. Could you explain a bit more? Thanks.

Pablo

0 Votes