Max
Hi All,
Images are stored in the local directory for my project.
Can these files be downloaded along with the items I scraped?
Thanks
tom
Well, you would use feed exports for that. You can also publish this as a dataset to get the images as well. S3 is pretty cheap; the free tier gives you 5 GB, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info. Just make sure it is secure.
tom
OK, I see. Your best bet is to use Feed Exports: https://doc.scrapy.org/en/latest/topics/feed-exports.html. I think that is what you are looking for.
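A minimal sketch of what that could look like in settings.py (the bucket name "my-bucket" and the credentials are placeholders):

# settings.py -- export scraped items to S3 via feed exports
# ("my-bucket" is a placeholder; %(name)s and %(time)s are filled in by Scrapy)
FEED_FORMAT = 'json'
FEED_URI = 's3://my-bucket/items/%(name)s/%(time)s.json'

# credentials used by the S3 feed storage backend
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'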
Another option is to use the boto library to upload these items directly to S3. You can use an item pipeline to do that. The Boto docs are at http://boto3.readthedocs.io/en/latest/
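A rough sketch of such a pipeline, assuming boto3 is installed, credentials are configured in the environment, and the bucket name and key layout are placeholders:

# pipelines.py -- sketch: upload each scraped item to S3 with boto3
import json
import boto3

class S3ItemPipeline(object):
    def open_spider(self, spider):
        self.s3 = boto3.client('s3')  # credentials come from env/AWS config
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        key = 'items/%s/%d.json' % (spider.name, self.count)  # placeholder layout
        self.s3.put_object(
            Bucket='my-bucket',  # placeholder bucket
            Key=key,
            Body=json.dumps(dict(item)).encode('utf-8'),
        )
        return item

It would then be enabled via ITEM_PIPELINES in settings.py, e.g. {'myproject.pipelines.S3ItemPipeline': 300}.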
Sergey Glebov
Hello again,
OK, sorry, guys, for muddling the topic with 2 slightly different problems; I thought akochanowicz had the same issue :-)
So, what if there's no Amazon S3, just file storage:
- What base path should I set for the image pipeline if I use Scrapinghub? Is it possible to use the Scrapinghub filesystem for the images? Maybe there are restrictions for free accounts?
- How do I download the images afterwards? I have set the path as IMAGES_STORE = 'images', and it looks like the images get downloaded successfully; I receive entries like:
{'url': 'https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg',
 'path': 'full/0560893e824ad2875ce18b53c4c09003831e51be.jpg',
 'checksum': '07e0507aeaeddb9360b96ffc0234424c'}
So how do I fetch such an image?
Thanks!
vaz
Hey MxLei,
I'm not sure I understand what you expect; could you explain a bit more? Thanks.
Pablo
Max
I'm not sure if you can see this, but: https://app.scrapinghub.com/p/230228/2/1/items
I see the images field being populated with image information such as the hash, but I don't see the images themselves in the dashboard or API?
Sergey Glebov
Hi everyone,
I have the same question.
So, images are scraped as 3 fields - url, path, checksum - but the question is how to download the image binary?
I suspect that "path" needs to be appended to something, but I failed to construct it myself.
Thanks!
akochanowicz
@frasl,
Your full S3 path would need to be appended to the "path" returned in your items. So something like:
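(A sketch of what that concatenation could look like; the bucket URL is hypothetical and would need to match your IMAGES_STORE setting:)

# placeholder bucket URL + the "path" value from the scraped item
image_url = 'https://my-bucket.s3.amazonaws.com/' + item['images'][0]['path']
# e.g. https://my-bucket.s3.amazonaws.com/full/0560893e824ad2875ce18b53c4c09003831e51be.jpg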
What I would like to know is if there is a way to construct this programmatically in the scraper code itself. It seems like the "images" object is not available until after I've uploaded my code to ScrapingHub?
tom
If you want the actual image then you would need to add an image pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
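For reference, enabling it looks roughly like this in settings.py (the store location is whatever you choose; image_urls and images are the default field names from the docs above):

# settings.py -- enable Scrapy's built-in images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 's3://my-bucket/images/'  # or a local directory such as 'images'

# the item then needs an image_urls field (input URLs) and an
# images field (output with url/path/checksum per downloaded image)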
akochanowicz
Tom,
Thanks for the quick reply. I did see this doc, but I don't think it covers what I'm looking for. I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape. For example, my current code returns a scrapy.Field that contains this dictionary:
I'd like to create another scrapy.Field that appends my base S3 URL to "path". Currently I have to do this in Python after the scrape is complete. So, for this example, I'd have another field, say imagePath, that contains:
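One possible way to get that in the scraper itself (a sketch, not an official recipe; the base URL is a placeholder that would match your IMAGES_STORE bucket) is a small pipeline ordered after ImagesPipeline:

# pipelines.py -- sketch: build an imagePath field from the images output
S3_BASE_URL = 'https://my-bucket.s3.amazonaws.com/'  # placeholder

class ImagePathPipeline(object):
    def process_item(self, item, spider):
        if item.get('images'):
            item['imagePath'] = S3_BASE_URL + item['images'][0]['path']
        return item

In ITEM_PIPELINES it would need a higher number than ImagesPipeline, so the images field is already populated when it runs (and imagePath must be declared on the Item).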
akochanowicz
Yep, this looks about right. Cool, thanks!
tom
Great. Some examples of doing this are at http://www.scrapingauthority.com/2016/09/19/scrapy-exporting-json-and-csv/
Another relevant article is https://support.scrapinghub.com/solution/articles/22000200447-exporting-scraped-items-to-an-aws-s3-account-ui-mode-
grajagopalan
Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API.
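For the API route, a sketch in Python using the project/spider/job IDs from Max's link above (the API key is a placeholder):

# fetch a job's items from the Scrapy Cloud storage API
import requests

resp = requests.get(
    'https://storage.scrapinghub.com/items/230228/2/1',
    params={'format': 'json'},
    auth=('YOUR_API_KEY', ''),  # API key as the username, empty password
)
items = resp.json()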
- Unable to select Scrapy project in GitHub
- ScrapyCloud can't call spider?
- Unhandled error in Deferred
- Item API - Filtering
- newbie to web scraping but need data from zillow
- ValueError: Invalid control character
- Cancelling account
- Best Practices
- Beautifulsoup with ScrapingHub
- Delete a project in ScrapingHub