Hi All,
Images are stored in the local directory for my project.
Can these files be downloaded along with the items I scraped?
Thanks
0 Votes
grajagopalan posted about 7 years ago Admin Best Answer
Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API.
0 Votes
12 Comments
vaz posted about 7 years ago
Hey MxLei,
I'm not sure to understand correctly what you expect, could you explain a bit more? Thanks.
Pablo
0 Votes
Max posted about 7 years ago
I'm not sure if you can see this, but: https://app.scrapinghub.com/p/230228/2/1/items
I see the images field being populated with image information such as the hash, but I don't see the actual images in the dashboard or API.
0 Votes
Sergey Glebov posted almost 7 years ago
Hi everyone,
I have the same question too.
So, the images are scraped as 3 fields (url, path, checksum), but the question is how to download the image binary.
I suspect that "path" needs to be appended to some base URL, but I failed to construct it myself.
Thanks!
0 Votes
akochanowicz posted almost 7 years ago
@frasl,
Your full S3 base path would need to be prepended to the "path" returned in your items. So something like:
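Roughly, with a made-up bucket URL for illustration:

# Hypothetical bucket URL; in practice this is wherever IMAGES_STORE points.
S3_BASE_URL = 'https://my-bucket.s3.amazonaws.com/'

# "path" comes from the images pipeline output, e.g. "full/<sha1>.jpg"
full_image_url = S3_BASE_URL + item['images'][0]['path']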
What I would like to know is if there is a way to construct this programmatically in the scraper code itself. It seems like the "images" object is not available until after I've uploaded my code to ScrapingHub?
0 Votes
tom posted almost 7 years ago Admin
If you want the actual image then you would need to add an image pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
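A minimal sketch of what that doc describes (the store location below is a placeholder; it can be a local folder or an s3:// bucket):

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 's3://my-bucket/images/'  # or a local path such as 'images'

# The item needs an image_urls field as input; the pipeline fills the
# images field (url, path, checksum) once the files are downloaded.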
0 Votes
akochanowicz posted almost 7 years ago
Tom,
Thanks for the quick reply. I did see this doc, but I don't think it covers what I'm looking for. I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape. For example, my current code returns a scrapy.Field that contains this dictionary:
I'd like to create another scrapy.Field that prepends my base S3 URL to "path". Currently I have to do this in Python after the scrape is complete. So, for this example I'd have another field, say imagePath, that contains:
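One possible way to build that field at scrape time is a small item pipeline that runs after the images pipeline; the class name, bucket URL, and priority numbers below are assumptions for illustration, not something prescribed in this thread:

# pipelines.py (sketch)
class ImagePathPipeline:
    S3_BASE_URL = 'https://my-bucket.s3.amazonaws.com/'  # placeholder

    def process_item(self, item, spider):
        images = item.get('images') or []
        if images:
            # imagePath must also be declared as a scrapy.Field() on the item
            item['imagePath'] = self.S3_BASE_URL + images[0]['path']
        return item

# settings.py: give it a larger priority number than ImagesPipeline so
# that "images" is already populated when it runs.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
    'myproject.pipelines.ImagePathPipeline': 300,
}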
0 Votes
tom posted almost 7 years ago Admin
OK, I see. Your best bet is to use the Feed Exports (https://doc.scrapy.org/en/latest/topics/feed-exports.html); I think that is what you are looking for.
Another option is to use the boto3 library to upload these items to S3 directly. You can use an item pipeline to do that. Boto3 docs: http://boto3.readthedocs.io/en/latest/
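As a rough sketch of the feed-export route, following the doc linked above (bucket name and credentials are placeholders):

# settings.py
FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-bucket/%(name)s/%(time)s.jl'
AWS_ACCESS_KEY_ID = '<your key>'
AWS_SECRET_ACCESS_KEY = '<your secret>'

For the boto3 route, an item pipeline's process_item would typically call something like boto3.client('s3').upload_file(local_path, bucket, key) for each downloaded file.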
1 Vote
akochanowicz posted almost 7 years ago
Yep, this looks about right. Cool, thanks!
0 Votes
tom posted almost 7 years ago Admin
Great. Some examples of doing this are at http://www.scrapingauthority.com/2016/09/19/scrapy-exporting-json-and-csv/
Another relevant article is https://support.scrapinghub.com/solution/articles/22000200447-exporting-scraped-items-to-an-aws-s3-account-ui-mode-
0 Votes
Sergey Glebov posted almost 7 years ago
Hello again,
OK, sorry guys for muddling the topic with two slightly different problems; I thought akochanowicz had the same issue :-)
So, what if there's no Amazon S3, just file storage:
- What base path should I set for the image pipeline if I use Scrapinghub? Is it possible to use the Scrapinghub filesystem for the images? Or are there restrictions for free accounts?
- How do I download the images afterwards? I have set the path as
IMAGES_STORE = 'images', and it looks like the images get downloaded successfully; I receive entries like:
url: https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg
path: full/0560893e824ad2875ce18b53c4c09003831e51be.jpg
checksum: 07e0507aeaeddb9360b96ffc0234424c
So how do I fetch such an image?
Thanks!
0 Votes
tom posted almost 7 years ago Admin
Well, you would use feed exports for that. You can also publish this as a dataset to get the images. S3 is pretty cheap; you can use 5 GB for free, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info; just make sure it is secure.
1 Vote