Max
Hi All,
Images are stored in the local directory for my project.
Can these files be downloaded along with the items I scraped?
Thanks
tom
Well, you would use feed exports for that. You can also publish this as a dataset to get the images as well. S3 is pretty cheap; the free tier gives you 5 GB, which may cover your needs. There are other options, such as sending the images to a database of your choice, even one running locally on your computer, as long as you provide the correct connection info. Just make sure it is secure.
tom
OK, I see. Your best bet is to use Feed Exports: https://doc.scrapy.org/en/latest/topics/feed-exports.html. I think that is what you are looking for.
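A minimal sketch of what that could look like in settings.py (the bucket name "my-bucket" and the credentials are placeholders):

# settings.py -- export scraped items to S3 via feed exports
# ("my-bucket" is a placeholder; %(name)s and %(time)s are filled in by Scrapy)
FEED_FORMAT = 'json'
FEED_URI = 's3://my-bucket/items/%(name)s/%(time)s.json'

# credentials used by the S3 feed storage backend
AWS_ACCESS_KEY_ID = 'your-access-key'
AWS_SECRET_ACCESS_KEY = 'your-secret-key'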
Another option is to use the boto library to upload these items directly to S3. You can use an item pipeline to do that. The Boto docs are at http://boto3.readthedocs.io/en/latest/
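A rough sketch of such a pipeline, assuming boto3 is installed, credentials are configured in the environment, and the bucket name and key layout are placeholders:

# pipelines.py -- sketch: upload each scraped item to S3 with boto3
import json
import boto3

class S3ItemPipeline(object):
    def open_spider(self, spider):
        self.s3 = boto3.client('s3')  # credentials come from env/AWS config
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        key = 'items/%s/%d.json' % (spider.name, self.count)  # placeholder layout
        self.s3.put_object(
            Bucket='my-bucket',  # placeholder bucket
            Key=key,
            Body=json.dumps(dict(item)).encode('utf-8'),
        )
        return item

It would then be enabled via ITEM_PIPELINES in settings.py, e.g. {'myproject.pipelines.S3ItemPipeline': 300}.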
Sergey Glebov
Hello again,
OK, sorry, guys, for muddling the topic with 2 slightly different problems; I thought akochanowicz had the same issue :-)
So, what if there's no Amazon S3, just file storage:
- What base path should I set for the image pipeline if I use Scrapinghub? Is it possible to use the Scrapinghub filesystem for the images? Maybe there are restrictions for free accounts?
- How do I download the images afterwards? I have set the path as IMAGES_STORE = 'images', and it looks like the images get downloaded successfully; I receive entries like:
{'url': 'https://content.onliner.by/automarket/1931405/original/bd52b6d32c504cceca65fc011eca07bb.jpeg',
 'path': 'full/0560893e824ad2875ce18b53c4c09003831e51be.jpg',
 'checksum': '07e0507aeaeddb9360b96ffc0234424c'}
So how do I fetch such an image?
Thanks!
vaz
Hey MxLei,
I'm not sure I understand what you expect; could you explain a bit more? Thanks.
Pablo
Max
I'm not sure if you can see this, but: https://app.scrapinghub.com/p/230228/2/1/items
I see the images field being populated with image information such as the hash, but I don't see the images themselves in the dashboard or API?
Sergey Glebov
Hi everyone,
I have the same question.
So, images are scraped as 3 fields - url, path, checksum - but the question is how to download the image binary?
I suspect that "path" needs to be appended to something, but I failed to construct it myself.
Thanks!
akochanowicz
@frasl,
Your full S3 path would need to be appended to the "path" returned in your items. So something like:
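(A sketch of what that concatenation could look like; the bucket URL is hypothetical and would need to match your IMAGES_STORE setting:)

# placeholder bucket URL + the "path" value from the scraped item
image_url = 'https://my-bucket.s3.amazonaws.com/' + item['images'][0]['path']
# e.g. https://my-bucket.s3.amazonaws.com/full/0560893e824ad2875ce18b53c4c09003831e51be.jpg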
What I would like to know is if there is a way to construct this programmatically in the scraper code itself. It seems like the "images" object is not available until after I've uploaded my code to ScrapingHub?
tom
If you want the actual image then you would need to add an image pipeline. See https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline for more info on doing this; there are a few examples there.
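For reference, enabling it looks roughly like this in settings.py (the store location is whatever you choose; image_urls and images are the default field names from the docs above):

# settings.py -- enable Scrapy's built-in images pipeline
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = 's3://my-bucket/images/'  # or a local directory such as 'images'

# the item then needs an image_urls field (input URLs) and an
# images field (output with url/path/checksum per downloaded image)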
akochanowicz
Tom,
Thanks for the quick reply. I did see this doc, but I don't think it covers what I'm looking for. I'm not looking for the actual image data, but rather a clean S3 path which can be used immediately after the scrape. For example, my current code returns a scrapy.Field that contains this dictionary:
I'd like to create another scrapy.Field that appends my base S3 URL to "path". Currently I have to do this in Python after the scrape is complete. So, for this example, I'd have another field, say imagePath, that contains:
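One possible way to get that in the scraper itself (a sketch, not an official recipe; the base URL is a placeholder that would match your IMAGES_STORE bucket) is a small pipeline ordered after ImagesPipeline:

# pipelines.py -- sketch: build an imagePath field from the images output
S3_BASE_URL = 'https://my-bucket.s3.amazonaws.com/'  # placeholder

class ImagePathPipeline(object):
    def process_item(self, item, spider):
        if item.get('images'):
            item['imagePath'] = S3_BASE_URL + item['images'][0]['path']
        return item

In ITEM_PIPELINES it would need a higher number than ImagesPipeline, so the images field is already populated when it runs (and imagePath must be declared on the Item).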
akochanowicz
Yep, this looks about right. Cool, thanks!
tom
Great. Some examples of doing this are at http://www.scrapingauthority.com/2016/09/19/scrapy-exporting-json-and-csv/
Another relevant article is https://support.scrapinghub.com/solution/articles/22000200447-exporting-scraped-items-to-an-aws-s3-account-ui-mode-
grajagopalan
Images scraped by the spider in Scrapy Cloud will be available as items. They can be downloaded from the Dashboard or API.
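For the API route, a sketch in Python using the project/spider/job IDs from Max's link above (the API key is a placeholder):

# fetch a job's items from the Scrapy Cloud storage API
import requests

resp = requests.get(
    'https://storage.scrapinghub.com/items/230228/2/1',
    params={'format': 'json'},
    auth=('YOUR_API_KEY', ''),  # API key as the username, empty password
)
items = resp.json()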
- Unable to select Scrapy project in GitHub
- ScrapyCloud can't call spider?
- Unhandled error in Deferred
- Item API - Filtering
- newbie to web scraping but need data from zillow
- ValueError: Invalid control character
- Cancelling account
- Best Practices
- Beautifulsoup with ScrapingHub
- Delete a project in ScrapingHub