rhiaro
Hi there,
When I run my spiders locally, they download JSON files from some API endpoints and save them to disk (using the Files pipeline component). When I run them in Scrapy Cloud, I can see each item with the file's URL and path set on it, but nowhere can I find the contents of the file itself. The only options I see are for downloading a dump of each item's metadata.
Thanks!
Best Answer
nestor said over 5 years ago
There's no write access to Scrapy Cloud. Instead, you'll need to use one of the alternative file storage backends supported by the Files pipeline: S3 or GCS.
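For reference, a minimal settings.py sketch of what that looks like; the bucket name, path, and credentials below are placeholders rather than values from this thread:

    # settings.py -- enable the Files pipeline and point it at S3 instead of local disk
    ITEM_PIPELINES = {
        "scrapy.pipelines.files.FilesPipeline": 1,
    }

    # Files are written under this prefix instead of a local folder
    FILES_STORE = "s3://my-files-bucket/json-dumps/"

    # Credentials used by Scrapy's S3 storage (botocore, or boto on older Scrapy versions, is required)
    AWS_ACCESS_KEY_ID = "<access-key-id>"
    AWS_SECRET_ACCESS_KEY = "<secret-access-key>"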
rhiaro said over 5 years ago
Thanks for replying, nestor. When I use the Files pipeline, it seems to be downloading the files successfully. Or at least, the metadata implies it is, and the rest of my script can read and use them... wherever they are. So there must be somewhere to retrieve them from.
nestor said over 5 years ago
Could you share a job ID?
Well, you do have access to the /scrapinghub and /tmp folders, but they get cleared after the job runs, so it would still make sense to export to external storage anyway.
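In other words, a setup like the following only appears to work because those paths are writable for the duration of the job (a sketch, assuming the Files pipeline is already enabled):

    # settings.py -- writable while the job runs, but wiped when the container ends
    FILES_STORE = "/tmp/files"   # or a path under /scrapinghub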
rhiaro said over 5 years ago
Once the file is downloaded, it's read back from disk and posted to an API endpoint. The other end is receiving it. I could share a job ID, but how would that give you access to anything useful?
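A rough sketch of that read-back-and-post step as described; the endpoint URL and helper name here are hypothetical, not taken from the thread. The Files pipeline records each download on the item under the "files" field, with "path" relative to FILES_STORE:

    import os
    import requests  # assuming an HTTP client like requests is used for the upload

    FILES_STORE = "/tmp/files"  # must match the value in settings.py

    def post_downloaded_file(item, endpoint="https://example.org/api/upload"):  # hypothetical endpoint
        """Read the file saved by the Files pipeline and POST it to the API."""
        # FilesPipeline stores results as dicts with "url", "path" and "checksum"
        relative_path = item["files"][0]["path"]
        with open(os.path.join(FILES_STORE, relative_path), "rb") as f:
            response = requests.post(endpoint, files={"file": f})
        response.raise_for_status()

This works inside the job because the file still exists on the container's local disk at that point.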
nestor said over 5 years ago
Read my last comment: you have write access to the /scrapinghub and /tmp folders, which is why you are able to use them. But once the job ends, the container (Scrapy Cloud unit) gets wiped, so you need to export the files somewhere else before the job ends; use the built-in support for S3 or GCS.
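If GCS is the preferred backend rather than S3, the equivalent settings sketch would be along these lines (bucket and project ID are placeholders; the google-cloud-storage package is required):

    # settings.py -- Google Cloud Storage backend for the Files pipeline
    FILES_STORE = "gs://my-files-bucket/json-dumps/"
    GCS_PROJECT_ID = "my-project-id"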
rhiaro said over 5 years ago
I sent my reply before I saw that you had edited your last comment to include that information.
nestor said over 5 years ago
I see. Let me know if you have further questions or if there's anything else I can assist you with.
- Unable to select Scrapy project in GitHub
- ScrapyCloud can't call spider?
- Unhandled error in Deferred
- Item API - Filtering
- newbie to web scraping but need data from zillow
- ValueError: Invalid control character
- Cancelling account
- Best Practices
- Beautifulsoup with ScrapingHub
- Delete a project in ScrapingHub
See all 458 topics