
Finding files downloaded by spider in the UI

 Hi there,


When I run my spiders locally, they download JSON files from some API endpoints and save them to disk (using the Files pipeline component). When I run them in Scrapy Cloud, I can see each item with the file's URL and file path set, but I can't find the contents of the file anywhere. All the options I can find only download a dump of each item's metadata.
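For reference, my local setup is roughly this (paths and item fields are illustrative):

# settings.py (relevant parts only)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "/path/to/local/files"  # where the JSON files end up locally

# In the spider, each yielded item lists the endpoint URL in "file_urls";
# the pipeline downloads it and records the saved path under "files".
# yield {"file_urls": [json_url], "source": "example-endpoint"}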


Thanks!


Best Answer

There's no write access to Scrapy Cloud. Instead, you'll need to use one of the alternative file storage backends supported by the Files pipeline: S3 or GCS.
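For example, pointing the Files pipeline at S3 is just a settings change; the bucket name and credentials below are placeholders:

# settings.py - switch the Files pipeline from local disk to S3
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "s3://your-bucket/scrapy-files/"  # placeholder bucket/prefix
AWS_ACCESS_KEY_ID = "..."       # placeholder credentials
AWS_SECRET_ACCESS_KEY = "..."

# Or, for Google Cloud Storage:
# FILES_STORE = "gs://your-bucket/scrapy-files/"
# GCS_PROJECT_ID = "your-project-id"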



Thanks for replying, nestor. When I use the Files pipeline it seems to be downloading the files successfully. Or at least, the metadata implies it is, and the rest of my script can read and use them... wherever they are. So there must be somewhere to retrieve them from.

Could you share a job ID?

Well, you do have write access to the /scrapinghub and /tmp folders, but they get cleared after the job run, so it would still make sense to export to external storage anyway.

Once the file is downloaded, it's read back from disk and posted to an API endpoint. The other end is receiving it. I could share a job ID, but how would that give you access to anything useful?
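For context, the post-download step looks roughly like this (the endpoint and paths here are made up):

import os
import requests

FILES_STORE = "/path/to/local/files"  # same value as in settings.py

def post_downloaded_file(relative_path):
    # "relative_path" comes from the item's "files" metadata ("path" field)
    full_path = os.path.join(FILES_STORE, relative_path)
    with open(full_path, "rb") as f:
        requests.post("https://example.com/api/upload", files={"file": f})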

Read my last comment: you have write access to the /scrapinghub and /tmp folders, which is why you are able to use them. But once the job ends, the container (Scrapy Cloud unit) gets wiped, so you need to export the files somewhere else before the job ends. Use the built-in support for S3 or GCS.
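If you do keep writing files under /tmp, another option is to upload them yourself before the container is wiped, for example when the spider closes. A rough sketch using boto3 (not the built-in Files pipeline support), with a placeholder bucket and paths:

import os
import boto3
import scrapy

class ExportTmpSpider(scrapy.Spider):
    name = "export_tmp_example"  # hypothetical spider

    def parse(self, response):
        pass  # normal crawling/downloading logic goes here

    def closed(self, reason):
        # Called when the spider finishes, i.e. before the Scrapy Cloud unit is wiped.
        s3 = boto3.client("s3")
        local_dir = "/tmp/downloaded_files"  # wherever the job wrote its files
        for name in os.listdir(local_dir):
            s3.upload_file(
                os.path.join(local_dir, name),  # local file
                "your-bucket",                  # placeholder bucket name
                f"scrapy-jobs/{name}",          # destination key
            )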

I sent my reply before I saw that you had edited your last comment to include that information.

I see. Let me know if you have further questions or if there's anything else I can assist you with.
