
How can I save the data in Scrapinghub?

My `stockInfo.py` contains:

    import scrapy
    import re
    import pkgutil

    class QuotesSpider(scrapy.Spider):
        name = "stockInfo"
        # Read the list of start URLs bundled inside the "tutorial" package.
        data = pkgutil.get_data("tutorial", "resources/urls.txt")
        data = data.decode()
        start_urls = data.split("\r\n")

        def parse(self, response):
            # Extract the six-digit stock code from the URL and
            # save the raw page body to a local HTML file.
            company = re.findall("[0-9]{6}", response.url)[0]
            filename = '%s_info.html' % company
            with open(filename, 'wb') as f:
                f.write(response.body)

To execute the `stockInfo` spider from the Windows command prompt:

    d:
    cd  tutorial
    scrapy crawl stockInfo

Now every web page for the URLs in `resources/urls.txt` is downloaded to the local directory `d:/tutorial`.

I then deployed the spider to `Scrapinghub` and ran the `stockInfo` spider there.

No error occurred, but where are the downloaded web pages?
How are the following lines executed on `Scrapinghub`?
 
            with open(filename, 'wb') as f:
                f.write(response.body)


How can I save the data in Scrapinghub, and download it from Scrapinghub when the job is finished?




Best Answer

There's no persistent write access on Scrapy Cloud. You do have access to the /scrapinghub and /tmp folders, but they are cleared after the job run. Instead, you'll need to use one of the supported external storages, such as the Files pipeline or a feed export to S3 or GCS, as described in https://docs.scrapy.org/en/latest/topics/feed-exports.html#storages and https://docs.scrapy.org/en/latest/topics/media-pipeline.html?highlight=gcs#supported-storage.
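As a minimal sketch (not from the original post), the simplest change is to yield the page content as an item instead of writing a file. Items are stored by Scrapy Cloud and can be downloaded (e.g. as JSON or CSV) from the job once it has finished:

    import scrapy
    import re
    import pkgutil

    class QuotesSpider(scrapy.Spider):
        name = "stockInfo"
        data = pkgutil.get_data("tutorial", "resources/urls.txt").decode()
        start_urls = data.split("\r\n")

        def parse(self, response):
            company = re.findall("[0-9]{6}", response.url)[0]
            # Yield the page as an item instead of open()/write();
            # Scrapy Cloud keeps items and lets you download them
            # from the job's Items tab or through its API.
            yield {
                "company": company,
                "url": response.url,
                "body": response.text,
            }

Alternatively, items can be exported straight to external storage with a feed export configured in `settings.py`. The bucket name and credentials below are placeholders, and on older Scrapy versions you may need `FEED_URI`/`FEED_FORMAT` instead of the `FEEDS` setting:

    # settings.py -- placeholders, replace with your own bucket and credentials
    FEEDS = {
        "s3://my-bucket/%(name)s/%(time)s.json": {"format": "json"},
    }
    AWS_ACCESS_KEY_ID = "your-access-key"
    AWS_SECRET_ACCESS_KEY = "your-secret-key"

Either way, the `open(filename, 'wb')` block can be dropped, since local files written during a Scrapy Cloud job are not kept.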

