hwypengsir
The `stockInfo.py` file contains:
```python
import scrapy
import re
import pkgutil


class QuotesSpider(scrapy.Spider):
    name = "stockInfo"
    # Read the list of URLs bundled inside the project package.
    data = pkgutil.get_data("tutorial", "resources/urls.txt")
    data = data.decode()
    start_urls = data.split("\r\n")

    def parse(self, response):
        # The 6-digit company code in the URL is used as the file name.
        company = re.findall("[0-9]{6}", response.url)[0]
        filename = '%s_info.html' % company
        with open(filename, 'wb') as f:
            f.write(response.body)
```
To execute the spider `stockInfo` in the Windows cmd:
```
d:
cd tutorial
scrapy crawl stockInfo
```
Now all the webpages for the URLs in `resources/urls.txt` are downloaded to the local PC's directory `d:/tutorial`.
Then I deploy the spider to `Scrapinghub` and run the `stockInfo` spider.
No error occurs, but where are the downloaded webpages?
How are the following lines executed on `Scrapinghub`?
```python
with open(filename, 'wb') as f:
    f.write(response.body)
```
How can I save the data in Scrapinghub and download it from Scrapinghub when the job is finished?
thriveni
There's no persistent write access to the filesystem on Scrapy Cloud. You do have access to the /scrapinghub and /tmp folders, but they are cleared after the job run. Instead, you'll need to use one of the supported file storages: the Files pipeline, or S3/GCS via Feed Exports, as described in https://docs.scrapy.org/en/latest/topics/feed-exports.html#storages and https://docs.scrapy.org/en/latest/topics/media-pipeline.html?highlight=gcs#supported-storage.
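For example, one way to adapt the spider is to yield items instead of writing files, and to export those items to S3 through Feed Exports. The sketch below is only illustrative: it assumes Scrapy 2.1+ (for the `FEEDS` setting) and `botocore` listed in the project's requirements, and the bucket name and credentials are placeholders, not real values.

```python
# stockInfo.py -- a minimal sketch: yield items rather than writing to disk,
# so the data lands in the job's items storage and in any configured feed.
import re
import pkgutil

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "stockInfo"
    data = pkgutil.get_data("tutorial", "resources/urls.txt").decode()
    start_urls = data.split("\r\n")

    def parse(self, response):
        company = re.findall("[0-9]{6}", response.url)[0]
        # Yielded items can be downloaded from the job's Items tab or API
        # after the run, and are also written to the configured feed.
        yield {
            "company": company,
            "url": response.url,
            "html": response.text,
        }
```

```python
# settings.py -- hypothetical feed-export configuration with placeholder
# bucket name and credentials; requires botocore in the project requirements.
FEEDS = {
    "s3://my-bucket/stock-info/%(name)s/%(time)s.jl": {
        "format": "jsonlines",
    },
}
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_KEY"
```

On older Scrapy versions the same effect can be achieved with the `FEED_URI` and `FEED_FORMAT` settings; either way, the downloaded pages are then fetched from S3 (or from the job's items) after the job finishes, instead of from the container's filesystem.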