Billy John said
I have a large file (150 MB) that has 2 million start URLs. What would be the best way to pass them to a Scrapy spider hosted on Scrapinghub?

jwaterschoot said
What file do you have? I think the easiest would be to simply attach a generator to your `start_requests` function. Something like the following for your spider class:
def start_requests(self):
    # yield one request per line of the URL file
    with open('file.csv', 'r') as fh:
        for line in fh.readlines():
            yield scrapy.Request(
                line.strip(), callback=self.parse_feed,
            )
Please note I did not test it. To handle parallel processing I believe you should change your settings; if you fire off two million requests rapidly, the site might block you.
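For reference, these are the kinds of Scrapy settings that control parallelism and politeness; the values below are only placeholders for illustration, not recommendations:

# settings.py (illustrative values only)
CONCURRENT_REQUESTS = 32            # total requests Scrapy keeps in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain, to avoid hammering one site
DOWNLOAD_DELAY = 0.25               # seconds to wait between requests to the same domain
AUTOTHROTTLE_ENABLED = True         # back off automatically when responses slow down
RETRY_TIMES = 2                     # keep retries modest on a two-million-URL run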
Billy John said
I think the problem might be deploying a project that large.
jwaterschoot said
What do you mean? 2 million rows is big, but I think people crawl bigger sets of data. Try to add the CSV file as a requirement in your configuration and it should probably work. Maybe start with a subset of your start URLs.
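One way to bundle the CSV with the deployed project is to declare it as package data in the project's setup.py and read it from the installed package at runtime. A rough sketch, assuming the project package is called `myproject` and the file lives under `myproject/resources/` (both names are placeholders):

from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    # bundle the CSV so it ships along with the spider code
    package_data={'myproject': ['resources/file.csv']},
    entry_points={'scrapy': ['settings = myproject.settings']},
)

The spider could then read the file with pkgutil.get_data('myproject', 'resources/file.csv'). As nestor points out below, though, the resource file limit on Scrapy Cloud is 50 MB, so this only helps for smaller URL lists.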
nestor said
The size limit for a resource file is 50 MB, so you should probably try deploying as a custom image (https://support.scrapinghub.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud) or download the list from an external source on spider start, like S3.
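Building on the external-source option, here is a minimal sketch of a spider that streams the URL list from S3 when it starts, so the full 150 MB file never has to sit in memory at once. The bucket and key names are placeholders, boto3 would need to be added to the project's requirements, and parse_feed stands in for the real parsing callback.

import boto3
import scrapy

class FeedSpider(scrapy.Spider):
    name = 'feed'

    def start_requests(self):
        s3 = boto3.client('s3')
        obj = s3.get_object(Bucket='my-bucket', Key='start_urls.csv')
        # StreamingBody.iter_lines() reads the object in chunks,
        # so the whole file is never held in memory at once
        for raw in obj['Body'].iter_lines():
            url = raw.decode('utf-8').strip()
            if url:
                yield scrapy.Request(url, callback=self.parse_feed)

    def parse_feed(self, response):
        pass  # real parsing logic goes here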
Billy John said
Thanks Nestor, that is what I was worried about, as well as deploy time. I will use a Google Sheet as the source, though it will read all 150 MB into memory at once, which is not good.