I have a large file (150 MB) that contains 2 million start URLs. What would be the best way to pass them to a Scrapy spider hosted on Scrapinghub?
0 Votes
nestor posted about 6 years ago Admin Best Answer
The size limit for a resource file is 50 MB, so you should probably try deploying as a custom image (https://support.scrapinghub.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud) or download the list from an external source, like S3, when the spider starts.
1 Votes
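For illustration only, here is a minimal sketch of the second option nestor mentions: downloading the URL list from an external source such as S3 when the spider starts. The bucket name, key, file path and callback are placeholders, boto3 would need to be added to the project's requirements, and AWS credentials would have to be available to the job (for example via environment variables).

import boto3
import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'

    def start_requests(self):
        # Fetch the URL list from S3 when the spider starts.
        # Bucket, key and local path are placeholders for illustration.
        s3 = boto3.client('s3')
        s3.download_file('my-bucket', 'start_urls.txt', '/tmp/start_urls.txt')
        with open('/tmp/start_urls.txt') as fh:
            for line in fh:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Parsing logic goes here.
        pass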
5 Comments
Artem M posted about 6 years ago
I think the problem might be deploying a project that large.
0 Votes
jwaterschoot posted about 6 years ago
What do you mean? 2 million rows is big, but people crawl bigger data sets. Try adding the CSV file as a requirement in your project configuration and it should probably work. Maybe start with a subset of your start URLs.
0 Votes
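As a rough sketch of what bundling the CSV with the deployed project could look like, assuming a standard setup.py-based deploy: the file is declared as package data and read through pkgutil inside the spider. The project, package and file names are placeholders, pkgutil.get_data() loads the whole file into memory, and the resource size limit nestor mentions still applies.

# setup.py (sketch) -- ship the CSV inside the deployed package.
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    package_data={'myproject': ['resources/start_urls.csv']},
    entry_points={'scrapy': ['settings = myproject.settings']},
)

# myproject/spiders/feed.py (sketch) -- read the bundled file.
import pkgutil

import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'

    def start_requests(self):
        # pkgutil finds the file whether the project runs from source
        # or from the deployed package.
        data = pkgutil.get_data('myproject', 'resources/start_urls.csv')
        for line in data.decode('utf-8').splitlines():
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse_feed)

    def parse_feed(self, response):
        pass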
Artem M posted about 6 years ago
Thanks, Nestor. That is what I was worried about, as well as the deploy time. I will use a Google Sheet as the source, though it will read all 150 MB into memory at once, which is not good.
0 Votes
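If the Google Sheet route is what ends up being used, one way to avoid reading all 150 MB into memory at once is to stream the sheet's published CSV export and yield a request per row. This is only a sketch: it assumes the sheet is published to the web as CSV, the export URL is a placeholder, and requests would need to be in the project's requirements.

import requests
import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'

    # Placeholder URL for a sheet published to the web as CSV.
    SHEET_CSV_URL = 'https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv'

    def start_requests(self):
        # stream=True avoids loading the whole body at once;
        # iter_lines() yields one row at a time.
        with requests.get(self.SHEET_CSV_URL, stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines(decode_unicode=True):
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass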
jwaterschoot posted about 6 years ago
What file do you have? I think the easiest would be to simply use a generator in your `start_requests` function. Something like the following for your spider class:

def start_requests(self):
    with open('file.csv', 'r') as fh:
        for line in fh:
            yield scrapy.Request(
                line.strip(), callback=self.parse_feed,
            )

Please note I did not test it. To handle parallel processing I believe you should change your settings; especially if you are going to make two million requests rapidly, they might block you.
0 Votes
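On the settings remark above: these are the standard Scrapy settings that usually get tuned for a crawl of this size (concurrency, politeness, retries). The values below are arbitrary placeholders, not recommendations.

# settings.py (sketch) -- tune concurrency and politeness for a large crawl.
CONCURRENT_REQUESTS = 32            # total parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap, to avoid hammering one site
DOWNLOAD_DELAY = 0.25               # pause between requests to the same domain

# Let Scrapy adapt the request rate to how the server responds.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0

# Retry transient failures instead of losing URLs.
RETRY_ENABLED = True
RETRY_TIMES = 2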