Answered

How to use a long start URL list with Scrapinghub?

I have a large file (150 MB) that has 2 million start URLs. What would be the best way to pass them to a Scrapy spider hosted on Scrapinghub?


Best Answer

The size limit for a resource file is 50 MB, so you should probably try deploying as a custom image (https://support.scrapinghub.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud) or download the list from an external source, like S3, on spider start.
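For example, something like this to stream the list from S3 when the spider starts. Untested sketch; the bucket, key, and callback names are placeholders, and it assumes boto3 is available and credentials are configured.

import boto3
import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'

    def start_requests(self):
        # Stream the URL list straight from S3 instead of bundling it
        # with the deploy (bucket and key are placeholders).
        s3 = boto3.client('s3')
        obj = s3.get_object(Bucket='my-bucket', Key='start_urls.txt')
        for raw in obj['Body'].iter_lines():
            url = raw.decode('utf-8').strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # your parsing logic here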


What kind of file do you have? I think the easiest would be to simply attach a generator to your `start_requests` method. Something like the following for your spider class.


def start_requests(self):
    # Iterate over the file line by line instead of loading it all at once.
    with open('file.csv', 'r') as fh:
        for line in fh:
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse_feed)

Please note I did not test this. To handle parallel processing, I believe you should adjust your settings, especially since firing off two million requests rapidly might get you blocked.
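For instance, something along these lines in the spider; the numbers are just illustrative starting points, not recommendations for any particular site.

import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'
    # Illustrative values only; tune them for the target site.
    custom_settings = {
        'CONCURRENT_REQUESTS': 16,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
        # A small delay plus AutoThrottle reduces the chance of being blocked.
        'DOWNLOAD_DELAY': 0.5,
        'AUTOTHROTTLE_ENABLED': True,
    }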

I think the problem might be deploying a project that large.

What do you mean? 2 million rows is big, but I think people crawl bigger sets of data. Try to add the CSV file as a requirement in your configuration and it should probably work. Maybe start with a subset of your start URLs.
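One possible way to bundle the CSV with a `shub` deploy is via `package_data` in setup.py; a rough, untested sketch (project name and file path are placeholders):

from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    # Ship the CSV alongside the spider code so it is available on Scrapy Cloud
    # (keep an eye on the deploy size limit mentioned above).
    package_data={'myproject': ['resources/file.csv']},
    include_package_data=True,
    entry_points={'scrapy': ['settings = myproject.settings']},
)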

Thanks Nestor. That is what I was worried about, as well as the deploy time. I will use a Google Sheet as a source, though it will read all 150 MB into memory at once, which is not good.
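If memory is the concern, streaming the sheet's CSV export and reading it line by line avoids loading the whole thing at once. Untested sketch; the sheet ID and callback name are placeholders, and it assumes the sheet is readable via its CSV export URL.

import requests
import scrapy


class FeedSpider(scrapy.Spider):
    name = 'feed'
    # Placeholder: CSV export URL of the sheet holding the start URLs.
    SHEET_CSV_URL = 'https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv'

    def start_requests(self):
        # stream=True avoids holding the whole 150 MB response in memory;
        # iter_lines() yields it one row at a time.
        with requests.get(self.SHEET_CSV_URL, stream=True) as resp:
            resp.raise_for_status()
            for raw in resp.iter_lines(decode_unicode=True):
                url = (raw or '').strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        pass  # your parsing logic here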