How to use a long list of start URLs with Scrapinghub?

Posted about 6 years ago by Artem M

Answered

I have a large file (150 MB) with 2 million start URLs. What would be the best way to pass them to a Scrapy spider hosted on Scrapinghub?

0 Votes

nestor posted about 6 years ago Admin Best Answer

The size limit for a resource file is 50 MB, so you should probably try deploying as a custom image: https://support.scrapinghub.com/support/solutions/articles/22000200425-deploying-custom-docker-images-on-scrapy-cloud, or download the list from an external source, like S3, when the spider starts.
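
For the S3 route, a minimal, untested sketch of what fetching the list at spider start could look like; the bucket, key, spider name, and callback are placeholders, and boto3 would need to be in the project's requirements:

import boto3
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket="my-bucket", Key="start_urls.txt")
        # stream the body line by line so the 150 MB list is never fully in memory
        for raw in obj["Body"].iter_lines():
            url = raw.decode("utf-8").strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)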

1 Votes


5 Comments

Artem M posted about 6 years ago

Thanks Nestor. That is what I was worried about, along with deploy time. I will use a Google Sheet as the source, though it will read all 150 MB into memory at once, which is not good.
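
If the sheet does end up as the source, a rough, untested sketch of how the read could stay incremental rather than loading everything at once, assuming the sheet can be exported as CSV without auth (the sheet ID, spider name, and callback are placeholders):

import requests
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        export_url = "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv"
        # stream the response and iterate line by line instead of reading the whole body
        with requests.get(export_url, stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines(decode_unicode=True):
                url = (line or "").strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)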

0 Votes

jwaterschoot posted about 6 years ago

What do you mean? 2 million rows is a lot, but people crawl bigger sets of data. Try adding the CSV file as a resource in your deploy configuration and it should probably work. Maybe start with a subset of your start URLs.
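
For reference, bundling a data file with a shub deploy is usually done by declaring it as package data in setup.py; a sketch, with the project and file names as placeholders (and nestor's 50 MB resource limit still applying):

from setuptools import setup, find_packages

setup(
    name="myproject",
    version="1.0",
    packages=find_packages(),
    # include the CSV in the deployed package so the spider can open it at runtime
    package_data={"myproject": ["resources/file.csv"]},
    include_package_data=True,
    entry_points={"scrapy": ["settings = myproject.settings"]},
)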

0 Votes

Artem M posted about 6 years ago

I think the problem might be deploying a project that large.

0 Votes

jwaterschoot posted about 6 years ago

What file do you have? I think the easiest would be to simply turn your `start_requests` method into a generator. Something like the following for your spider class.


import scrapy

def start_requests(self):
    with open('file.csv', 'r') as fh:
        # iterate lazily instead of readlines() so the whole file is never held in memory
        for line in fh:
            url = line.strip()
            if url:
                yield scrapy.Request(
                    url, callback=self.parse_feed,
                )

Please note I did not test it. To handle parallel processing, I believe you should adjust your concurrency settings, especially since firing two million requests rapidly might get you blocked.
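
The settings being referred to are presumably Scrapy's concurrency and throttling options; a hedged example, with numbers that would need tuning per target site:

# in settings.py (or the spider's custom_settings)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.25        # small per-domain delay between requests
AUTOTHROTTLE_ENABLED = True  # back off automatically if responses slow down
RETRY_TIMES = 2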

0 Votes
