Crawlera extremely slow

Posted over 7 years ago by joacimgunnarsson

Answered

joacimgunnarsson

Hey,

We have to use Crawlera for a site because requests are blocked otherwise, but it is very slow. We average 1 request per minute and 0.4 items per minute, which is too slow for us.

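def parse(self, response):
    # Article links: follow each one and parse the article page
    for href in response.css('...::attr(href)'):
        yield response.follow(self.base_url + href.extract(), self.parse_content)

    # Pagination links: follow each one and parse the next listing page
    for href in response.css('...::attr(href)'):
        yield response.follow(href, self.parse)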

The code is pretty straightforward; let's say it's a bunch of articles with pagination.

Also, for every article we need to visit a redirection link to get the final URL.

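import urllib2  # Python 2 standard library

try:
    # Follow the redirect chain and return the final URL
    urlToReturn = urllib2.urlopen(articleURL, timeout=30).geturl()
    return urlToReturn
except Exception:
    # On any error (timeout, HTTP error), fall back to the original URL
    return articleURL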

The idea is to run this every day, but right now the spider has not finished after 7 days. What can we do? Why is it so slow? Are we missing something?

All suggestions are appreciated.

Best regards Joacim

0 Votes

nestor posted over 7 years ago Admin Best Answer

You should disable Autothrottle in the settings when using Crawlera, then restart your job for the change to take effect. Autothrottle is enabled by default in Scrapy Cloud.
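
For anyone running the spider locally rather than on Scrapy Cloud, the same change can be made in the project's settings.py. This is only a minimal sketch: it assumes the scrapy-crawlera plugin is installed, and the API key is a placeholder, not a real value.

```python
# settings.py (sketch)

# Turn off Autothrottle so it does not slow requests down on top of
# Crawlera's own throttling.
AUTOTHROTTLE_ENABLED = False

# Assumption: the scrapy-crawlera plugin is what routes requests
# through Crawlera in this project.
DOWNLOADER_MIDDLEWARES = {"scrapy_crawlera.CrawleraMiddleware": 610}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = "<your API key>"  # placeholder
```

Restart the crawl after changing the file so the new settings are picked up.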

About the 3xxs: you're requesting URLs such as https://example.com/, which are being redirected to https://www.example.com/, so you might want to revise the URLs you request.
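
One way to avoid that extra redirect hop is to rewrite the host before the request is made. The helper below is a rough Python 3 sketch, not code from the original spider; the name add_www and the example.com host are purely illustrative.

```python
from urllib.parse import urlsplit, urlunsplit


def add_www(url, bare_host="example.com"):
    """Return url with the bare host replaced by its www. form.

    Hypothetical helper: bare_host is an assumption and should be set
    to whatever host the site redirects away from.
    """
    parts = urlsplit(url)
    if parts.netloc == bare_host:
        # Only the host changes; scheme, path and query are kept as-is.
        parts = parts._replace(netloc="www." + bare_host)
    return urlunsplit(parts)
```

Calling add_www on each link before yielding the request means every request goes straight to the final host, which saves one Crawlera request per page.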

1 Votes

9 Comments

Aaron Cowper posted almost 7 years ago

Hi, I'm having the same issue with Crawlera. It's only averaging around 4 items/min for a scrape of ~4000 items, despite having the C100 plan with CONCURRENT_REQUESTS set to 50 and Autothrottle turned off.

Could you let me know how this can be sped up?

0 Votes

nestor posted over 7 years ago Admin

Try increasing CONCURRENT_REQUESTS (it can be added in the UI settings too) to something within your plan's limits; C50 plans can have up to 50. But if you set the maximum on your Scrapy spider, you won't be able to use Crawlera simultaneously from other applications, as you will receive 429 errors for exceeding the connection limit.
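
As a rough sketch of that trade-off in settings.py: the two helper names and the numbers below are illustrative assumptions, not Scrapy settings. CONCURRENT_REQUESTS_PER_DOMAIN is raised as well because Scrapy's default for it is 8, which would otherwise cap a crawl that targets a single site.

```python
# settings.py (sketch); the numbers are illustrative assumptions

PLAN_CONNECTION_LIMIT = 50     # e.g. a C50 plan
RESERVED_FOR_OTHER_APPS = 10   # hypothetical headroom for other clients

# Stay below the plan limit so other applications using the same
# Crawlera account do not start receiving 429 responses.
CONCURRENT_REQUESTS = PLAN_CONNECTION_LIMIT - RESERVED_FOR_OTHER_APPS

# When crawling a single site, the per-domain cap is what actually
# limits throughput, so raise it to match.
CONCURRENT_REQUESTS_PER_DOMAIN = CONCURRENT_REQUESTS
```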

0 Votes

Ostapp posted over 7 years ago

Thank you Nestor! It worked!
However, I am still facing the same issue: Crawlera is very slow. I don't upload the spider to Scrapy Cloud. My spider uses Selenium + Polipo, and it's not trivial to build a correct Docker image of it for Scrapy Cloud, so I run it locally. I disabled Autothrottle in my project settings, but I am still getting 10 items per minute on average when using Crawlera and around 150 items per minute without it. Can you please advise whether there is any way to speed Crawlera up? Maybe I should purchase an extra package?
P.S. I am crawling this website - https://www.jameda.de/

0 Votes

nestor posted over 7 years ago Admin

In your project settings in the UI: https://app.scrapinghub.com/p/projectid/job-settings/standard. You just need to select it from the dropdown, make sure it is unchecked, and then click Save.

0 Votes

Ostapp posted over 7 years ago

Hi Nestor, how do I disable Autothrottle in the Scrapy Cloud settings?

0 Votes

nestor posted over 7 years ago Admin

You're welcome :)

0 Votes

joacimgunnarsson posted over 7 years ago

Thanks nestor!

Disabling Autothrottle seems to have sped things up.

0 Votes

joacimgunnarsson posted over 7 years ago

Also, we get a bunch of 301s and 5xx responses. We don't understand why.

0 Votes