Answered

Crawlera extremely slow

Hey,

We must use Crawlera for a site because our requests are blocked otherwise, but it is very slow. We average 1 request per minute and 0.4 items per minute, which is too slow.


 

def parse(self, response):
    # Follow each article link to its detail page.
    for href in response.css('...::attr(href)'):
        yield response.follow(self.base_url + href.extract(), self.parse_content)

    # Follow pagination links back into this callback.
    for href in response.css('...::attr(href)'):
        yield response.follow(href, self.parse)


The code is pretty straightforward: let's say the site is a bunch of articles with pagination.

Also, for every article we need to visit a redirect link to get the final URL.


 

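import urllib2  # Python 2 standard library

try:
    # Blocking request that follows the redirect and returns the final URL.
    urlToReturn = urllib2.urlopen(articleURL, timeout=30).geturl()
    return urlToReturn
except Exception:
    # Fall back to the original URL if the lookup fails.
    return articleURL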

 


The idea is to run this every day, but right now the spider has not finished after 7 days. What can we do? Why is it so slow? Are we missing something?

All suggestions are appreciated.

Best regards, Joacim



Also, we get a bunch of 301 and 5xx responses, and we don't understand why.

Answer

You should disable Autothrottle in the settings when using Crawlera, then restart your job for the change to take effect. Autothrottle is enabled by default in Scrapy Cloud.
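For spiders run outside Scrapy Cloud, the same switch can presumably be set in the project's settings file. A minimal sketch, assuming a standard Scrapy project layout; AUTOTHROTTLE_ENABLED is Scrapy's own setting name:

```python
# settings.py (sketch; assumes a standard Scrapy project layout)

# Turn off the AutoThrottle extension so it does not add its own adaptive
# delays on top of whatever pacing the proxy applies.
AUTOTHROTTLE_ENABLED = False
```

For a single run, the same value can be passed on the command line, e.g. `scrapy crawl myspider -s AUTOTHROTTLE_ENABLED=False`, where `myspider` is a placeholder spider name.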

About the 3xxs: you're requesting URLs such as https://example.com/, which are redirected to https://www.example.com/, so you might want to revise the URLs you request and use the www host directly.
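One way to avoid that extra round trip could be to normalize URLs before requesting them. The helper below is only a sketch: `with_www` is a hypothetical name, and it assumes the site's canonical host is the www one and that URLs carry a plain hostname (no port or credentials).

```python
from urllib.parse import urlsplit, urlunsplit


def with_www(url):
    """Return url with a 'www.' prefix on the host if it lacks one."""
    parts = urlsplit(url)
    host = parts.netloc
    # Assumption: netloc is a bare hostname such as "example.com".
    if host and not host.startswith("www."):
        parts = parts._replace(netloc="www." + host)
    return urlunsplit(parts)


print(with_www("https://example.com/"))
```

In a spider, the result would be passed to the request instead of the raw URL, so the 301 hop is skipped for hosts that follow this pattern.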



Thanks nestor!

Disabling Autothrottle seems to have sped things up.


You're welcome :)

Hi Nestor, how do I disable Autothrottle in the Scrapy Cloud settings?

In your project settings in the UI: https://app.scrapinghub.com/p/projectid/job-settings/standard. Select it from the dropdown, make sure it is unchecked, and then click Save.

Thank you Nestor! It worked!
However, I am still facing the same issue: Crawlera is very slow. I don't upload the spider to Scrapy Cloud. My spider uses Selenium + Polipo, and it's not trivial to build a correct Docker image of it for upload to Scrapy Cloud, so I run it locally. I disabled Autothrottle in my project settings, but I am still getting 10 items per minute on average when using Crawlera, versus around 150 items per minute without it. Can you please advise whether there is any way to speed Crawlera up? Maybe I should purchase an extra package?
P.S. I am crawling this website: https://www.jameda.de/

Try increasing CONCURRENT_REQUESTS (it can be added in the UI settings too) to something within your plan's limits; C50 plans can have up to 50. But if you use the maximum in your Scrapy spider, you won't be able to use Crawlera simultaneously from other applications, as you will receive 429 errors for exceeding the connection limit.
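As a rough illustration of leaving headroom below the plan limit, here is a settings sketch. The numbers and the lowercase helper variables are assumptions for the example, not values from this thread's plans; raising CONCURRENT_REQUESTS_PER_DOMAIN as well is also an assumption, based on Scrapy's default of 8 for that setting.

```python
# settings.py (sketch; the numbers are illustrative assumptions)

plan_limit = 50           # assumed concurrency limit of the plan (e.g. C50)
reserved_for_others = 10  # hypothetical headroom for other applications

# Global cap on simultaneous requests issued by Scrapy.
CONCURRENT_REQUESTS = plan_limit - reserved_for_others

# Per-domain cap; Scrapy's default is 8, so a single-site crawl may never
# reach the global cap unless this is raised too.
CONCURRENT_REQUESTS_PER_DOMAIN = CONCURRENT_REQUESTS
```

Scrapy only loads uppercase names from the settings module, so the lowercase helpers are ignored by the framework.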

Hi, I'm having the same issue with Crawlera: it's only averaging around 4 items/min for a scrape of ~4000 items, despite having the C100 plan with CONCURRENT_REQUESTS set to 50 and Autothrottle turned off.


Could you let me know how this can be sped up?
