
Crawlera 429 response code only for image requests, concurrency set correctly.

Hi, we recently completed a large crawl of a site using Scrapy, and afterwards we noticed thousands of Crawlera 429 errors. We were on the C10 plan, and I had set the following in the spider's custom settings:

 

        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_crawlera.CrawleraMiddleware': 300,
        },
        'CRAWLERA_ENABLED': True,
        'CONCURRENT_REQUESTS': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'AUTOTHROTTLE_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 600,

 

As you can see, the concurrency should have been capped at 10.


I noticed in the logs that the 429 errors only occurred on the image requests, i.e. the requests for the URLs in the Scrapy item's image_urls field. Is Scrapy not properly maintaining concurrent requests? The images are hosted on a different server than the website being scraped, so I could see how CONCURRENT_REQUESTS_PER_DOMAIN might be to blame, but since we also have CONCURRENT_REQUESTS set to 10, I was under the impression that this was a global limit for the spider.
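
One way to double-check that the 429s really are confined to the image host is to tally them by hostname. This is only a minimal sketch: it assumes the responses have already been pulled out of the logs as (url, status) pairs, and the function name count_429_by_host and the sample URLs are hypothetical, not taken from the actual crawl.

```python
from collections import Counter
from urllib.parse import urlparse


def count_429_by_host(records):
    """Count HTTP 429 responses per hostname.

    `records` is an iterable of (url, status) pairs (an assumed format).
    """
    counts = Counter()
    for url, status in records:
        if status == 429:
            counts[urlparse(url).hostname] += 1
    return counts


# Hypothetical sample data for illustration only.
sample = [
    ("https://www.example.com/product/1", 200),
    ("https://images.example-cdn.com/1.jpg", 429),
    ("https://images.example-cdn.com/2.jpg", 200),
    ("https://images.example-cdn.com/3.jpg", 429),
]

print(count_429_by_host(sample))
```

If only the image host shows up in the result, that would match what the logs suggested: a subset of image requests rejected while the page requests went through.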


Do you have any idea why this happened? It wasted a large number of requests that now need to be rerun. Thanks!

1 Comment

I also wanted to add that this did not happen for every image request, just a subset of them.
