Crawlera 429 response code only for image requests, concurrency set correctly.

Posted over 6 years ago by dumperia


Unanswered

dumperia

Hi, we recently completed a large crawl of a site using Scrapy, and afterwards we noticed thousands of Crawlera 429 errors. We were on the C10 plan, and in the spider's custom settings I had set the following:

        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_crawlera.CrawleraMiddleware': 300,
        },
        'CRAWLERA_ENABLED': True,
        'CONCURRENT_REQUESTS': 10,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10,
        'AUTOTHROTTLE_ENABLED': False,
        'DOWNLOAD_TIMEOUT': 600,

As you can see, the concurrency should have been capped at 10.

I noticed in the logs that the 429 errors only occurred on the image requests, i.e. the requests generated from the Scrapy item's image_urls field. Is Scrapy not properly limiting concurrent requests? The images are hosted on a different server than the website being scraped, so I could see how CONCURRENT_REQUESTS_PER_DOMAIN might be to blame, but since we also have CONCURRENT_REQUESTS set to 10, I was under the impression that this was a global limit for the spider.
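For anyone who wants to check the same pattern in their own crawl, here is a minimal sketch of tallying 429 responses per host. It assumes the (url, status) pairs have already been pulled out of the Scrapy log by some other means; the helper name `count_429_by_host` and the sample URLs are hypothetical and not taken from the actual crawl.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_429_by_host(records):
    """Tally 429 responses per hostname.

    `records` is an iterable of (url, status) pairs. How these pairs are
    extracted from the crawl log is an assumption and is not shown here.
    """
    counts = Counter()
    for url, status in records:
        if status == 429:
            counts[urlsplit(url).hostname] += 1
    return counts


# Hypothetical sample data, for illustration only.
sample = [
    ("https://www.example.com/page/1", 200),
    ("https://images.example-cdn.com/a.jpg", 429),
    ("https://images.example-cdn.com/b.jpg", 200),
    ("https://images.example-cdn.com/c.jpg", 429),
]
print(count_429_by_host(sample))
```

If every 429 lands on the image host and none on the main site, that would match what I saw in the logs.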

Do you have any idea why this happened? It wasted a large number of requests that now need to be rerun. Thanks!


1 Comment

dumperia posted over 6 years ago

I also wanted to add that this did not happen for every image request, just a subset of them.
