Hi, we recently completed a large crawl of a site using Scrapy, and afterwards we noticed thousands of Crawlera 429 errors. We were on the C10 plan, and in the spider's custom settings I had set the concurrency as follows:
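A minimal sketch of the relevant settings, assuming the standard Scrapy setting names (only CONCURRENT_REQUESTS = 10 is stated explicitly below; the per-domain value shown is an assumption chosen to match the C10 limit):

import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"  # placeholder spider name

    custom_settings = {
        # Global cap on concurrent requests across the whole spider
        "CONCURRENT_REQUESTS": 10,
        # Assumed per-domain cap; not confirmed in the original post
        "CONCURRENT_REQUESTS_PER_DOMAIN": 10,
    }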
As you can see, the concurrency should have been capped at 10.
I noticed in the logs that the 429 errors occurred only on the image requests, i.e. requests for the URLs in the Scrapy item's image_urls field. Is Scrapy not properly enforcing the concurrency limit? The images are hosted on a server different from the website being scraped, so I could see how CONCURRENT_REQUESTS_PER_DOMAIN might be to blame, but since we also have CONCURRENT_REQUESTS set to 10, I was under the impression that this was a global cap for the spider.
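For context, roughly how those image requests arise (ProductItem and the pipeline wiring below are illustrative placeholders, not taken from this crawl): each scraped item carries image_urls pointing at the separate image host, and the ImagesPipeline issues an additional download request for each of those URLs.

import scrapy

class ProductItem(scrapy.Item):
    # image_urls is the field the stock ImagesPipeline reads by default;
    # downloaded results are written back to the `images` field.
    image_urls = scrapy.Field()
    images = scrapy.Field()

# Enabling the pipeline (in settings.py or custom_settings), roughly:
# ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
# IMAGES_STORE = "images"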
Do you have any suggestion as to why this happened? It wasted a large number of requests that now need to be rerun. Thanks!
dumperia posted over 5 years ago
I also wanted to add that this was not the case for every image request, just a subset of them.