503 errors and how to avoid them

So here are my Scrapy crawl stats:

 'downloader/request_method_count/GET': 2853,
 'downloader/response_count': 2822,
 'downloader/response_status_count/200': 1291,
 'downloader/response_status_count/301': 932,
 'downloader/response_status_count/502': 26,
 'downloader/response_status_count/503': 525,
 'downloader/response_status_count/504': 48,
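Just to put numbers on how bad this is, here's the quick math on those stats:

```python
# Quick sanity check on the crawl stats above.
responses = 2822
errors = {502: 26, 503: 525, 504: 48}

rate_503 = errors[503] / responses
rate_5xx = sum(errors.values()) / responses

print(f"503 rate: {rate_503:.1%}")  # roughly 18.6% of all responses
print(f"5xx rate: {rate_5xx:.1%}")  # roughly 21.2% of all responses
```

So nearly one in five responses is a 503.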

OK, so I'm on Crawlera's C50 plan, and I'm mildly disappointed to be seeing over 500 responses with status 503. Crawlera boasts features such as managing crawl speed, detecting bans, and all sorts of other exciting things that got me interested in starting a plan here. So I imagined that when the first batch of IPs stopped working, Crawlera would fall back on a more diverse pool of addresses. But no. Instead it retried countless times and slowed the crawl to a ridiculous pace: from scraping 100 items in a little over an hour, to 100 in 3 hours, to 50 in 3 hours.

Don't get me wrong, I was ultra excited at first. Crawlera helped me bypass the captcha that I had been solving with the DeathByCaptcha service, but now I need help. Regardless of all this, I wanted to hit the site slowly, so I set CONCURRENT_REQUESTS to 1, but 525 IPs still managed to get banned, whereas before Crawlera my one IP was never banned after hundreds of hours of crawling. So I'm asking: what now?

How can I avoid the 503 service unavailable errors?

Clearly a slower crawl rate didn't work, the 5 retries didn't work, and the site I'm scraping isn't overly busy like Amazon or Google. Secure? Sure, they use captchas and Cloudflare, and probably lots of other scraper-prevention methods that haven't been made clear to me yet, besides this situation.
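For reference, here is roughly the kind of throttling setup I've been trying. These are standard Scrapy setting names, but the exact values shown are illustrative, not what I'd claim is correct:

```python
# Illustrative Scrapy settings sketch for backing off on 5xx errors.
# Standard Scrapy setting names; the values are just what I've been
# experimenting with, not a recommendation.

CONCURRENT_REQUESTS = 1             # one request in flight at a time
DOWNLOAD_DELAY = 10                 # seconds between requests
RETRY_ENABLED = True
RETRY_HTTP_CODES = [502, 503, 504]  # retry only these statuses
RETRY_TIMES = 2                     # give up sooner rather than hammering the site
```

Even with settings in this spirit, the 503s keep piling up.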

Do these 503 responses get charged to my Crawlera account?

I hope not.  That would be rather unfortunate.

Is there any way to now "unban" these IP addresses from the site I'm scraping? Do I just need to wait it out? Do I get new IP addresses each time Crawlera starts with my scraper? Just curious. Most of the information on this particular topic isn't very concise.
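If waiting it out really is the answer, I figure something like exponential backoff between retries would at least be gentler on the site. A minimal sketch (this is my own helper, not part of Scrapy or Crawlera):

```python
import random

def backoff_delay(attempt, base=5.0, cap=300.0):
    """Exponential backoff with jitter: roughly 5s, 10s, 20s, ... capped at 5 min.

    attempt -- zero-based retry count for the current request.
    """
    delay = min(cap, base * (2 ** attempt))
    # Randomize so parallel retries don't all fire at the same moment.
    return delay * random.uniform(0.5, 1.0)
```

E.g. the fourth attempt (attempt=3) waits somewhere between 20 and 40 seconds.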

I mean there is this:

"Crawlera isn't perfect and sometimes it just won't have enough capacity to fetch pages from a specific domain, even after trying many outgoing nodes."

from another familiar post. But that sentence, and the rest of the post, doesn't offer a solid solution. It reads more like a pat on the back saying "you're on your own, buddy."

So please, can anyone help me? :)
