
503 error / X-Crawlera-Error: noslaves

Hi
Sometimes I receive that error from Crawlera, and after it happens I'm not able to scrape anymore. Yesterday it happened during the afternoon and I was not able to scrape again for the rest of the day. Today it was working again, but after a while the same thing happened.

What is the proper way to minimize this error and to deal with it?

Retry and wait?


Thank you.



I've been running into the same issue recently. There have been no changes to the code on my end, but over the last couple of weeks we're seeing a lot of `X-Crawlera-Error: noslaves`.


Very anecdotal example:


- Arrive in the morning, set off a scrape. Successfully fetch a few hundred pages and finish, with no 503s of any kind.

- Immediately after the first scrape has finished, set off another one, either pointing at the same URLs or at different URLs from the same site. This time, and seemingly on all subsequent attempts: `X-Crawlera-Error: noslaves` and zero pages downloaded (see the logging sketch after this list).
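
For anyone who wants to confirm they're hitting the same thing, here is a minimal sketch that just logs the `X-Crawlera-Error` header for every response. The spider name and URL are placeholders, and `HTTPERROR_ALLOWED_CODES` is set so that 503 responses reach the callback instead of being filtered out:

import scrapy


class HeaderCheckSpider(scrapy.Spider):
    # Hypothetical spider, only used to observe Crawlera error headers.
    name = "header_check"
    start_urls = ["https://example.com/"]  # placeholder URL

    # Let 503 responses reach the callback instead of being dropped
    # by the HttpError middleware.
    custom_settings = {"HTTPERROR_ALLOWED_CODES": [503]}

    def parse(self, response):
        error = response.headers.get("X-Crawlera-Error", b"").decode()
        if error:
            self.logger.warning("Crawlera error %r for %s", error, response.url)
        else:
            self.logger.info("OK (%s) %s", response.status, response.url)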

I've been trying the retry-and-wait approach.


At the moment I'm watching a scrape that uses the custom version of the `RetryMiddleware` shown below.


I'm getting about one page every 45 minutes.


Are we doing something wrong, or are there just "noslaves" for most of the day at the moment?



import math
import time

from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message


class CustomRetryMiddleware(RetryMiddleware):

    def process_response(self, request, response, spider):
        """Handle "noslaves" errors with custom code, otherwise use the
        same implementation as the original RetryMiddleware.
        """
        no_slaves = response.headers.get("X-Crawlera-Error") == b"noslaves"

        if no_slaves:
            # Crawlera has no servers available, so it won't try again.
            # By default we'd return a 503, which Scrapy would retry
            # immediately (and there would probably still be no slave
            # available).
            #
            # Instead of returning the 503 straight away, we sleep for a
            # while and then trigger another retry (hopefully this approach
            # won't get rate-limited or banned by Crawlera).
            #
            # We bypass the normal max_retry_times used for other requests,
            # as we don't want to stop crawling until we've had access to
            # some servers.
            #
            # NOTE: time.sleep() blocks Scrapy's reactor thread, so all
            # in-flight requests pause while we wait.
            spider.logger.debug("noslaves: sleeping before retrying %s", request.url)
            time.sleep(20)

            new_request = request.copy()
            new_request.meta["max_retry_times"] = math.inf
            return (
                self._retry(new_request, f"noslaves for {request.url}", spider)
                or response
            )

        # Normal RetryMiddleware behaviour
        if request.meta.get("dont_retry", False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response


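For completeness, here's a minimal sketch of how the middleware could be wired up in the project settings. The `myproject` package path is a placeholder for wherever the class actually lives; the built-in RetryMiddleware is disabled so the custom one fully replaces it, using the same priority slot (550) the stock middleware occupies:

# settings.py (sketch; "myproject" is a placeholder for your project package)
DOWNLOADER_MIDDLEWARES = {
    # Disable the stock RetryMiddleware...
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    # ...and plug in the custom one at the same priority.
    "myproject.middlewares.CustomRetryMiddleware": 550,
}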
