Is this the proper way to use hooks with Crawlera...so requests keeps retrying until the request is successful?
I'm no scraping expert. Any mods/improvements/deletions are welcome, as I'm trying to get a little better at this.
Questions/Confusion:
Re the response, what should the url_check variable be? Am I supposed to pass in a known/working URL from the crawl site, or am I supposed to pass the actual URL being crawled?
If it's the actual URL, how do you pass that variable into the check_status function?
Usage:

r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                 allow_redirects=True,
                 hooks=dict(response=check_status))
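For the url_check question, it looks like the hook already receives the Response for whatever URL was crawled, so maybe response.url is all I need instead of the hard-coded address? A stripped-down sketch of what I mean (reusing the same proxies and CRAWLERA_CERT globals from my script):

import requests
from time import sleep

def check_status_sketch(response, *args, **kwargs):
    # Re-check the URL that was actually crawled: response.url is the final
    # URL after redirects, response.request.url is the one originally requested.
    while response.status_code != 200:
        sleep(5)
        response = requests.head(response.url, proxies=proxies,
                                 verify=CRAWLERA_CERT, allow_redirects=True)
    # Whatever a response hook returns replaces the response handed back to
    # the caller, so r above would end up being the retried response.
    return response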
# Assumes pb (a Pushbullet client), iphone, proxies, CRAWLERA_CERT and
# Central (a timezone name, e.g. 'US/Central') are defined elsewhere in my script.
import logging
import random
from datetime import datetime
from time import sleep

import requests
from pytz import timezone, utc


def check_status(response, *args, **kwargs):
    sleepytime = 2000  # seconds -- a little over 33 minutes
    # This one works...instead, should it be the actual url?
    url_check = 'http://www.homeadvisor.com/c.Countertops.Omaha.NE.-12016.html'
    while response.status_code != 200:
        if response.status_code == 503:
            print(response, response.status_code)
            code = str(response.status_code)
            # Log the error and send a sleep notice to my phone
            push = pb.push_note(
                "Home Advisor is giving a {0} error. Sleeping for {1:.0f} minutes".format(
                    response.status_code, sleepytime / 60),
                "yep",
                device=iphone)
            # What's the time?
            now = datetime.utcnow().replace(tzinfo=utc)  # now in UTC
            nowtime = now.astimezone(
                timezone(Central)).strftime('%I:%M %Y-%m-%d')
            logging.info(
                "Home Advisor is giving a {0} error. "
                "Sleeping for {1:.0f} minutes at {2}".format(
                    response.status_code, sleepytime / 60, nowtime))
            # If a 503, sleep for sleepytime seconds (plus some jitter)...and try again
            sleep(sleepytime + random.uniform(20, 100))
        else:
            logging.info("Got a non-503 error; sleeping for a couple of seconds")
            logging.info("%s %s", response, response.status_code)
            # If NOT a 503, sleep 5+ seconds...and try again
            sleep(5 + random.uniform(1, 8))
        response = requests.head(
            url_check,
            proxies=proxies,
            verify=CRAWLERA_CERT,
            allow_redirects=True)
        # print(response.status_code)
    return response
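Side note on the "keep trying until it works" part: I've also seen people skip the hook entirely and let urllib3's Retry handle 503s with backoff, mounted on a Session via HTTPAdapter. I haven't tried this with Crawlera yet, so just a rough sketch (same url, proxies and CRAWLERA_CERT as above):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=10,                # give up after 10 attempts
                backoff_factor=2,        # exponential sleep between attempts
                status_forcelist=[503])  # retry when the proxy returns a 503
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                allow_redirects=True)

Not sure whether that plays nicely with Crawlera's long throttling windows, which is why I went with the hook above.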