
How to use Hooks in Python Requests (when using Crawlera)

Is this the proper way to use hooks with Crawlera, so that Requests keeps retrying until the request is successful?


I'm no scraping expert. Any mods/improvements/deletions are welcome, as I'm trying to get a little better at this.


Questions/Confusion:

Regarding the response, what should the url_check variable be? Am I supposed to pass in a known/working URL from the crawl site, or the actual URL being crawled?

        response = requests.head(
            url_check,
            proxies=proxies,
            verify=CRAWLERA_CERT,
            allow_redirects=True)

If it's the actual URL, how do you pass that variable into the check_status function?


r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                 allow_redirects=True,
                 hooks=dict(response=check_status))
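From the Requests docs, the response hook is called with the Response object itself, so the URL that was actually crawled is already available inside the hook as response.request.url (or response.url after redirects); there may be no need for a separate url_check at all. And if I do want to push my own variable in, functools.partial looks like it can bind it. A rough sketch, reusing the url/proxies/CRAWLERA_CERT names from above (the url_check keyword here is just an illustration, not anything Requests requires):

import functools

# Inside the existing check_status hook, the actual URL is already there:
#     crawled_url = response.request.url   # or response.url, after redirects
# so url_check wouldn't have to be hard-coded.

# To push an extra variable of my own into the hook, functools.partial can
# bind it (url_check is my own made-up name; it arrives in **kwargs when
# Requests calls the hook):
hook = functools.partial(check_status, url_check=url)
r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                 allow_redirects=True,
                 hooks=dict(response=hook))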

 


Usage:

r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT, allow_redirects=True, hooks=dict(response=check_status))
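If every request in the scrape should run through the same check, the hook can also be registered once on a Session instead of on each get() call; a small sketch, reusing proxies/CRAWLERA_CERT from above:

s = requests.Session()
s.proxies = proxies
s.verify = CRAWLERA_CERT
s.hooks['response'].append(check_status)   # runs for every request on this session

r = s.get(url, allow_redirects=True)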

      

import logging
import random
from datetime import datetime
from time import sleep

import requests
from pytz import timezone, utc

# proxies, CRAWLERA_CERT, Central, pb (Pushbullet client), and iphone are
# assumed to be defined elsewhere in the script.


def check_status(response, *args, **kwargs):
    sleepytime = 1200  # 20 minutes, in seconds
    # This one works...instead, should it be the actual url?
    url_check = 'http://www.homeadvisor.com/c.Countertops.Omaha.NE.-12016.html'

    while response.status_code != 200:
        if response.status_code == 503:
            print(response, response.status_code)

            # Log error and send a sleep notice
            push = pb.push_note(
                "Home Advisor is giving a {0} error. Sleeping for {1:.0f} minutes".format(
                    response.status_code, sleepytime / 60),
                "yep",
                device=iphone)

            # What's the time?
            now = datetime.utcnow().replace(tzinfo=utc)  # now in UTC
            nowtime = now.astimezone(
                timezone(Central)).strftime('%I:%M %Y-%m-%d')

            logging.info(
                "Home Advisor is giving a {0} error. "
                "Sleeping for {1:.0f} minutes at {2}".format(
                    response.status_code, sleepytime / 60, nowtime))

            # If a 503, sleep for 20+ minutes...and try again
            sleep(sleepytime + random.uniform(20, 100))

        else:
            logging.info("Got a non-503 error, sleeping for a few seconds")
            logging.info("%s %s", response, response.status_code)

            # If NOT a 503, sleep 5+ seconds...and try again
            sleep(5 + random.uniform(1, 8))

        response = requests.head(
            url_check,
            proxies=proxies,
            verify=CRAWLERA_CERT,
            allow_redirects=True)

    return response
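For comparison, the retry loop doesn't necessarily have to be hand-rolled inside a hook at all: urllib3's Retry, mounted on a Session through an HTTPAdapter, can retry 503 responses with exponential backoff on its own. A sketch, not a Crawlera recommendation (the retry count and backoff factor are arbitrary):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retries = Retry(
    total=5,                 # give up after 5 attempts
    backoff_factor=2,        # exponential backoff between attempts
    status_forcelist=[503])  # retry when the server answers 503

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))

r = s.get(url, proxies=proxies, verify=CRAWLERA_CERT, allow_redirects=True)

One caveat: when the retries are exhausted, the adapter raises a requests.exceptions.RetryError instead of returning the last response, so the calling code has to handle that.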

   
