Is this the proper way to use hooks with Crawlera...so requests keeps retrying until the request is successful?
I'm no scraping expert. Any mods/improvements/deletions are welcome, as I'm trying to get a little better at this.
Questions/Confusion:
Re the response, what should the url_check variable be? Am I supposed to pass in a known/working URL from the crawl site, or am I supposed to pass the actual URL being crawled?
If it's the actual URL, how do you pass that variable into the check_status function?
Usage:

r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                 allow_redirects=True,
                 hooks=dict(response=check_status))
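For the url_check question, it looks like the hook already receives the Response for whatever URL was crawled, so maybe response.url is all I need instead of the hard-coded address? A stripped-down sketch of what I mean (reusing the same proxies and CRAWLERA_CERT globals from my script):

import requests
from time import sleep

def check_status_sketch(response, *args, **kwargs):
    # Re-check the URL that was actually crawled: response.url is the final
    # URL after redirects, response.request.url is the one originally requested.
    while response.status_code != 200:
        sleep(5)
        response = requests.head(response.url, proxies=proxies,
                                 verify=CRAWLERA_CERT, allow_redirects=True)
    # Whatever a response hook returns replaces the response handed back to
    # the caller, so r above would end up being the retried response.
    return response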
# Assumes pb (a Pushbullet client), iphone, proxies, CRAWLERA_CERT and
# Central (a timezone name, e.g. 'US/Central') are defined elsewhere in my script.
import logging
import random
from datetime import datetime
from time import sleep

import requests
from pytz import timezone, utc


def check_status(response, *args, **kwargs):
    sleepytime = 2000  # seconds -- a little over 33 minutes
    # This one works...instead, should it be the actual url?
    url_check = 'http://www.homeadvisor.com/c.Countertops.Omaha.NE.-12016.html'
    while response.status_code != 200:
        if response.status_code == 503:
            print(response, response.status_code)
            code = str(response.status_code)
            # Log the error and send a sleep notice to my phone
            push = pb.push_note(
                "Home Advisor is giving a {0} error. Sleeping for {1:.0f} minutes".format(
                    response.status_code, sleepytime / 60),
                "yep",
                device=iphone)
            # What's the time?
            now = datetime.utcnow().replace(tzinfo=utc)  # now in UTC
            nowtime = now.astimezone(
                timezone(Central)).strftime('%I:%M %Y-%m-%d')
            logging.info(
                "Home Advisor is giving a {0} error. "
                "Sleeping for {1:.0f} minutes at {2}".format(
                    response.status_code, sleepytime / 60, nowtime))
            # If a 503, sleep for sleepytime seconds (plus some jitter)...and try again
            sleep(sleepytime + random.uniform(20, 100))
        else:
            logging.info("Got a non-503 error; sleeping for a couple of seconds")
            logging.info("%s %s", response, response.status_code)
            # If NOT a 503, sleep 5+ seconds...and try again
            sleep(5 + random.uniform(1, 8))
        response = requests.head(
            url_check,
            proxies=proxies,
            verify=CRAWLERA_CERT,
            allow_redirects=True)
        # print(response.status_code)
    return response
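Side note on the "keep trying until it works" part: I've also seen people skip the hook entirely and let urllib3's Retry handle 503s with backoff, mounted on a Session via HTTPAdapter. I haven't tried this with Crawlera yet, so just a rough sketch (same url, proxies and CRAWLERA_CERT as above):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=10,                # give up after 10 attempts
                backoff_factor=2,        # exponential sleep between attempts
                status_forcelist=[503])  # retry when the proxy returns a 503
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                allow_redirects=True)

Not sure whether that plays nicely with Crawlera's long throttling windows, which is why I went with the hook above.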