Is this the proper way to use hooks with Crawlera, so that requests keeps retrying until the request is successful?
I'm no scraping expert. Any mods/improvements/deletions are welcome, as I'm trying to get a little better at this.
Questions/Confusion:
Regarding the response, what should the url_check variable be? Am I supposed to pass in a known/working URL from the crawl site, or am I supposed to pass the actual URL being crawled?
If it's the actual URL, how do you pass that variable into the check_status function?
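If I understand the hooks mechanism correctly, the response argument handed to check_status is the actual Response object for the URL that was just requested, so the crawled URL might already be available inside the hook without hard-coding it. A stripped-down sketch of what I mean (just the URL part, none of the retry logic):

def check_status(response, *args, **kwargs):
    actual_url = response.url              # final URL after any redirects
    requested_url = response.request.url   # URL the request was originally sent to
    print(actual_url, requested_url)
    return response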
Usage:

r = requests.get(url, proxies=proxies, verify=CRAWLERA_CERT,
                 allow_redirects=True,
                 hooks=dict(response=check_status))
import logging
import random
from datetime import datetime
from time import sleep

import requests
from pytz import timezone, utc  # assuming pytz supplies utc/timezone here

# pb (Pushbullet client), iphone, Central, proxies and CRAWLERA_CERT are
# defined elsewhere in the script.

def check_status(response, *args, **kwargs):
    sleepytime = 1200  # 20 minutes
    # This one works...instead, should it be the actual url?
    url_check = 'http://www.homeadvisor.com/c.Countertops.Omaha.NE.-12016.html'
    while response.status_code != 200:
        if response.status_code == 503:
            print(response, response.status_code)
            # Log error and push a sleep notice
            push = pb.push_note(
                "Home Advisor is giving a {0} error. Sleeping for {1:.0f} minutes"
                .format(response.status_code, sleepytime / 60),
                "yep",
                device=iphone)
            # What's the time?
            now = datetime.utcnow().replace(tzinfo=utc)  # now in UTC
            nowtime = now.astimezone(
                timezone(Central)).strftime('%I:%M %Y-%m-%d')
            logging.info(
                "Home Advisor is giving a {0} error. "
                "Sleeping for {1:.0f} minutes at {2}"
                .format(response.status_code, sleepytime / 60, nowtime))
            # If a 503, sleep for 20+ minutes...and try again
            sleep(sleepytime + random.uniform(20, 100))
        else:
            logging.info("Got a non-503 error (%s); sleeping for a few seconds",
                         response.status_code)
            # If NOT a 503, sleep 5+ seconds...and try again
            sleep(5 + random.uniform(1, 8))
        # Re-check and loop again until we get a 200
        response = requests.head(
            url_check,
            proxies=proxies,
            verify=CRAWLERA_CERT,
            allow_redirects=True)
        # print(response.status_code)
    return response
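A variation I've also been wondering about (not sure if it's better practice): registering the hook once on a requests Session, so every request made through that session gets the same retry behaviour. proxies, CRAWLERA_CERT, url and check_status are the same names as above:

session = requests.Session()
session.proxies = proxies        # same Crawlera proxy settings as above
session.verify = CRAWLERA_CERT
session.hooks['response'].append(check_status)

r = session.get(url, allow_redirects=True)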