I'm able to scrape the site locally, but it's not working through Crawlera. I've tried sending additional headers and it just hangs. I can get it working in the browser and locally in Ruby.
require 'net/http'
require 'uri'

# Create URL
url = 'http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
uri = URI.parse(url)

# Proxy: Crawlera takes the API key as the proxy username with an empty password
proxy_host = "proxy.crawlera.com"
proxy_port = 8010
proxy_user = "APIKEY"
proxy_pass = ""
proxy = Net::HTTP::Proxy(proxy_host, proxy_port, proxy_user, proxy_pass)

# GET from site
request = Net::HTTP::Get.new(uri)
request["Accept-Language"] = "en-US,en;q=0.8,ru;q=0.6"
request["Accept-Encoding"] = "gzip, deflate"
request["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
request["Connection"] = "keep-alive"
# request["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# req_options = { use_ssl: uri.scheme == "https" }

# Proxy response
response = proxy.start(uri.host, uri.port) do |http|
  http.request(request)
end

# Working bit without proxy
# response = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
#   http.request(request)
# end
nestor posted almost 7 years ago Admin Best Answer
This works fine:
curl -U $APIKEY: -vx proxy.crawlera.com:8010 "http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=" -H "Accept: application/json, text/plain, */*" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: en-US,en;q=0.9" -H "Referer: http://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1"
or this:
curl -U $APIKEY: -vx proxy.crawlera.com:8010 "http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: en-US,en;q=0.9" -H "Cache-Control: max-age=0"
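For reference, a rough Ruby equivalent of the first curl command above (same headers; the `APIKEY` placeholder and the helper names are illustrative, not from the original post):

```ruby
require "net/http"
require "uri"

# Build the GET request with the browser-style headers from the curl command.
def build_nba_request(uri)
  request = Net::HTTP::Get.new(uri)
  request["Accept"]          = "application/json, text/plain, */*"
  request["Accept-Encoding"] = "gzip, deflate"
  request["Accept-Language"] = "en-US,en;q=0.9"
  request["Referer"]         = "http://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1"
  request
end

# Send the request through Crawlera: the API key goes in as the proxy
# username with an empty password, mirroring `curl -U $APIKEY:`.
def fetch_through_crawlera(url, apikey)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port,
                  "proxy.crawlera.com", 8010, apikey, "") do |http|
    http.request(build_nba_request(uri))
  end
end
```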
0 Votes
12 Comments
atomant posted almost 7 years ago
The code works fine with other URLs, just not the stats.nba.com one.
0 Votes
atomant posted almost 7 years ago
I added the rest of the headers from Chrome and I'm still seeing it time out every time.
0 Votes
nestor posted almost 7 years ago Admin
Don't think so. Try adding the rest of the browser headers, like "Cache-Control: max-age=0".
0 Votes
atomant posted almost 7 years ago
Still timing out after 3 minutes.
stats.nba.com will apparently time out for AWS IPs. Is it possible they're doing the same for proxy IPs?
0 Votes
nestor posted almost 7 years ago Admin
Can you increase it to 180 and try again? Crawlera might be retrying with a different IP, so the request could take longer.
0 Votes
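In Ruby's Net::HTTP the client-side limits are `open_timeout` and `read_timeout`; a minimal sketch of raising the read timeout to 180 seconds (proxy host/port as above, `APIKEY` is a placeholder):

```ruby
require "net/http"

# Client routed through Crawlera with a generous read timeout, so the proxy
# has time to retry the origin site with different IPs before we give up.
http = Net::HTTP.new("stats.nba.com", 80,
                     "proxy.crawlera.com", 8010, "APIKEY", "")
http.open_timeout = 30   # seconds allowed to establish the connection
http.read_timeout = 180  # seconds to wait for data before Net::ReadTimeout
```

No request is sent until `http.start`/`http.request` is called, so the timeouts can be set up front.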
atomant posted almost 7 years ago
60-second timeout.
Tried it many times in the last several days.
0 Votes
nestor posted almost 7 years ago Admin
What's the timeout value on your client? It shouldn't be too short; otherwise Crawlera might not have enough time to respond.
0 Votes
atomant posted almost 7 years ago
A timeout from the client:
Net::ReadTimeout: Net::ReadTimeout
from (irb):65:in `block in irb_binding'
from (irb):64
0 Votes
nestor posted almost 7 years ago Admin
What kind of timeout? A timeout from your client or a timeout error from Crawlera with a 504 code?
0 Votes
atomant posted almost 7 years ago
Thanks for the quick response.
After I noticed that I'd left the API key in, I generated a new one.
I'm getting a timeout error when running it through Crawlera, and it works fine locally.
0 Votes
nestor posted almost 7 years ago Admin
That doesn't seem to be a Crawlera API Key. Also, please note that this is a public forum, so I've removed it from your post. In any case, please ensure you are using the API Key provided in your Crawlera dashboard (https://app.scrapinghub.com/o/orgid/crawlera/crawlerauser/setup) and let me know what error you get.
0 Votes