
Not working with stats.nba.com

I'm able to scrape the site locally, but it's not working through Crawlera. I've tried sending additional headers and it just hangs. I can get it working in the browser and locally in Ruby.


  

require 'net/http'
require 'uri'

# Create URL

url = 'http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
uri = URI.parse(url)

# Proxy
proxy_host = "proxy.crawlera.com"
proxy_port = 8010
proxy_user = "APIKEY:"
proxy = Net::HTTP::Proxy(proxy_host, proxy_port, proxy_user)

# GET from site
request = Net::HTTP::Get.new(uri)
request["Accept-Language"] = "en-US,en;q=0.8,ru;q=0.6"
request["Accept-Encoding"] = "gzip, deflate"
request["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
request["Connection"] = "keep-alive"
#request["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# req_options = {use_ssl: uri.scheme == "https",}

# Proxy response
response = proxy.start(uri.host, uri.port) do |http|
  http.request(request)
end

# Working bit without proxy
# response = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
#   http.request(request)
# end
 

  



That doesn't seem to be a Crawlera API key. Also, please note that this is a public forum, so I've removed it from your post. In any case, please ensure you are using the API key provided in your Crawlera dashboard https://app.scrapinghub.com/o/orgid/crawlera/crawlerauser/setup and let me know what error you get.

Thanks for the quick response.

After I noticed that I had left the API key in, I generated a new one.

I'm getting a timeout error when running it through Crawlera, but it works fine locally.

What kind of timeout? A timeout from your client or a timeout error from Crawlera with a 504 code?

A timeout from the client

 

Net::ReadTimeout: Net::ReadTimeout
        from (irb):65:in `block in irb_binding'
        from (irb):64
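For anyone unsure which kind of timeout they are seeing, here is a minimal sketch of how the two cases could be told apart in Ruby. `classify_attempt` is a hypothetical helper, not part of Net::HTTP or Crawlera; it assumes the block returns an object that responds to `code` the way a `Net::HTTPResponse` does.

```ruby
require 'net/http'

# Hypothetical helper: runs the request block and reports what happened.
#   :client_timeout -> Net::HTTP gave up waiting (as in the trace above)
#   :proxy_timeout  -> a response arrived, but with a 504 status code
#   :responded      -> any other response
def classify_attempt
  response = yield
  response.code == '504' ? :proxy_timeout : :responded
rescue Net::OpenTimeout, Net::ReadTimeout
  :client_timeout
end
```

It could be wrapped around the proxied call, for example `classify_attempt { http.request(request) }`.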

 

What's the timeout value on your client? Shouldn't be too short, otherwise Crawlera might not have enough time to respond.

60 second timeout

Tried it many times in the last several days

Can you increase it to 180 and try again? Crawlera might be retrying with a different IP, so the request could take longer.
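A rough sketch of raising the client timeout with Net::HTTP, in case it helps. The proxy host and port are the ones from the original post; `APIKEY` is a placeholder, and passing it as the proxy username with an empty password is an assumption about how the credentials are supplied. `build_proxied_http` is a made-up helper name.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not open) a proxied connection
# with a longer read timeout than Net::HTTP's 60-second default.
def build_proxied_http(uri, read_timeout: 180)
  # Arguments: target host, target port, proxy host, proxy port, proxy user, proxy password.
  # 'APIKEY' is a placeholder; the empty password is an assumption.
  http = Net::HTTP.new(uri.host, uri.port, 'proxy.crawlera.com', 8010, 'APIKEY', '')
  http.open_timeout = 30            # seconds to wait for the connection to open
  http.read_timeout = read_timeout  # seconds to wait for each read of the response
  http
end

uri = URI.parse('http://stats.nba.com/stats/leaguedashteamstats')
http = build_proxied_http(uri)
# response = http.request(Net::HTTP::Get.new(uri))  # this line performs the request
```

Nothing is sent until `http.request` is called, so the timeout values can be adjusted before the first attempt.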

Still timing out after 3 minutes

stats.nba.com apparently times out for AWS IPs. Is it possible they're doing the same for proxy IPs?

Don't think so. Try adding the rest of browser headers like "Cache-Control: max-age=0"
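As an illustration of that suggestion, the headers could be kept in one hash and applied to the request. This is only a sketch: the values are copied from the original post plus the `Cache-Control` header mentioned above, and `BROWSER_HEADERS` and `build_request` are made-up names. Which headers the site actually requires is not known here.

```ruby
require 'net/http'
require 'uri'

# Header values taken from the original post, plus Cache-Control.
# Accept-Encoding is omitted here; Net::HTTP sends its own default.
BROWSER_HEADERS = {
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
  'Accept-Language' => 'en-US,en;q=0.8,ru;q=0.6',
  'Cache-Control' => 'max-age=0',
  'Connection' => 'keep-alive',
  'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 ' \
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}.freeze

# Hypothetical helper: builds a GET request carrying the given headers.
def build_request(uri, headers = BROWSER_HEADERS)
  request = Net::HTTP::Get.new(uri)
  headers.each { |name, value| request[name] = value }
  request
end

request = build_request(URI.parse('http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15'))
```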

I added the rest from Chrome and I'm still seeing it time out every time.



 

      request["Accept-Language"] = "en-US,en;q=0.8,ru;q=0.6"
      request["Accept-Encoding"] = "gzip, deflate"
      request["Accept"] = "application/json, text/plain, */*"
      request["Connection"] = "keep-alive"
      request["x-nba-stats-token"] = "true"
      request["Referer"] = "http://stats.nba.com/teams/traditional/"
      request["x-nba-stats-origin"] = "stats"

 

The code works fine with other URLs, just not the stats.nba.com one.
