I'm able to scrape the site locally, but it's not working through Crawlera. I've tried sending additional headers and it just hangs. It works in the browser and locally in Ruby.
require 'net/http'
require 'uri'

# Build the request URL
url = 'http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='
uri = URI.parse(url)

# Crawlera proxy settings: the API key is the proxy username, password is empty
# (curl's -U APIKEY: syntax means user "APIKEY" with an empty password)
proxy_host = "proxy.crawlera.com"
proxy_port = 8010
proxy_user = "APIKEY"
proxy_pass = ""
proxy = Net::HTTP::Proxy(proxy_host, proxy_port, proxy_user, proxy_pass)

# Build the GET request with browser-like headers
request = Net::HTTP::Get.new(uri)
request["Accept-Language"] = "en-US,en;q=0.8,ru;q=0.6"
request["Accept-Encoding"] = "gzip, deflate"
request["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"
request["Connection"] = "keep-alive"
# request["User-Agent"] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# req_options = { use_ssl: uri.scheme == "https" }

# Send the request through the proxy
response = proxy.start(uri.host, uri.port) do |http|
  http.request(request)
end

# Working version without the proxy:
# response = Net::HTTP.start(uri.hostname, uri.port, req_options) do |http|
#   http.request(request)
# end
nestor
said
over 6 years ago
That doesn't seem to be a Crawlera API Key; also, please note that this is a public forum, so I've removed it from your post. In any case, please ensure you are using the API Key provided in your Crawlera dashboard (https://app.scrapinghub.com/o/orgid/crawlera/crawlerauser/setup) and let me know what error you get.
atomant
said
over 6 years ago
Thanks for the quick response.
After I noticed that I'd left the API key in, I generated a new one.
I'm getting a timeout error when running it through Crawlera, but it works fine locally.
nestor
said
over 6 years ago
What kind of timeout? A timeout from your client or a timeout error from Crawlera with a 504 code?
atomant
said
over 6 years ago
A timeout from the client
Net::ReadTimeout: Net::ReadTimeout
from (irb):65:in `block in irb_binding'
from (irb):64
nestor
said
over 6 years ago
What's the timeout value on your client? It shouldn't be too short, otherwise Crawlera might not have enough time to respond.
atomant
said
over 6 years ago
A 60-second timeout.
I've tried it many times over the last several days.
nestor
said
over 6 years ago
Can you increase it to 180 and try again? Crawlera might be retrying with a different IP, so the request could take longer.
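In Net::HTTP, raising the client-side timeout looks roughly like this (a sketch; the 180-second value and proxy details mirror the thread, and APIKEY is a placeholder for the real key):

```ruby
require 'net/http'

# Build a connection through the Crawlera proxy with a longer read timeout,
# so Crawlera has time to retry through different IPs before the client
# raises Net::ReadTimeout.
http = Net::HTTP.new("stats.nba.com", 80, "proxy.crawlera.com", 8010, "APIKEY", "")
http.open_timeout = 30   # seconds to wait for the connection to open
http.read_timeout = 180  # seconds to wait for the response body

# Calling http.request(Net::HTTP::Get.new("/stats/leaguedashteamstats?..."))
# would then send the request through the proxy with these timeouts.
```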
atomant
said
over 6 years ago
Still timing out after 3 minutes.
stats.nba.com will apparently time out requests from AWS IPs. Is it possible they're doing the same for proxy IPs?
nestor
said
over 6 years ago
I don't think so. Try adding the rest of the browser headers, like "Cache-Control: max-age=0".
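In the Ruby snippet from the question, that amounts to setting a few more headers on the request object (a sketch; the Upgrade-Insecure-Requests value is a typical Chrome header, not one named in this thread):

```ruby
require 'net/http'
require 'uri'

uri = URI.parse("http://stats.nba.com/stats/leaguedashteamstats")
request = Net::HTTP::Get.new(uri)

# Remaining browser-style headers, as suggested above.
request["Cache-Control"] = "max-age=0"
request["Upgrade-Insecure-Requests"] = "1"
```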
atomant
said
over 6 years ago
I added the rest of the headers from Chrome and it still times out every time.
atomant
The code works fine with other URLs, just not the stats.nba.com one.
nestor
This works fine:
curl -U $APIKEY: -vx proxy.crawlera.com:8010 "http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=" -H "Accept: application/json, text/plain, */*" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: en-US,en;q=0.9" -H "Referer: http://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1"
or this:
curl -U $APIKEY: -vx proxy.crawlera.com:8010 "http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: en-US,en;q=0.9" -H "Cache-Control: max-age=0"
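For reference, a rough Ruby equivalent of the first working curl command above (a sketch: the headers are copied from the curl call, and CRAWLERA_APIKEY is a placeholder environment variable, not something defined in this thread):

```ruby
require 'net/http'
require 'uri'

uri = URI.parse('http://stats.nba.com/stats/leaguedashteamstats?Season=2014-15&MeasureType=Base&SeasonType=Regular+Season&PerMode=PerGame&Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&SeasonSegment=&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')

# Mirror the headers from the working curl command, including the Referer.
request = Net::HTTP::Get.new(uri)
request["Accept"] = "application/json, text/plain, */*"
request["Accept-Encoding"] = "gzip, deflate"
request["Accept-Language"] = "en-US,en;q=0.9"
request["Referer"] = "http://stats.nba.com/teams/traditional/?sort=W_PCT&dir=-1"

# API key as proxy username with an empty password (curl's -U APIKEY: syntax).
apikey = ENV["CRAWLERA_APIKEY"]
http = Net::HTTP.new(uri.host, uri.port, "proxy.crawlera.com", 8010, apikey, "")
http.read_timeout = 180

# Only send when a key is actually configured.
if apikey
  response = http.request(request)
  puts response.code
end
```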