I have had several spiders working fine for months when deployed on Scrapinghub, but recently they have started failing immediately on the first request with 403 errors.
Is this perhaps due to Scrapinghub's standard servers (non-Crawlera) using a common pool of IPs? Most of my target sites are in Australia, so perhaps they have started restricting access to Australian IPs?
All of these spiders still work fine from a local machine.
Is the best solution to use Crawlera and set the region to Australia?
If so, is there a way to speed the spiders up? For example, is it faster to use a single Crawlera session (single IP) for the entire crawl rather than a new Crawlera session for each request?
Thanks
4 Comments
jwaterschoot posted over 6 years ago
Do you use the same headers in the requests? I have often had the problem that behaviour differs between local and cloud runs when no User-Agent is set.
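For example, a minimal sketch of setting a project-wide User-Agent in a Scrapy project's settings.py (the browser string below is only an illustrative value, not a recommendation):

# settings.py -- project-wide default User-Agent (example value only)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)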
thriveni (Admin) posted over 6 years ago · Best Answer
As jwaterschoot said, sometimes setting User-Agents helps.
But if the sites require geo-specific IPs, you would need to use Crawlera and set the account to use Australian IPs as described here.
Sessions help you make multiple requests from a single IP, but a new session is required if you experience a ban. So you can make a batch of requests with one session ID, and whenever you experience a ban or the session expires (sessions auto-expire 30 minutes after last use), you would need to switch to a new session ID.
A new Crawlera session for each request is similar to making requests with Crawlera without sessions, since a different IP would be used in both cases. Please refer to the sessions article to learn more.
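As an illustration, here is a rough sketch of reusing one Crawlera session across a crawl from a Scrapy spider. It assumes the scrapy-crawlera middleware is already enabled and configured with your API key, uses placeholder spider and URL names, and creates/reuses the session via the X-Crawlera-Session header described in the sessions article.

import scrapy

class AussieSpider(scrapy.Spider):
    # Placeholder name and start URL; substitute your own.
    name = "aussie_example"
    start_urls = ["https://www.example.com.au/"]

    def start_requests(self):
        for url in self.start_urls:
            # Ask Crawlera to create a session; requests that carry the
            # returned session ID are sent from the same outgoing IP.
            yield scrapy.Request(
                url,
                headers={"X-Crawlera-Session": "create"},
                callback=self.parse,
            )

    def parse(self, response):
        # Crawlera returns the session ID in the response headers.
        session_id = response.headers.get("X-Crawlera-Session", b"").decode()
        for href in response.css("a::attr(href)").getall():
            # Reuse the same session (same IP) for follow-up requests.
            yield response.follow(
                href,
                headers={"X-Crawlera-Session": session_id},
                callback=self.parse,
            )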
Aaron Cowper posted about 6 years ago
Hi,
I have enabled Crawlera with region set to AU and this generally works but is 5-10x slower than previously.
Is there a faster way to run these spiders if I don't need a new IP per request? I simply need the whole crawl to run from an Australian IP. Would using Crawlera sessions speed things up, or actually add overhead?
Cheers
thriveni (Admin) posted about 6 years ago
Using sessions helps if there is a requirement to use the same IP for multiple requests, for example authentication/login. With sessions there is a default 12-second delay between uses of the same IP.
You can speed up the crawl by increasing the concurrency. In Scrapy Cloud, AutoThrottle is enabled by default, so you would need to disable AutoThrottle and use the concurrency settings as given in https://support.scrapinghub.com/solution/articles/22000188399-using-crawlera-with-scrapy.
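A minimal sketch of the corresponding settings.py overrides; the numbers are example values only and should be tuned for your Crawlera plan and the target sites:

# settings.py -- example overrides when crawling through Crawlera
AUTOTHROTTLE_ENABLED = False         # Scrapy Cloud enables AutoThrottle by default
CONCURRENT_REQUESTS = 32             # example value; raises overall parallelism
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # example value
DOWNLOAD_DELAY = 0                   # let Crawlera handle throttling
DOWNLOAD_TIMEOUT = 600               # example value; Crawlera requests can take longer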