
403 Errors when deployed but not locally

Hi


I have had several spiders working fine for months when deployed on Scrapinghub, but recently they have started failing immediately on the first request with 403 errors.


Is this perhaps due to Scrapinghub's standard servers (non-Crawlera) using a common pool of IPs? Most of my target sites are in Australia, so perhaps they have started geoblocking non-Australian IPs?


All of these spiders still work fine from a local machine.


Is the best solution to use Crawlera and set the region to Australia?


If so, is there a way to speed the spiders up? For example, is it faster to use a single Crawlera session (single IP) for the entire crawl rather than a new Crawlera session for each request?


Thanks


Best Answer

As Jwaterschoot said, sometimes setting a User-Agent helps.


But if the sites require geo-specific IPs, you would need to use Crawlera and set the account to use Australian IPs, as described here.
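
For reference, a minimal sketch of what enabling Crawlera in a Scrapy project's settings.py might look like, assuming the scrapy-crawlera middleware package is installed (the API key is a placeholder, and the Australian region itself is configured on the Crawlera account rather than in these settings):

    # settings.py -- sketch, assuming the scrapy-crawlera package is installed
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }

    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = '<your Crawlera API key>'   # placeholder
    # The AU region is set on the Crawlera account, not in settings.py.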


Sessions can help in making multiple requests using a single IP, but a new session would be required if you experience a ban. Hence you can make a batch of requests using a session ID, and whenever you experience a ban or the session expires (sessions auto-expire after 30 minutes of inactivity), you would need to use a new session ID.


A new Crawlera session for each request is similar to making requests with Crawlera without sessions, as a different IP would be used in both cases. Please refer to the article to learn more about sessions.
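
A rough sketch of reusing one Crawlera session across a crawl, based on the X-Crawlera-Session header covered in the sessions article (treat the exact header handling as something to verify against the docs; ban handling and session renewal are omitted):

    import scrapy

    class AussieSpider(scrapy.Spider):
        name = 'aussie'                     # hypothetical spider name
        start_urls = ['https://example.com.au/']

        session_id = None                   # filled in from the first response

        def start_requests(self):
            for url in self.start_urls:
                # Ask Crawlera to create a session on the first request.
                yield scrapy.Request(url, headers={'X-Crawlera-Session': 'create'})

        def parse(self, response):
            # Reuse the session ID Crawlera returns so later requests share one IP.
            if self.session_id is None:
                self.session_id = response.headers.get('X-Crawlera-Session', b'').decode()
            for href in response.css('a::attr(href)').getall():
                yield response.follow(
                    href,
                    headers={'X-Crawlera-Session': self.session_id},
                    callback=self.parse,
                )

As noted above, if a request comes back banned or the session expires, you would start a new session rather than keep reusing this ID.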




Do you use the same headers in the requests? I have often had the problem that the behaviour differs between local runs and the cloud when no User-Agent is set.
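
For example, a spider-level User-Agent can be set via custom_settings so the cloud deployment sends the same headers as a local run (the UA string below is just an example, and the spider name is hypothetical):

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'   # hypothetical
        custom_settings = {
            # Example browser-like User-Agent; without this the platform default
            # may differ from what is sent locally.
            'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
        }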


Hi,

I have enabled Crawlera with the region set to AU, and this generally works but is 5-10x slower than before.


Is there a faster way to run these spiders if I don't need a new IP per request and simply need the entire crawl to run from an Australian IP? Would using Crawlera sessions speed things up or actually add overhead?


Cheers

Sessions help if there is a requirement to use the same IP for multiple requests, such as authentication/login. With sessions there is a default 12-second delay between requests on the same IP.


You can speed up the crawl by increasing the concurrency. In Scrapy Cloud, AutoThrottle is enabled by default, so you would need to disable it and use the concurrency settings described in https://support.scrapinghub.com/solution/articles/22000188399-using-crawlera-with-scrapy.
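
As a rough illustration of those settings (the values are examples only; tune them to what the target sites and your Crawlera plan allow):

    # settings.py -- sketch of the concurrency tuning described above
    AUTOTHROTTLE_ENABLED = False          # disable AutoThrottle (enabled by default in Scrapy Cloud)
    CONCURRENT_REQUESTS = 32              # example value; match your Crawlera plan's limits
    CONCURRENT_REQUESTS_PER_DOMAIN = 32   # example value
    DOWNLOAD_DELAY = 0                    # let Crawlera handle throttling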


