I'm using ScrapingHub's Scrapy Cloud to host my Python Scrapy project.
The spider runs fine when I run it locally, but on ScrapingHub, 3 specific websites (3 e-commerce stores from the same group, using the same website mechanics) time out, like this:
[scrapy.downloadermiddlewares.retry] Retrying <GET https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4> (failed 1 times): User timeout caused connection failure: Getting https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4 took longer than 500 seconds..
I had a similar problem with these websites using Google Apps Script (UrlFetchApp) and also using the Python requests/BeautifulSoup packages. Inside GAS it was not possible to fetch the websites at all, but with requests/BeautifulSoup I found a workaround by setting a User-Agent in the request headers.
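For reference, the requests/BeautifulSoup workaround was roughly like the sketch below (the exact User-Agent string is just an example, not the one I actually used):

import requests
from bs4 import BeautifulSoup

# Plain requests were blocked, but a browser-like User-Agent header
# was enough to get a normal response from these sites.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36"
    ),
}

url = "https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4"
response = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")
print(response.status_code, soup.title)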
Now I am using Scrapy, and it runs fine locally, even without a User-Agent, but running on Scrapy Cloud gives this timeout error. Very rarely (once or twice) it works and ScrapingHub is able to scrape those sites, but 99% of the attempts time out.
I have already tried the following (a rough sketch of how items 1-8 were wired up is shown after this list):
1- Added a User-Agent and all the request headers "needed" by those websites. I even used this site to extract all the headers and cookies used in cURL requests.
2- Added cookies taken from DevTools' Application tab and from the site mentioned above.
3- Stripped the URL parameters after '?'
4- Increased DOWNLOAD_TIMEOUT and decreased CONCURRENT_REQUESTS
5- Disabled and enabled AUTOTHROTTLE
6- Enabled UserAgentMiddleware, even after trying to change the USER_AGENT setting in the ScrapingHub interface
7- Enabled and disabled ROBOTSTXT_OBEY
8- Tried adding 'cookiejar' to request.meta
9- I can't even remember all the 'solutions' I tried; I've been stuck with this for a while.
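To make those attempts concrete, here is a minimal sketch of how items 1-8 were combined in the spider. The header values, cookies, setting values, and spider name are placeholders, not the exact ones I used:

import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"  # placeholder name

    # Items 4-7: settings adjusted between runs (values are examples).
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 500,        # item 4: increased timeout
        "CONCURRENT_REQUESTS": 2,       # item 4: reduced concurrency
        "AUTOTHROTTLE_ENABLED": True,   # item 5: toggled on/off
        "ROBOTSTXT_OBEY": False,        # item 7: toggled on/off
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # item 6
    }

    def start_requests(self):
        # Item 3: URL with everything after '?' stripped.
        url = "https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4"

        # Items 1-2: headers and cookies copied from the browser / cURL export
        # (placeholder values shown here).
        headers = {
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "pt-BR,pt;q=0.9",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        }
        cookies = {"example_cookie": "value"}

        yield scrapy.Request(
            url,
            headers=headers,
            cookies=cookies,
            meta={"cookiejar": 1},  # item 8
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %s (%d bytes)", response.url, len(response.body))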
I don't think I'm getting IP banned, because I've requested these websites A LOT outside of ScrapingHub. As I said before, among many, many attempts, 3 or 4 times one of the links was successfully requested, but it's VERY rare.
The websites are: americanas.com.br, submarino.com.br and shoptime.com.br
As I said before, they share the same mechanics: Akamai's services, as mentioned in this post.
I still think it's related to headers and User-Agent issues, as also mentioned in this post.
I really need some help here; any ideas would be very helpful.
PS: I've attached the latest log file.