
Timeout error with specific websites, tried everything

 

I'm using ScrapingHub's Scrapy Cloud to host my Python Scrapy project.

The spider runs fine when I run it locally, but on ScrapingHub, 3 specific websites (3 e-commerce stores from the same group, using the same website mechanics) time out. Like this:

[scrapy.downloadermiddlewares.retry] Retrying <GET https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4> (failed 1 times): User timeout caused connection failure: Getting https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4 took longer than 500 seconds..
I had a similar problem with these websites using Google Apps Script (UrlFetchApp) and also using the Python requests/BeautifulSoup packages. Inside GAS it was not possible to fetch the websites at all, but with requests/BeautifulSoup I found a workaround by sending a User-Agent in the request headers.
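For reference, that workaround looked roughly like this (a minimal sketch; the User-Agent string and product URL are just examples, not the exact values I used):

```python
import requests
from bs4 import BeautifulSoup

# Example only: a desktop browser User-Agent. Without it the request hung
# or was rejected; with it the page came back fine.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/91.0.4472.124 Safari/537.36"
    )
}

url = "https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4"
response = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)
```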

Now I am using Scrapy, and locally it runs fine, even without User-Agents, but running on Scrapy Cloud gives this timeout error. Very rarely (once or twice) it works and ScrapingHub is able to scrape those sites, but 99% of the attempts time out.

Here is everything I have already tried (the sketch after this list shows roughly how I configured it):

1- Added a User-Agent and all the request headers "needed" by those websites. I even used this site to extract all the headers and cookies used in cURL requests.

2- Added cookies taken from DevTools' Application tab and from the site mentioned above.

3- Removed the URL parameters after '?'

4- Increased DOWNLOAD_TIMEOUT and decreased CONCURRENT_REQUESTS

5- Disabled and enabled AUTOTHROTTLE

6- Enabled UserAgentMiddleware, even after trying to change the USER_AGENT setting in the ScrapingHub interface.

7- Enabled and disabled ROBOTSTXT_OBEY

8- Tried to add 'cookiejar' inside request.meta

9- I can't even remember all the 'solutions' I tried; I've been stuck on this for a while.
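For context, here is roughly what the spider and settings look like after attempts 1-8. This is a minimal sketch: the header values, cookie names, numbers, and the parse callback are placeholders, not my real project code.

```python
import scrapy


class StoreSpider(scrapy.Spider):
    name = "store_products"

    # Settings touched in attempts 4-7 (values here are examples)
    custom_settings = {
        "DOWNLOAD_TIMEOUT": 500,        # raised from the default 180s
        "CONCURRENT_REQUESTS": 2,       # lowered from the default 16
        "AUTOTHROTTLE_ENABLED": True,   # also tried False
        "ROBOTSTXT_OBEY": False,        # also tried True
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/91.0.4472.124 Safari/537.36"
        ),
    }

    start_urls = [
        "https://www.submarino.com.br/produto/133739829/game-naruto-to-boruto-shinobi-striker-day-one-ps4",
    ]

    def start_requests(self):
        # Attempts 1, 2 and 8: browser-like headers, cookies copied from
        # DevTools, and a 'cookiejar' key in request.meta (placeholder values)
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
        }
        cookies = {"example_cookie": "value_copied_from_devtools"}
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                headers=headers,
                cookies=cookies,
                meta={"cookiejar": 1},
                callback=self.parse,
            )

    def parse(self, response):
        # Placeholder: the real spider extracts product data here
        yield {"url": response.url, "title": response.css("title::text").get()}
```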

I don't think I'm getting IP-banned, because I've requested these websites A LOT from outside ScrapingHub. As I said before, out of many, many attempts, 3 or 4 times one of the links was successfully requested, but it's VERY rare.

The websites are: americanas.com.br, submarino.com.br and shoptime.com.br

As I said before, they use the same mechanics: Akamai's services, as mentioned in this post.

I still think it's related to headers and User-Agent issues, as also mentioned in this post.

I really need some help here; any ideas are very welcome.

PS: I've attached the last log file.


