scrapy-splash crawler starts fast but slows down (not throttled by website)

Posted over 5 years ago by Stephen Madison

I have a single crawler written in scrapy using the splash browser via the scrapy-splash python package. I am using the aquarium python package to load balance the parallel scrapy requests to a splash docker cluster.
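
For reference, the scrapy-splash wiring follows the standard setup from the scrapy-splash README; a minimal sketch (with SPLASH_URL pointed at the aquarium/HAProxy endpoint, assumed here to be localhost:9050 as mentioned below) looks like this:

```python
# settings.py -- minimal scrapy-splash wiring (simplified sketch).
# SPLASH_URL points at the aquarium load-balancer endpoint; localhost:9050
# is assumed here based on the description further down.
SPLASH_URL = 'http://localhost:9050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```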

The scraper uses a long list of urls as the start_urls list. There is no "crawling" from page to page via hrefs or pagination. 
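
The spider itself just feeds that flat list of URLs through SplashRequest, roughly along these lines (the spider name, URLs, and selector are placeholders for illustration):

```python
import scrapy
from scrapy_splash import SplashRequest


class ListSpider(scrapy.Spider):
    # Hypothetical name and URLs; the real list is much longer.
    name = 'list_spider'
    start_urls = ['https://example.com/item/1', 'https://example.com/item/2']

    def start_requests(self):
        # One Splash-rendered request per start URL; no link following.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 1.0})

    def parse(self, response):
        # Extraction only; no further requests are yielded from the response.
        yield {'url': response.url, 'title': response.css('title::text').get()}
```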

I am running six Splash Docker containers with 5 slots each as the load-balanced browser cluster, and scrapy at six concurrent requests.
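
The scrapy side is capped to match the cluster; a sketch of the relevant settings (the DOWNLOAD_TIMEOUT value here is illustrative, not necessarily what I use):

```python
# settings.py -- concurrency matched to the cluster described above:
# six Splash containers (each started with --slots 5) and scrapy held
# to six concurrent requests.
CONCURRENT_REQUESTS = 6
CONCURRENT_REQUESTS_PER_DOMAIN = 6
DOWNLOAD_TIMEOUT = 90  # illustrative value; Splash renders can take a while
```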

The dev machine is a MacBook Pro with a dual-core 2.4 GHz CPU and 16 GB of RAM.

When the spider starts up, the aquarium stdout shows fast request/responses, the onboard fan spins up, and the CPU sits at about 90% used (10% idle), so I am not overloading the system resources. Memory/swap is not exhausted either.

Even then I only get a slow ~30 pages/minute. After a few minutes, the fans spin down, the system resources free up significantly (>60% idle), and the scrapy log shows every request failing with a 503.

When I look at the stdout of the aquarium cluster, there are requests being processed, albeit very slowly compared to when the spider is first invoked. 

If I go to localhost:9050, I do get the Splash page after 10 seconds or so, so the load balancer/Splash is online.
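
For the same check from a script, Splash also exposes a /_ping health endpoint that can be hit through the balancer; a minimal probe, again assuming the balancer is at localhost:9050:

```python
import requests

# Probe the Splash HTTP API through the load balancer; /_ping should
# return a small JSON status document if the instance behind it is healthy.
resp = requests.get('http://localhost:9050/_ping', timeout=30)
print(resp.status_code, resp.text)
```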

If I stop the spider and restart it, it starts up normally again, so this does not seem to be throttling by the target site; if it were, the restarted spider would be throttled as well, but it is not.

I appreciate any insight that the community can offer.

Thanks. 

1 Vote


2 Comments

lucaguarro posted about 4 years ago

Similar problem here as well. Nothing figured out?

0 Votes

David Kong posted about 5 years ago

I have a similar problem. Did you ever figure it out?

0 Votes
