
scrapy-splash crawler starts fast but slows down (not throttled by website)

I have a single crawler written in Scrapy that drives the Splash browser via the scrapy-splash Python package. I am using the aquarium Python package to load-balance the parallel Scrapy requests across a cluster of Splash Docker containers.

The scraper uses a long list of URLs as the start_urls list. There is no "crawling" from page to page via hrefs or pagination; a sketch of the structure follows.
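For context, the spider is shaped roughly like this (spider name, URLs, and selectors here are placeholders, not my exact code):

    import scrapy
    from scrapy_splash import SplashRequest

    class ListSpider(scrapy.Spider):
        name = "list_spider"

        # A long, pre-built list of URLs; no link following happens.
        start_urls = [
            "https://example.com/page/1",
            "https://example.com/page/2",
        ]

        def start_requests(self):
            for url in self.start_urls:
                # Every URL is rendered through Splash behind the load balancer.
                yield SplashRequest(
                    url,
                    callback=self.parse,
                    endpoint="render.html",
                    args={"wait": 0.5},
                )

        def parse(self, response):
            # Extraction logic omitted.
            yield {"url": response.url, "title": response.css("title::text").get()}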

I am running six Splash Docker containers with five slots each as the load-balanced browser cluster, and Scrapy at six concurrent requests.
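The relevant settings follow the standard scrapy-splash pattern, something like the sketch below (the values are illustrative of my setup, not copied verbatim):

    # settings.py (sketch)

    # Aquarium's HAProxy front end, balancing across the Splash containers.
    SPLASH_URL = "http://localhost:9050"

    # Scrapy-side concurrency, matched to the cluster size.
    CONCURRENT_REQUESTS = 6

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    }
    SPIDER_MIDDLEWARES = {
        "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
    }
    DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"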

The dev machine is a MacBook Pro with a dual-core 2.4 GHz CPU and 16 GB of RAM.

When the spider starts up, the aquarium stdout shows fast request/responses, the onboard fan spins up, and the system runs at about 90% used with 10% idle, so I am not overloading the system resources. Memory/swap is not exhausted either.

At this point I get a very slow ~30 pages/minute. After a few minutes, the fans spin down, the system resources free up significantly (>60% idle), and the Scrapy log shows every request coming back as a 503 timeout.

When I look at the stdout of the aquarium cluster, there are requests being processed, albeit very slowly compared to when the spider is first invoked. 

If I go to localhost:9050, I do get the Splash page after 10 seconds or so, so the load balancer/Splash is online.

If I stop the spider and restart it, it starts up normally, so this does not appear to be throttling by the target site; if it were, the restarted spider would also be throttled, but it is not.

I appreciate any insight that the community can offer.

Thanks. 



Similar problem here as well. Nothing figured out?

I have a similar problem, did you figure it out?
