I have a single crawler written in Scrapy that renders pages with the Splash browser via the scrapy-splash Python package. I am using the Aquarium Python package to load-balance the parallel Scrapy requests across a Splash Docker cluster.
The scraper uses a long list of urls as the start_urls list. There is no "crawling" from page to page via hrefs or pagination.
I am running six Splash Docker containers with 5 slots per Splash instance as the load-balanced browser cluster. I am running Scrapy at six concurrent requests.
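A setup like this would typically correspond to Scrapy settings along the following lines. This is only a sketch, not the actual settings file: the middleware entries follow the scrapy-splash README, the `http://` scheme on the load balancer address is assumed, and the names `SPLASH_INSTANCES`, `SLOTS_PER_INSTANCE`, `TOTAL_SLOTS` and `SLOT_UTILIZATION` are hypothetical helpers added only to make the capacity arithmetic explicit.

```python
# settings.py (sketch; anything marked "assumed" or "hypothetical"
# is not taken from the original post)

# Address of the Aquarium load balancer in front of the Splash containers.
# The port is the one mentioned in the post; the http scheme is assumed.
SPLASH_URL = "http://localhost:9050"

# Scrapy-side concurrency, as described in the post.
CONCURRENT_REQUESTS = 6

# Standard scrapy-splash wiring, as documented in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Hypothetical helper values: compare Scrapy's concurrency with the
# total number of render slots in the cluster.
SPLASH_INSTANCES = 6       # six Splash containers
SLOTS_PER_INSTANCE = 5     # 5 slots per Splash instance
TOTAL_SLOTS = SPLASH_INSTANCES * SLOTS_PER_INSTANCE
SLOT_UTILIZATION = CONCURRENT_REQUESTS / TOTAL_SLOTS
```

With these numbers Scrapy can occupy at most 6 of the 30 render slots at a time, so the slot count by itself should not be the limit. All six containers do share the same dual-core CPU, though.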
The dev machine is a MacBook Pro with a dual-core 2.4 GHz CPU and 16 GB of RAM.
When the spider starts up, the Aquarium stdout shows fast requests/responses, the onboard fan spins up, and the system runs at about 90% CPU used with 10% idle, so I am not overloading the system resources. The memory/swap is not exhausted either.
At this point I get only ~30 pages/minute, which is very slow. After a few minutes, the fans spin down, the system resources are largely free (>60% idle), and the Scrapy log shows every request failing with a 503 timeout.
When I look at the stdout of the Aquarium cluster, there are requests being processed, albeit very slowly compared to when the spider is first invoked.
If I go to localhost:9050, I do get the Splash page after 10 seconds or so, so the load balancer/Splash is online.
If I stop the spider and restart it, it starts up normally. So this does not seem to be throttling by the target site, since a restarted spider would be throttled as well, and it is not.
I appreciate any insight that the community can offer.
Thanks.
1 Votes
2 Comments
David Kong posted about 5 years ago
I have a similar problem, did you figure it out?
0 Votes
lucaguarro posted about 4 years ago
Similar problem here as well. Nothing figured out?
0 Votes