Scrapy Cloud Technical Details

Posted over 3 years ago by Jacob Moore

Unanswered

I'm currently working on a fairly heavy web scraping project, which involves a blind traversal of a few thousand domains to find certain downloadable files somewhere within them. My questions are as follows:

I noticed that the Smart Proxy Manager is automatically integrated with spiders deployed through Scrapy Cloud. Are there restrictions on the number of concurrent requests permitted for a given spider? I would need to sustain at least 25-50 requests per minute, issued concurrently.
On a similar note, is there a bandwidth limit on requests?
I traverse websites using an informed, keyword-based heuristic: I score each link and perform a best-first search through the site, so as to minimize the work needed to find the files I want. Is Scrapy capable of traversing sites in this atypical way? Scrapy spiders seem to have a somewhat rigid structure, and I'm skeptical as to whether they can work the way I want.
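
To make the traversal in the last question concrete, here is a minimal sketch of a keyword-scored best-first search. It uses only the Python standard library and an in-memory link map in place of real page fetches. The keyword weights, the `score` function, and the `links` and `is_target` parameters are all hypothetical stand-ins, since the actual heuristic is not given here.

```python
import heapq

# Hypothetical keyword weights; the real heuristic is not given in the post.
KEYWORD_WEIGHTS = {"download": 3, "files": 2, "archive": 1}


def score(url):
    """Sum the weights of the keywords that appear in the URL."""
    lowered = url.lower()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in lowered)


def best_first(start, links, is_target, max_pages=1000):
    """Visit pages in descending score order and collect target URLs.

    `links` maps a URL to the URLs it links to (a stand-in for fetching
    and parsing a page); `is_target` decides whether a URL is a wanted file.
    """
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in insertion order.
    frontier = [(-score(start), 0, start)]
    seen = {start}
    counter = 1
    visited, found = [], []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        if is_target(url):
            found.append(url)
            continue
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-score(nxt), counter, nxt))
                counter += 1
    return visited, found


if __name__ == "__main__":
    site = {
        "https://example.com/": [
            "https://example.com/about",
            "https://example.com/download",
        ],
        "https://example.com/download": ["https://example.com/files/report.pdf"],
        "https://example.com/about": ["https://example.com/team"],
    }
    visited, found = best_first(
        "https://example.com/", site, lambda u: u.endswith(".pdf")
    )
    print(visited)
    print(found)
```

In Scrapy itself, a similar ordering can probably be approximated by passing the heuristic score as the `priority` argument of `scrapy.Request`, since the scheduler dequeues higher-priority requests first. With many requests in flight at once the order is only approximately best-first.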

If anyone can advise me on these points, I'd greatly appreciate it.

Many thanks.
