5 (technical) questions about Scrapy Cloud and Scrapy


First, I really fell in love with Scrapy! Although I have only experimented with it for about a week, it already makes my life much easier. Thank you!

Second, to learn more about Scrapy and Scrapy Cloud, may I ask you a few questions that I haven't found an appropriate answer to yet?

1. How do I access the folder .scrapy/httpcache in persistent storage at Scrapy Cloud? When I write information to a file in the Python script, where is this information stored at Scrapy Cloud?

2. An ethical and legal question: What is an acceptable rate to crawl the same domain/IP per minute? I think my current rate is high (> 100 requests per second). However, I do not want to harm the website owner.

3. What would be the best way to convert information from an API that is in JSON format to a class object, knowing that the JSON data has many keys and also contains many nested keys? Will it be best to write all classes and their relationships, or is there an easier way in Scrapy that I haven't found out yet?

4. Are there any good tutorials and books, or other learning material, available?

5. What is the best practice for using different parse methods to extract data on the same object from multiple pages? Until now, I have added a meta keyword argument to all requests, but since these kwargs are passed on to about 6 or 7 additional methods, I am not sure this is the best way to achieve my goal. Eventually, only the last method will yield the item, which may consume a lot of RAM and be inefficient.
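
To make question 5 concrete, here is a minimal sketch of the pattern it describes: a partial item travels through the request meta from one callback to the next, and only the last callback yields it. The class name, the relative URLs ("details", "reviews"), the CSS selectors, and the item fields are all hypothetical. The class is written as a mixin that touches only response.css, response.follow, and response.meta, so it does not import Scrapy itself; it is an illustration under those assumptions, not a recommended design.

```python
class MultiPageCallbacks:
    """Callbacks intended to be mixed into a scrapy.Spider subclass.

    Only response.css, response.follow and response.meta are used,
    so this class does not need to import Scrapy.
    """

    def parse(self, response):
        # First page: start the item and hand it to the next callback.
        item = {"title": response.css("h1::text").get()}
        yield response.follow(
            "details", callback=self.parse_details, meta={"item": item}
        )

    def parse_details(self, response):
        # Middle page: pick up the partial item and extend it.
        item = response.meta["item"]
        item["price"] = response.css(".price::text").get()
        yield response.follow(
            "reviews", callback=self.parse_reviews, meta={"item": item}
        )

    def parse_reviews(self, response):
        # Last page: complete the item and yield it exactly once.
        item = response.meta["item"]
        item["reviews"] = response.css(".review::text").getall()
        yield item
```

In a real project this would be combined with a spider, for example class BookSpider(MultiPageCallbacks, scrapy.Spider) with its own name and start_urls (both hypothetical here). Each pending request holds a reference to the same small dict rather than a copy, so the memory cost is roughly one partial item per request that is still in flight.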

Well, I guess there's much for me to learn. Thanks a lot, guys!
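
For question 3, one possible shortcut that avoids writing a class per JSON object is the object_hook parameter of json.loads from the Python standard library, which can turn every nested JSON object into an attribute-style object. This is a sketch, not a Scrapy feature; the helper name json_to_object and the sample payload are made up for illustration.

```python
import json
from types import SimpleNamespace


def json_to_object(text):
    """Parse JSON text so that every JSON object becomes a SimpleNamespace.

    object_hook is called for each decoded JSON object, innermost first,
    so nested keys end up reachable with attribute access.
    """
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


# Hypothetical API payload with nested keys.
raw = '{"id": 7, "author": {"name": "Ann", "tags": ["a", "b"]}}'
obj = json_to_object(raw)
author_name = obj.author.name   # nested key as attribute
first_tag = obj.author.tags[0]  # JSON arrays stay plain Python lists
```

Keys that are not valid Python identifiers (for example "first-name") are still stored, but they can only be read with getattr(obj, "first-name"). Explicit classes remain the better choice when validation or type checking is needed.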

1 Comment

Sorry, but to avoid confusion: I made a typo, "> 100 requests per second" should of course be "> 100 requests per minute".
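
For reference, a rate ceiling such as the 100 requests per minute mentioned here can be translated into Scrapy settings with some simple arithmetic. The sketch below is only an illustration: the helper name polite_settings is hypothetical, and the chosen ceiling is an assumption rather than a known "acceptable" value, since that depends on the site and its terms. The setting names themselves (ROBOTSTXT_OBEY, DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, AUTOTHROTTLE_ENABLED) are standard Scrapy settings.

```python
def polite_settings(requests_per_minute):
    """Build a Scrapy settings dict that caps the per-domain request rate.

    With one concurrent request per domain, a delay of
    60 / requests_per_minute seconds keeps the average rate at or
    below the requested ceiling.
    """
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return {
        "ROBOTSTXT_OBEY": True,                # respect robots.txt rules
        "DOWNLOAD_DELAY": 60.0 / requests_per_minute,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,   # no parallel hits on one domain
        "AUTOTHROTTLE_ENABLED": True,          # back off when the server slows down
    }


settings = polite_settings(100)  # DOWNLOAD_DELAY of about 0.6 seconds
```

The resulting dict could be assigned to a spider's custom_settings attribute or copied into settings.py. Scrapy randomizes the actual wait around DOWNLOAD_DELAY by default, so the figure is an average rather than a fixed interval.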