We have recently moved to using Scrapy Cloud (and its API) as a decoupled crawling engine. We are using the Scrapy Cloud API to:
- start crawl jobs for a site
- check that a crawl has finished (and then)
- retrieve the content for a particular job.
Since we are integrating from a Java application, I have questions about the response times we should expect for these requests and about API request limits (I think we may be hitting them).
Is there any documentation on what error responses look like, which error status codes to expect, or what the request limits are?
A sample integration diagram can be found below, along with the errors.
Below are some of the Java errors we are seeing in our logs and what I believe is causing each one. Any help you can provide would be appreciated.
On Crawl Submission:
- Connection prematurely closed BEFORE response
I believe this happens because our client times out after N seconds, but the Scrapy Cloud Run API took longer than N seconds to respond.
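To tell a slow response apart from a failed connection, it may help to set the connect timeout and the response timeout separately. The following is only a minimal sketch using the JDK's java.net.http client (Java 11+), not the client we currently use. The class name, the timeout values, and the form body are hypothetical, and the endpoint is passed in by the caller because I am not sure which URL and parameters are correct.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper: separates "could not connect" from "connected but slow".
final class CrawlSubmitTimeouts {
    // Assumed value; tune against the latencies actually observed.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(5);

    // The connect timeout applies only to establishing the TCP/TLS connection.
    // Exceeding it raises java.net.http.HttpConnectTimeoutException on send().
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    // The request timeout covers waiting for the response.
    // Exceeding it raises java.net.http.HttpTimeoutException on send().
    static HttpRequest buildSubmitRequest(URI endpoint, String formBody, Duration responseTimeout) {
        return HttpRequest.newBuilder(endpoint)
                .timeout(responseTimeout)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody))
                .build();
    }
}
```

Logging which of the two exceptions is thrown would show whether the submission failures are slow responses or connection problems.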
On Crawl Validation:
- connection timed out: app.scrapinghub.com/22.214.171.124:443
I believe this means we could not connect to the Scrapy Cloud API at all.
- Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
- Failed to decode:Unrecognized token 'Too':
I believe both of these happen because the API request for job stats returned a non-JSON body, such as markup beginning with '<' or the plain text 'Too many requests', which we then tried to parse as JSON.
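If that guess is right, we could avoid the decode errors by inspecting the response before handing it to the JSON parser. The following is a minimal sketch of that idea, not a description of how the API actually behaves. The class and enum names are hypothetical, and the use of status 429 for rate limiting is an assumption (it is the standard HTTP code for "Too Many Requests", but I have not confirmed that Scrapy Cloud returns it).

```java
// Hypothetical guard that classifies a response before any JSON decoding.
final class ResponseGuard {
    enum Kind { JSON_OK, RATE_LIMITED, RETRYABLE_ERROR, CLIENT_ERROR, NOT_JSON }

    static Kind classify(int status, String body) {
        // Assumption: rate limiting is signalled with HTTP 429.
        if (status == 429) return Kind.RATE_LIMITED;
        if (status >= 500) return Kind.RETRYABLE_ERROR;
        if (status >= 400) return Kind.CLIENT_ERROR;
        // Cheap check: a JSON object or array starts with '{' or '['.
        // An HTML page ('<...') or plain text ('Too many requests') does not.
        String trimmed = body == null ? "" : body.trim();
        boolean looksJson = trimmed.startsWith("{") || trimmed.startsWith("[");
        return looksJson ? Kind.JSON_OK : Kind.NOT_JSON;
    }

    // Exponential backoff: base * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        // Clamp the shift so the multiplication cannot overflow for sane bases.
        int shift = Math.min(Math.max(attempt, 0), 20);
        return Math.min(maxMillis, baseMillis << shift);
    }
}
```

With this in place, only JSON_OK responses would reach the decoder; RATE_LIMITED and RETRYABLE_ERROR would be retried after backoffMillis(attempt, 500, 30_000), and NOT_JSON bodies would be logged verbatim so we can see what the API actually sent.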