
Items API bug

There seems to be a bug in the items API, specifically this endpoint: https://doc.scrapinghub.com/api/items.html#items-project-id-spider-id-job-id-item-no-field-name


Normally, I would expect that requesting items with an `:item_no` greater than the number of items fetched in total for the job would return an empty list of results. Indeed, this is the case for a job in spider A. However, with spider B, requesting items with an offset greater than the number of items results in a "loop" where it seems the same set of items is returned continually.


Why is this a problem? I'm retrieving the scraped items in batches for processing. After each batch is processed, I increment the `:item_no` offset; when the API returns no results, I know I've processed all the items in the job. With this bug, however, retrieving items in batches doesn't work, because the loop never terminates.
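A minimal sketch of the batch loop described above, with a guard against the repeating results. `fetch` is a hypothetical callable standing in for the HTTP request (it takes an offset and a count and returns a list of items as dicts), and the repeated-batch check is only an assumed workaround, not documented API behaviour.

```python
import json
from typing import Callable, Iterator, List


def iter_batches(
    fetch: Callable[[int, int], List[dict]],  # hypothetical: (offset, count) -> items
    count: int = 10,
) -> Iterator[List[dict]]:
    """Yield batches until one comes back empty or repeats an earlier batch."""
    offset = 0
    seen = set()  # fingerprints of batches already yielded
    while True:
        batch = fetch(offset, count)
        if not batch:
            return  # expected behaviour: offset is past the last item
        fingerprint = json.dumps(batch, sort_keys=True)
        if fingerprint in seen:
            return  # workaround: the API handed back a batch we already saw
        seen.add(fingerprint)
        yield batch
        offset += len(batch)
```

The guard only catches a batch that is identical to one already yielded, so it assumes items within a job are not exact duplicates of each other across whole batches. If the total item count of the job is known, stopping once the offset reaches that count is a simpler check.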


The URL I'm calling is:


https://storage.scrapinghub.com/items/123456?format=json&count=10&start=123456%2F1%2F47%2F900000


Where 123456 is my project number, and job 123456/1/47 contains 21499 items in total. Note that this issue still happens if the `count` parameter is changed. However, I've only seen it happen on one of my spiders so far (the other works as expected, returning an empty list of items once the offset is high enough).
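For anyone trying to reproduce this, the request URL above can be assembled roughly as follows. `build_items_url` is a hypothetical helper, and the parameter names are simply the ones visible in the URL.

```python
from urllib.parse import urlencode


def build_items_url(project_id: int, job_key: str, offset: int, count: int = 10) -> str:
    """Build the items URL; `start` is the job key followed by the item offset."""
    params = {
        "format": "json",
        "count": count,
        # urlencode percent-encodes the slashes, giving e.g. 123456%2F1%2F47%2F900000
        "start": f"{job_key}/{offset}",
    }
    return f"https://storage.scrapinghub.com/items/{project_id}?{urlencode(params)}"
```

Calling `build_items_url(123456, "123456/1/47", 900000)` gives the URL shown above.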


I think your approach will cause problems while the job is still running, because the item counter will keep increasing. I would suggest running your script only for completed jobs.

Sorry, I forgot to mention that in my original post: I am indeed only running this for completed jobs, yet still get the buggy results returned.
