There seems to be a bug in the items API, specifically this endpoint: https://doc.scrapinghub.com/api/items.html#items-project-id-spider-id-job-id-item-no-field-name
Normally, I would expect that requesting items with an `:item_no` greater than the number of items fetched in total for the job would return an empty list of results. Indeed, this is the case for a job in spider A. However, with spider B, requesting items with an offset greater than the number of items results in a "loop" where it seems the same set of items is returned continually.
Why is this a problem? I'm retrieving the scraped items for processing in batches. After each batch is processed, I increment the `:item_no` offset: when the API returns no results, I know I've processed all the items in the job. With this API bug, however, retrieving items in batches won't work, because the loop never terminates.
The URL I'm calling is:
https://storage.scrapinghub.com/items/123456?format=json&count=10&start=123456%2F1%2F47%2F900000
Where 123456 is my project number, and job 123456/1/47 contains 21499 items in total. Note that this issue still happens if the `count` parameter is changed. However, I've only seen it happen on one of my spiders so far (the other works as expected, returning an empty list of items once the offset is high enough).
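For illustration, here is a minimal Python sketch of the batching logic with a guard against the loop. It is only a possible workaround under stated assumptions, not a confirmed fix: it assumes the job's total item count (21499 here) is known in advance, and `fetch_batch` is a hypothetical callable standing in for whatever HTTP client performs the request. `build_items_url` and `iter_batches` are illustrative names, not part of any Scrapinghub library.

```python
from urllib.parse import urlencode

BASE_URL = "https://storage.scrapinghub.com/items"


def build_items_url(project_id, job_key, offset, count=10):
    """Build the items URL in the same shape as the one quoted above."""
    # urlencode percent-encodes the slashes in the start key (/ -> %2F).
    query = urlencode(
        {"format": "json", "count": count, "start": f"{job_key}/{offset}"}
    )
    return f"{BASE_URL}/{project_id}?{query}"


def iter_batches(fetch_batch, total_items, count=10):
    """Yield batches of items without relying on an empty response.

    fetch_batch(offset, count) is a hypothetical callable that returns a
    list of items. The loop stops on an empty batch (the expected API
    behaviour) OR once the offset reaches total_items, so it terminates
    even if the API keeps returning items past the end of the job.
    """
    offset = 0
    while offset < total_items:
        batch = fetch_batch(offset, count)
        if not batch:
            break
        # Drop anything beyond the known end of the job.
        batch = batch[: total_items - offset]
        yield batch
        offset += len(batch)
```

With this guard, a client that knows the job's item count stops at that count instead of waiting for an empty list that never arrives.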
2 Comments
David Sulc posted over 5 years ago
Sorry, I forgot to mention that in my original post: I am indeed only running this for completed jobs, yet still get the buggy results returned.
aurish_hammad_hafeez posted over 5 years ago Admin
I think your approach will cause problems when the job is in the running state, because the counter will keep increasing. I would suggest that you only run your script for completed jobs.