If you need to provide data to a spider within a given project, you can use the API or the python-scrapinghub library to store the data in collections.
You can use collections to store an arbitrary number of records which are indexed by a key. Projects often use them as a single location to write data from multiple jobs.
The example below shows how you can create a collection and add some data:
$ curl -u APIKEY: -X POST -d '{"_key": "first_name", "value": "John"}{"_key": "last_name", "value": "Doe"}' https://storage.zyte.com/collections/79855/s/form_filling
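If you'd rather make the same call from Python without the client library, a plain HTTP request works too. The following is a minimal sketch using the third-party requests package; the API key, project ID (79855), and collection name (form_filling) are the same placeholders as in the curl example above.

import requests

API_KEY = "APIKEY"  # placeholder: your API key
URL = "https://storage.zyte.com/collections/79855/s/form_filling"

# The payload mirrors the curl example: concatenated JSON objects,
# one per record, each carrying a _key and a value.
payload = (
    '{"_key": "first_name", "value": "John"}'
    '{"_key": "last_name", "value": "Doe"}'
)

# The API key is sent as the basic-auth username with an empty
# password, just like `curl -u APIKEY:`.
response = requests.post(URL, data=payload, auth=(API_KEY, ""))
response.raise_for_status()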
To retrieve the data, you would then simply do:
$ curl -u APIKEY: -X GET "https://storage.zyte.com/collections/79855/s/form_filling?key=first_name&key=last_name"
{"value":"John"}
{"value":"Doe"}
And finally, you can delete the data by sending a DELETE request:
$ curl -u APIKEY: -X DELETE "https://storage.zyte.com/collections/79855/s/form_filling"
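And the same deletion from Python, under the same assumptions as the sketches above:

import requests

API_KEY = "APIKEY"  # placeholder: your API key
URL = "https://storage.zyte.com/collections/79855/s/form_filling"

# Sends the same DELETE request as the curl example above.
response = requests.delete(URL, auth=(API_KEY, ""))
response.raise_for_status()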
Using python-scrapinghub programmatically
As mentioned before, the python-scrapinghub library can be used to handle the API calls programmatically. Here's a sample script that shows how to use the library in a simple Python program:
from scrapinghub import ScrapinghubClient

API_KEY = 'APIKEY'
PROJECT_ID = '12345'
COLLECTION = 'collection-name'

client = ScrapinghubClient(API_KEY)
project = client.get_project(PROJECT_ID)
collection = project.collections.get_store(COLLECTION)

# Store a record under the given key
collection.set({
    '_key': '002d050ee3ff6192dcbecc4e4b4457d7',
    'value': '1447221694537'
})

collection.get('002d050ee3ff6192dcbecc4e4b4457d7')  # Returns {'value': '1447221694537'}

collection.iter()  # Returns a generator object
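Since iter() returns a generator, you'll typically loop over it rather than call it on its own. The short sketch below assumes the same collection object from the script above, that each yielded item is a dict carrying _key and value fields, and that the library's delete() method is available for removing a record by key:

# Iterate over every record in the collection
for item in collection.iter():
    print(item['_key'], item['value'])

# Remove a single record by its key
collection.delete('002d050ee3ff6192dcbecc4e4b4457d7')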
You can find more information about the library's full API in its documentation.