Running a Scrapy spider

Modified on Fri, 12 Feb, 2021 at 9:26 AM

⚠️ You can also deploy your project from a GitHub repository, without needing shub.

You will need shub, the Zyte command line client, to deploy projects to Scrapy Cloud. The following command installs it, or upgrades it to the latest version if you already have it installed:

$ pip install shub --upgrade

The next step is to deploy your Scrapy project to Scrapy Cloud. You will need your API key and the numeric ID of your Scrapy Cloud project. You can find both of these on your project’s Code & Deploys page. First, run:

$ shub login

to save your API key to a local file (~/.scrapinghub.yml). You can delete it from there anytime via shub logout. Next, run:

$ shub deploy

to be guided through a wizard that will set up the project configuration file (scrapinghub.yml) for you. After you complete the wizard, your project will be uploaded to Scrapy Cloud. You can re-trigger deployment (without having to go through the wizard again) anytime via another call to shub deploy.

Now you can schedule your spider to run on Scrapy Cloud:

$ shub schedule quotes-toscrape

Spider quotes-toscrape scheduled, job ID: 99830/1/1
Watch the log on the command line:
    shub log -f 1/1
or print items as they are being scraped:
    shub items -f 1/1
or watch it running in Zyte's web interface:
    https://app.zyte.com/p/99830/job/1/1

Then watch it run. Replace 1/1 with the job ID that shub printed for the previous command; you can leave out the project ID:

$ shub log -f 1/1
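
If you script these steps, the job ID can be read from the shub schedule output instead of being copied by hand. The following is a minimal Python sketch, assuming the confirmation line has the format shown above. The names extract_job_id and short_job_id are hypothetical helpers for illustration, not part of shub.

```python
import re

# Matches the confirmation line printed by "shub schedule", e.g.
# "Spider quotes-toscrape scheduled, job ID: 99830/1/1".
# Assumption: the ID is always three numbers separated by slashes.
JOB_ID_PATTERN = re.compile(r"job ID:\s*(\d+)/(\d+)/(\d+)")


def extract_job_id(schedule_output):
    """Return (project_id, spider_id, job_id) found in the output text."""
    match = JOB_ID_PATTERN.search(schedule_output)
    if match is None:
        raise ValueError("no job ID found in shub output")
    return tuple(int(part) for part in match.groups())


def short_job_id(full_id):
    """Drop the project ID, giving the spider/job form used with shub log."""
    _project_id, spider_id, job_id = full_id
    return f"{spider_id}/{job_id}"


output = "Spider quotes-toscrape scheduled, job ID: 99830/1/1"
full_id = extract_job_id(output)
print(full_id)                # (99830, 1, 1)
print(short_job_id(full_id))  # 1/1
```

The short form printed last is the value you would pass to shub log -f or shub items -f.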

Alternatively, you can go to your project page and schedule the spider there, then select your spider.

You will be redirected to the project dashboard, where you can check visually whether your spider is running correctly and see the job that was created, along with its items, requests, and other details.

Once the job finishes, it is moved automatically to the completed jobs.

To see how these identifiers fit together, click the job link (2/3 in this case) to open the job description, then look at the address bar in your browser. Suppose it shows the following URL:

https://app.zyte.com/p/166395/2/3

The information gathered from this address is:

Project_id: 166395
Spider_id: 2
Job_id: 3

If you run this spider again, the only change will be the job_id, which will be 4 in this case.
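
The same breakdown can be sketched in Python. This is an illustration only: parse_job_url and JobKey are hypothetical names, and the code assumes the numeric path segments always appear in project, spider, job order.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class JobKey(NamedTuple):
    project_id: int
    spider_id: int
    job_id: int


def parse_job_url(url):
    """Split a job URL into its project, spider and job IDs.

    Assumption: the numeric path segments appear in the order
    project / spider / job; non-numeric segments such as "p" are skipped.
    """
    segments = urlparse(url).path.split("/")
    numbers = [int(s) for s in segments if s.isdigit()]
    if len(numbers) != 3:
        raise ValueError(f"expected 3 numeric IDs in {url!r}")
    return JobKey(*numbers)


key = parse_job_url("https://app.zyte.com/p/166395/2/3")
print(key)  # JobKey(project_id=166395, spider_id=2, job_id=3)

# Running the same spider again changes only the job ID (4 in this example).
next_key = key._replace(job_id=key.job_id + 1)
print(next_key.job_id)  # 4
```

Skipping non-numeric segments keeps the sketch tolerant of URLs that carry an extra path component, such as the job link printed by shub schedule earlier.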

Keeping these identifiers straight helps avoid confusion between projects, spiders, and jobs.

Enjoy creating, deploying and scraping with us!