Here at Zyte we are big fans of Heroku. When people ask what Scrapy Cloud is about, we sometimes say that "it's like Heroku, but for web crawlers". Having Heroku as a role model pushes us to get better every single day.


We are sure that many of you are Heroku fans too, and we've heard that a lot of people deploy Scrapy spiders on Heroku. So, here is a comparison of Heroku and Scrapy Cloud to help you decide which one better fits your needs for deploying and running Scrapy spiders.


Let's start by looking at the deployment process on both platforms.


Scrapy Cloud

Assuming you already have a Scrapy Cloud account and a Scrapy Cloud project, all you need to deploy a project to Scrapy Cloud is shub, the Zyte command line client. Go to your local project's folder and run:


$ shub deploy


Provide the Scrapy Cloud project ID when asked and follow the steps. That's it: once the deploy finishes, you can go to the project dashboard on Scrapy Cloud and manage your crawler there.
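If you just want to kick off a run from the terminal instead of the dashboard, shub can also schedule jobs. A minimal sketch, assuming a spider named somespider in the project you just deployed:


$ shub schedule somespider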



Scrapyd via Heroku


Here we have some options:

  • Deploy only the Scrapy project and run the spiders via the command line (heroku run)

  • Deploy the Scrapy project and build a web UI to control spider execution

  • Deploy the Scrapy project and Scrapyd, a service to run Scrapy spiders


We decided to follow the last approach. It's the one that comes closest to Scrapy Cloud, because Scrapyd provides an HTTP API to manage spider execution, as well as a very simple web UI that you can use to view logs, job information and the extracted data. This way we have an effective interface to our spiders and we don't have to reinvent the wheel.


scrapy-heroku

The Scrapy community is awesome. Thanks to its restless work, we can combine Scrapyd (an open source project) with Heroku via scrapy-heroku. By default, Heroku doesn't support Scrapyd because the latter depends on sqlite3, which can't be used on the former. The scrapy-heroku package overcomes this issue by adding PostgreSQL support to Scrapyd.


Now, let's go through the steps required to deploy a Scrapy crawler + Scrapyd on Heroku.


Walkthrough

Assuming that you have already created a Heroku account and a Heroku app, the next step is to set up a Postgres database, which is as simple as enabling an add-on in the Resources tab.
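If you prefer the command line, the same Postgres add-on can be attached with the Heroku Toolbelt. A sketch, assuming your app is called your-app-name (Heroku Postgres exposes its connection string through the DATABASE_URL config var):


$ heroku addons:create heroku-postgresql --app your-app-name
$ heroku config:get DATABASE_URL --app your-app-name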


Deployment options

Heroku provides three different ways to deploy your crawlers:


  • Heroku Git: a git repository with Heroku as the upstream repo

  • GitHub: automatically deploys the project for every new changeset pushed to GitHub

  • Dropbox: grabs the code from a Dropbox folder and allows you to deploy it via the web UI


We opted for the first one, and the steps are really straightforward: set up a local git repo, install the Heroku Toolbelt and use it to add Heroku as the upstream repository. Then, every changeset you push to Heroku will trigger a new build.


Preparing the Project

We will use the crawler from this repository as the example for this walkthrough. There are some things that you have to change in your project in order to make it work with Scrapyd on Heroku.


1. Add the dependencies to your project's requirements.txt. Heroku will automatically install them when you trigger a build:


scrapy
scrapyd
scrapy-heroku
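Heroku installs exactly what is listed in requirements.txt, so pinning versions keeps builds reproducible. One way to do that (a side note, not required for the walkthrough) is to install the dependencies in a local virtualenv and let pip record the exact versions:


$ pip install scrapy scrapyd scrapy-heroku
$ pip freeze > requirements.txt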


2. Configure Scrapyd in your project's scrapy.cfg:


[scrapyd]
application = scrapy_heroku.app.application


[deploy]
url = http://<YOUR_HEROKU_APP_NAME>.herokuapp.com:80/
project = <YOUR_SCRAPY_PROJECT_NAME>
username = <A_USER_NAME>
password = <A_PASSWORD>
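The [deploy] section above is the one read by scrapyd-deploy (from the scrapyd-client package) if you also want to push your Scrapy project as an egg to the running Scrapyd instance. A minimal sketch, assuming the app is already live and you run it from the project root:


$ pip install scrapyd-client
$ scrapyd-deploy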


3. Create a file called Procfile in your project's root containing the command to be executed when the app starts:


web: scrapyd


Deploying the project

To deploy the project, first make sure that the Heroku Toolbelt is installed on your machine. Then, use it to add a new remote to your project's Git repo:


$ heroku login
$ heroku git:remote -a your-app-name


Once you've done that, commit and push your changes to Heroku, triggering the build process:


$ git add .
$ git commit -m "settings to run with scrapyd on heroku"
$ git push heroku master
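Once the build finishes, you can check that Scrapyd came up as the web process and follow its output with the usual Heroku commands:


$ heroku ps
$ heroku logs --tail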


Using Scrapyd

Now, if you go to http://your-app-name.herokuapp.com, you should see the Scrapyd welcome page.
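You can also check the service from the command line by hitting one of Scrapyd's read-only endpoints, for example listprojects.json:


$ curl http://your-app-name.herokuapp.com/listprojects.json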



Running a Spider

To run a spider with Scrapyd, you have to make an API call:


$ curl http://your-app-name.herokuapp.com/schedule.json -d project=your-scrapy-project -d spider=somespider
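Any extra -d parameter is passed to the spider as an argument, and the special setting parameter overrides a Scrapy setting for that run. A sketch with a hypothetical category argument:


$ curl http://your-app-name.herokuapp.com/schedule.json -d project=your-scrapy-project -d spider=somespider -d category=electronics -d setting=DOWNLOAD_DELAY=2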


Checking the Running Jobs


Scrapyd provides a very simple web UI where you can see things like jobs, items and logs. The UI is more of a report than a control panel: you can't control your spiders or change configuration through it. All you can do is see what's happening by manually refreshing its pages.



The Scrapyd API is very limited; it doesn't even let you retrieve the scraped items. Perhaps the most useful endpoints are schedule.json and cancel.json, since you'll use them to control your spiders' execution.
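For example, you can list a project's pending, running and finished jobs and then cancel one of them by its job ID. A sketch, reusing the project name from the schedule call above:


$ curl "http://your-app-name.herokuapp.com/listjobs.json?project=your-scrapy-project"
$ curl http://your-app-name.herokuapp.com/cancel.json -d project=your-scrapy-project -d job=<JOB_ID>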


You can view the downloaded items by clicking on the job's "Items" link.



Scrapy Cloud vs Heroku


Here is a side-by-side comparison of the features:



Spider management

  • Scrapy Cloud: management via a full-featured dashboard

  • Scrapyd (Heroku): via the Scrapyd web UI (very limited)


Job management

  • Scrapy Cloud: API call, UI, command line tool or client library

  • Scrapyd (Heroku): Scrapyd HTTP API call


Pricing

  • Scrapy Cloud: free (one unit, 1 hour job limit); $9.00 per unit (1 GB RAM, 2.5 GB of disk)

  • Scrapyd (Heroku): free (half dyno); Teams: free for up to 5 users, $25.00/month for 6 - 25 users