The environment where your spiders run on Scrapy Cloud comes with a set of pre-installed packages. However, sometimes you'll need extra packages that aren't available by default. If that's the case, follow this article to learn how to deploy those extra packages to Scrapy Cloud.
First, make sure you're using the latest version of shub, the Zyte command line tool. You can do so by downloading the latest release binary or, if you are a pip user, by running:
$ pip install shub --upgrade
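You can confirm which version you have installed with:
$ shub version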
Here's what you have to do to deploy Python dependencies for your project:
- Create a requirements.txt file in your project's root folder, listing your extra dependencies.
- Set that file as the requirements value in your project's scrapinghub.yml configuration file.
- Deploy your project with the dependencies.
1. Creating the requirements.txt file
This is a regular text file where you list the Python packages that your project depends on, one package per line. For example:
js2xml==0.2.1
extruct==0.1.0
requests==2.6.0
You should always pin the specific version of each of your dependencies, as we did in the example above. By doing this, you avoid your spiders breaking due to unexpected upgrades.
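If you're not sure which versions you're currently using, one way to produce a fully pinned file is pip freeze. Note that it dumps every package installed in your environment, so you may want to trim the output down to your project's actual dependencies:
$ pip freeze > requirements.txt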
2. Configuring your dependencies in scrapinghub.yml
After creating the requirements file, add the requirements setting to scrapinghub.yml and point it to your project's requirements.txt path:
projects:
  default: 12345
requirements:
  file: requirements.txt
Note: if there's no scrapinghub.yml file in your project folder, you should run shub deploy once to generate it.
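For reference, a freshly generated scrapinghub.yml contains little more than your project ID (12345 below is a placeholder), which is why you need to add the requirements setting yourself:
projects:
  default: 12345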
3. Deploying your project
Now that you've set your project dependencies, it's time to deploy your project. Just run a regular shub deploy and your spiders will be able to use the extra dependencies on Scrapy Cloud.
$ shub deploy
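By default, this deploys to the default project listed in scrapinghub.yml; you can also pass a project ID explicitly, for example:
$ shub deploy 12345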
Frequently Asked Questions
The library I need is not available on PyPI. What can I do?
Let's say your project depends on the coolest-lib-ever library. However, the library maintainer doesn't provide any PyPI package for it (i.e., you can't pip install it). To overcome that, add the library's GitHub/Bitbucket repository address to your project's requirements.txt, like so:
git+https://github.com/coolest-dev/coolest-lib-ever.git
You can also point your requirements file to a specific tag/branch from that repository:
git+https://github.com/coolest-dev/coolest-lib-ever.git@v1.0.1
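You can also pin to an exact commit, which is the safest option since tags and branches can move over time (the hash below is just a placeholder):
git+https://github.com/coolest-dev/coolest-lib-ever.git@7a91b3c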
After that, re-deploy your project.
Can I deploy private packages to Scrapy Cloud?
Yes, you can. Check out this article for more information: Deploying Private Dependencies to Scrapy Cloud.
What does an "Internal build error" mean?
It probably means that your project is trying to import a module that is not available by default on Scrapy Cloud. Look for lines starting with ImportError in the logs printed by shub deploy to find the offending library.
ImportError: No module named extruct.w3cmicrodata
{"message": "List exit code: 1", "details": null, "error": "build_error"}
{"status": "error", "message": "Internal build error"}
Error: Deploy failed: '{"status": "error", "message": "Internal build error"}'
Alternatively, you can also find the build logs on your project's Code & Deploys page on Scrapy Cloud.
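In the example log above, the build fails because the extruct module is missing, so the fix is to add the corresponding package, pinned to a version, to requirements.txt and re-deploy:
extruct==0.1.0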
How can I deploy non-Python dependencies to Scrapy Cloud?
If your project depends on non-Python requirements such as binaries, you have to deploy a custom Docker image with your packages to Scrapy Cloud.
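As a rough sketch, such a Dockerfile extends one of the scrapinghub-stack-scrapy base images and installs the system packages your project needs. The base image tag and the libpq-dev package below are illustrative assumptions only; adjust them to your stack:
FROM scrapinghub/scrapinghub-stack-scrapy:2.11
# libpq-dev stands in for whatever system-level (non-Python) dependency you need
RUN apt-get update && apt-get install -y libpq-dev && rm -rf /var/lib/apt/lists/*
# Install the project's Python requirements and copy the project code into the image
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
COPY . /app
WORKDIR /app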
Things to keep in mind
Don't add requirements in editable mode
Zyte doesn't support package installation in editable mode (also known as setuptools develop mode), e.g.:
-e https://github.com/scrapinghub/extruct/archive/10cbb3a.zip#egg=extruct==0.1.0
If your requirements.txt contains a line starting with -e, please remove this prefix:
https://github.com/scrapinghub/extruct/archive/10cbb3a.zip#egg=extruct==0.1.0
Specify each requirement version
The build process aggressively caches requirements, so pointing to a non-specific version of a requirement is not a good idea: you can't be sure which version of the package will end up in the build.
Good practice:
js2xml==0.2.1
extruct==0.1.0
requests==2.6.0
git+git://github.com/scrapinghub/extruct@10cbb3a#egg=extruct==0.1.0
Bad practice:
js2xml
extruct
requests
git+git://github.com/scrapinghub/extruct@10cbb3a#egg=extruct