Deploying private dependencies to Scrapy Cloud

Modified on Sun, 15 Sep at 9:40 AM

This article presents several approaches to using private dependencies in your Scrapy Cloud project.


Using requirements.txt


Let's assume your private dependency is located in some git repository.


A straightforward way is to embed the credentials in the git repository URL:


git+https://<user>:<password>@github.com/nsa/secret.git


Another option, if you use GitHub, is to issue a GitHub personal access token and provide it instead:


git+https://<token>@github.com/nsa/secret.git


Note that the token grants access to all of your Git repositories and should be treated like a password.


There's another option based on requirements.txt, although it requires some extra setup: you can run your own private PyPI server (for example, devpi). It can then be used in requirements.txt like this:


--extra-index-url <Repo-URL>
my-pkg==0.0.1


However, to keep the packages private, you have to enable authentication on the server and, much as with git, provide credentials so that Scrapy Cloud can install your private dependencies.
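
For example, the credentials can be embedded in the index URL much like the git URLs above (assuming a hypothetical devpi server at pypi.example.com):

--extra-index-url https://<user>:<password>@pypi.example.com/simple/
my-pkg==0.0.1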


Using a custom Docker image

Using a custom Docker image allows you to customize many aspects of the build, including how private dependencies are installed.

The first approach uses SSH keys. It assumes that you have:

  1. a requirements.txt file that contains an entry for the private repository (see the example below)
  2. a pair of SSH keys: id_rsa and id_rsa.pub
    * you've added id_rsa.pub as a deploy key for the private repository
    * you've copied id_rsa to your project directory
  3. a project configured to use a custom Docker image (check this blog post for more details)
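
For step 1, the requirements.txt entry should use the SSH form of the repository URL, for example (for the same hypothetical repository as above):

git+ssh://git@github.com/nsa/secret.git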


Add the following lines to your Dockerfile before the pip install -r requirements.txt statement:


RUN mkdir -p /root/.ssh
COPY ./id_rsa /root/.ssh/id_rsa
# ssh refuses private keys with permissive file modes
RUN chmod 600 /root/.ssh/id_rsa
RUN ssh-keyscan -t rsa github.com > /root/.ssh/known_hosts


This assumes that the private repository is hosted on github.com. If it's hosted on another domain, replace github.com in the last line with your repository's domain.
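
For example, for a hypothetical self-hosted GitLab instance:

RUN ssh-keyscan -t rsa gitlab.example.com > /root/.ssh/known_hosts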


Then you should continue with the guide and deploy your project.


Using a vendor folder is an alternative that doesn't require generating SSH keys or storing them in the image (or in the repository). The idea is simple:

  1. clone the private dependencies locally into a known subdirectory such as vendor or libs (see the example below)
  2. copy that directory in the Dockerfile (a minimal excerpt is shown further below)
  3. finally, reference the folders in requirements.txt
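
For example, for the same hypothetical repository as above:

git clone git@github.com:nsa/secret.git vendor/myprivatelibrary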


The project structure looks like this:


.
├── Dockerfile
├── requirements.txt
├── scrapinghub.yml
├── scrapy.cfg
├── setup.py
├── scrapyprojectname
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middleware.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── supercoolspider.py
└── vendor
    └── myprivatelibrary
        ├── setup.py
        ├── i_am_not_public
        │   ├── __init__.py
        │   ├── remote.py
        │   └── utils.py
        ...


Then requirements.txt looks like:

Scrapy==1.1.0
tldextract==1.0.0
-e vendor/myprivatelibrary
# or simply
vendor/myprivatelibrary


Be sure that the vendor path is copied in the Dockerfile and not ignored by the .dockerignore file.
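
A minimal Dockerfile excerpt for this, assuming the image builds the project under /app (adjust the paths to match your own Dockerfile):

# relative paths in requirements.txt resolve against the working directory
WORKDIR /app
COPY ./requirements.txt .
COPY ./vendor ./vendor
RUN pip install -r requirements.txt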


There's another variation of the vendoring approach: combining it with Git submodules. The advantage is that it keeps local development in sync with the image build environment.
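
A minimal sketch, assuming the same hypothetical repository and vendor layout as above:

git submodule add git@github.com:nsa/secret.git vendor/myprivatelibrary
git submodule update --init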
