Deploying private dependencies to Scrapy Cloud

Modified on Sun, 15 Sep, 2024 at 9:40 AM

This article presents some approaches to using private dependencies in your Scrapy Cloud project.


Using requirements.txt


Let's assume your private dependency is located in a Git repository.


A straightforward way is to embed the credentials in the Git repository URL:


git+https://<user>:<password>@github.com/nsa/secret.git


Another option, if you use GitHub, is to issue a GitHub personal access token and provide it instead:


git+https://<token>@github.com/nsa/secret.git


Depending on its scopes, the token may grant access to all of your Git repositories, so treat it like a password.
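To avoid typing the secret into requirements.txt by hand, the line can be generated from an environment variable at deploy time. The following is a minimal sketch, not an official Scrapy Cloud tool: the function name and the GIT_TOKEN variable are hypothetical. It percent-encodes the token so that characters such as @ or / cannot break the URL.

```python
import os
from urllib.parse import quote


def git_requirement(host_path: str, token: str) -> str:
    """Return a pip requirement line for a private Git repository over HTTPS."""
    # Percent-encode the secret so reserved characters stay inside the userinfo part.
    return f"git+https://{quote(token, safe='')}@{host_path}"


if __name__ == "__main__":
    # GIT_TOKEN is an assumed variable name; the fallback is only a placeholder.
    token = os.environ.get("GIT_TOKEN", "example-token")
    print(git_requirement("github.com/nsa/secret.git", token))
```

The printed line can be appended to requirements.txt just before deploying, which keeps the token out of version control.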


There's another option related to requirements.txt, although it requires some setup: you can run your own private PyPI server (for example, devpi) and reference it like this:


--extra-index-url <Repo-URL>
my-pkg==0.0.1


However, to keep the packages private you have to enable authorization on the server and, similarly, provide credentials so that Scrapy Cloud can install your private dependencies.
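pip accepts basic-auth credentials embedded in the index URL, so one way to provide them is to inject the user and password into the URL when the requirements file is generated. This is a rough sketch under that assumption; the function names, the pypi.example.com host and the deploy user are all made up for illustration.

```python
from typing import List
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(index_url: str, user: str, password: str) -> str:
    """Return index_url with percent-encoded user:password added to the host part."""
    parts = urlsplit(index_url)
    userinfo = f"{quote(user, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def render_requirements(index_url: str, user: str, password: str, pins: List[str]) -> str:
    """Build the text of a requirements file that points at the private index."""
    lines = [f"--extra-index-url {with_credentials(index_url, user, password)}"]
    lines.extend(pins)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical server and credentials, shown only to illustrate the output shape.
    print(render_requirements("https://pypi.example.com/simple/", "deploy", "p@ss", ["my-pkg==0.0.1"]))
```

As with a Git token, the resulting file contains a secret and should not be committed.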


Using a custom Docker image

A custom Docker image lets you customize many aspects of the build, including how private dependencies are installed.

The first approach uses SSH keys. It assumes you have:

  1. a requirements.txt file that contains an entry for the private repository
  2. a pair of SSH keys, id_rsa and id_rsa.pub, where:
    * you've added id_rsa.pub as a deploy key for the private repository
    * you've copied id_rsa to your project directory
  3. a project configured to use a custom Docker image (check this blog post for more details)


Add the following lines to your Dockerfile before the statement that installs requirements.txt with pip install.


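# Install the deploy key so pip can clone the private repository over SSH
RUN mkdir -p /root/.ssh
COPY ./id_rsa /root/.ssh/id_rsa
RUN chmod 600 /root/.ssh/id_rsa
# Trust the host key up front so the clone does not stop at a prompt
RUN ssh-keyscan -t rsa github.com > /root/.ssh/known_hosts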


This example assumes that the private repository is hosted on github.com. If it is hosted on another domain, replace github.com in the last line with that domain.
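For the deploy key to be used, the requirements.txt entry for the private repository has to be an SSH URL rather than an HTTPS one. pip understands the git+ssh:// scheme; the helper below is a small sketch (its name is hypothetical) that rewrites an HTTPS-style entry into that form and drops any embedded credentials. It assumes the conventional git SSH user, which may differ on self-hosted servers.

```python
from urllib.parse import urlsplit


def to_ssh_requirement(https_requirement: str, ssh_user: str = "git") -> str:
    """Convert a git+https:// requirement into its git+ssh:// equivalent."""
    prefix = "git+https://"
    if not https_requirement.startswith(prefix):
        raise ValueError("expected a requirement starting with git+https://")
    # Parse the URL without the leading "git+" marker used by pip.
    parts = urlsplit(https_requirement[len("git+"):])
    if parts.hostname is None:
        raise ValueError("no host found in the requirement URL")
    # hostname excludes any user:password part, so old credentials are discarded.
    return f"git+ssh://{ssh_user}@{parts.hostname}{parts.path}"


if __name__ == "__main__":
    print(to_ssh_requirement("git+https://github.com/nsa/secret.git"))
```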


Then continue with the guide and deploy your project.


The second approach, using a vendor folder, is an alternative that doesn't require generating SSH keys or storing them in the image or the repository. The idea is simple:

  1. clone the private dependencies locally under a known subdirectory such as vendor or libs
  2. copy them into the image in the Dockerfile
  3. reference the folders in requirements.txt


The project structure looks like this:


.
├── Dockerfile
├── requirements.txt
├── scrapinghub.yml
├── scrapy.cfg
├── setup.py
├── scrapyprojectname
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middleware.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── supercoolspider.py
└── vendor
    └── myprivatelibrary
        ├── setup.py
        ├── i_am_not_public
        │   ├── __init__.py
        │   ├── remote.py
        │   └── utils.py
        ...


Then requirements.txt looks like:

Scrapy==1.1.0
tldextract==1.0.0
-e vendor/myprivatelibrary
# or simply
vendor/myprivatelibrary


Make sure the vendor path is copied in the Dockerfile and is not excluded by the .dockerignore file.
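A quick pre-build check can catch a vendored library that was never cloned. The sketch below is only an illustration with hypothetical function names: it lists the local-path entries in requirements.txt and reports those that do not contain a setup.py or pyproject.toml under the project root. It does not parse .dockerignore, so that file still has to be reviewed by hand.

```python
from pathlib import Path
from typing import List


def local_paths(requirements_text: str) -> List[str]:
    """Return requirement entries that look like local directories."""
    found = []
    for raw in requirements_text.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue
        for flag in ("-e ", "--editable "):
            if line.startswith(flag):
                line = line[len(flag):].strip()
        # Skip other options and anything that is a URL.
        if line.startswith("-") or "://" in line:
            continue
        if "/" in line or line.startswith("."):
            found.append(line)
    return found


def missing_paths(requirements_text: str, project_root: str) -> List[str]:
    """Return local entries without a setup.py or pyproject.toml on disk."""
    root = Path(project_root)
    markers = ("setup.py", "pyproject.toml")
    return [
        entry
        for entry in local_paths(requirements_text)
        if not any((root / entry / name).is_file() for name in markers)
    ]


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.is_file():
        print(missing_paths(req_file.read_text(), "."))
```

An empty list means every local entry resolves to an installable directory.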


A variation of the vendoring approach is to combine it with Git submodules. The advantage is that local development matches the image build environment.
