This article presents several approaches to using private dependencies in your Scrapy Cloud project.
Using requirements.txt
Let's assume your private dependency is located in some git repository.
A straightforward way is to embed the credentials in the Git repository URL listed in requirements.txt:
git+https://<user>:<password>@github.com/nsa/secret.git
Another option, if you use GitHub, is to issue a GitHub personal access token and provide it instead:
git+https://<token>@github.com/nsa/secret.git
The token also provides access to all of your Git repositories and should be treated as a password.
There's another option related to requirements.txt, although it requires some setup: you can run your own private PyPI server (for example, devpi). It can be used like this:
--extra-index-url <Repo-URL>
my-pkg==0.0.1
However, to keep the packages private you have to enable authorization on the server and, in a similar way, provide credentials so that Scrapy Cloud can install your private dependencies.
Using a custom Docker image
A custom Docker image lets you control the whole build, including how private dependencies are installed.
The first approach uses SSH keys. It assumes you have:
- a requirements.txt file that contains an entry for the private repository
- a pair of SSH keys, id_rsa and id_rsa.pub:
  * you've added id_rsa.pub as a deployment key for the private repository
  * you've copied id_rsa to your project directory
- a configured project using a custom Docker image (check this blog post for more details)
Add the following lines to your Dockerfile, before the statement that runs pip install on requirements.txt.
RUN mkdir -p /root/.ssh
COPY ./id_rsa /root/.ssh/id_rsa
RUN ssh-keyscan -t rsa github.com > /root/.ssh/known_hosts
This assumes that the private repository is hosted on github.com. If it's hosted on another domain, change the last line to scan that domain instead.
Then you should continue with the guide and deploy your project.
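To show where the SSH setup sits relative to the install step, here is a minimal sketch of the relevant part of a Dockerfile. The base image, the /app working directory and the chmod line are assumptions that are not part of the original guide; keep whatever your own custom image already uses. It also assumes that the base image ships git and an SSH client, and that the requirements.txt entry points at the repository over SSH (for example git+ssh://git@github.com/nsa/secret.git).

```dockerfile
# Placeholder base image (assumption); keep the FROM line from your own Dockerfile.
FROM python:3

WORKDIR /app

# SSH setup has to come before the install step so pip can clone the private repository.
RUN mkdir -p /root/.ssh
COPY ./id_rsa /root/.ssh/id_rsa
# Assumption: tighten permissions, since ssh rejects private keys readable by other users.
RUN chmod 600 /root/.ssh/id_rsa
RUN ssh-keyscan -t rsa github.com > /root/.ssh/known_hosts

# Only now install the requirements, including the private one.
COPY ./requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
```

Keep in mind that the private key ends up stored in the image layers, so anyone who can pull the image can read it.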
Using a vendor folder is an alternative that doesn't require generating SSH keys or storing them in the image (or in the repository). The idea is simple:
- clone the private dependencies locally under a known subdirectory such as vendor or libs
- copy them into the image in the Dockerfile
- reference the folders in requirements.txt
Project structure looks like this:
.
├── Dockerfile
├── requirements.txt
├── scrapinghub.yml
├── scrapy.cfg
├── setup.py
├── scrapyprojectname
│   ├── __init__.py
│   ├── __pycache__
│   ├── items.py
│   ├── middleware.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── supercoolspider.py
└── vendor
    └── myprivatelibrary
        ├── setup.py
        ├── i_am_not_public
        │   ├── __init__.py
        │   ├── remote.py
        │   └── utils.py
        ...
Then requirements.txt looks like:
Scrapy==1.1.0
tldextract==1.0.0
-e vendor/myprivatelibrary  # or simply vendor/myprivatelibrary
Make sure the vendor path is copied in the Dockerfile and not ignored by the .dockerignore file.
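As an illustration of the copy step, a Dockerfile for the vendor approach might contain something like the sketch below. The base image and the /app working directory are assumptions, not something prescribed by this article; adapt them to your own custom image.

```dockerfile
# Placeholder base image (assumption); keep the FROM line from your own Dockerfile.
FROM python:3

WORKDIR /app

# Copy the vendored libraries before installing, so the local paths in requirements.txt exist.
COPY ./vendor /app/vendor
COPY ./requirements.txt /app/requirements.txt

# requirements.txt and the working directory are both /app,
# so vendor/myprivatelibrary resolves to /app/vendor/myprivatelibrary.
RUN pip install --no-cache-dir -r requirements.txt
```

If you install the library with -e, the /app/vendor directory has to stay in the final image, because an editable install points back at the source folder instead of copying it.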
There's a variation of the vendoring approach: combining it with Git submodules. The advantage is that local development matches the image build environment.