⚠️  Note: Headless browser support is now available in all plans.


Crawling with a headless browser is different from traditional approaches. Conventional (browser-less) spiders give you full control over each request and the sequence of requests: roughly speaking, they program the network/transport level to traverse a website in a specific way. You control whether they access static assets (if you need them) and how they load resources beyond the plain HTML response (please note that requests to static assets count towards the request quota of each Zyte Smart Proxy Manager (formerly Crawlera) plan, unless they are explicitly filtered out in the spider).

Browsers do not give you such flexibility. They can do much more than browser-less spiders (render DOM changes, execute JavaScript, establish WebSocket connections), but usually you cannot control how they load static assets, queue such requests, or apply additional logic specific to web scraping. For example, it is hard to force a browser to exclude files based on file extensions, ignore certain paths, provide HTTP Basic Auth credentials, modify the cookie jar on the fly, or append special headers. Browsers were created to show resources to end users, and their APIs are quite limited.

Smart Proxy Manager is an HTTP proxy that supports the Proxy Authorization protocol and is configured via special X-Headers. This works well for browser-less spiders, which usually have a straightforward way of using the service, but it is really tricky to configure headless browsers to use Smart Proxy Manager.
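For contrast, here is a hedged sketch of how a browser-less client would use Smart Proxy Manager directly with Python's standard library. The endpoint host/port and the X-Crawlera-Profile header are assumptions for illustration; check your account page and the official documentation for the actual values.

```python
import urllib.request

# Placeholder key and assumed endpoint -- substitute your real values.
APIKEY = "MYAPIKEY"
proxy_url = f"http://{APIKEY}:@proxy.zyte.com:8011"

# Proxy Authorization: the API key travels as the proxy username.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
)
# Special X-Headers tune Smart Proxy Manager behaviour per request.
opener.addheaders = [("X-Crawlera-Profile", "desktop")]
# body = opener.open("http://example.com").read()  # network call omitted here
```

A headless browser exposes no comparable hooks for proxy credentials and custom headers, which is exactly the gap the tool below fills.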

Smart Proxy Manager knows how to work with browser workloads, but to simplify configuration we recommend using crawlera-headless-proxy, a self-hosted complementary tool that separates the proxy interface from the Smart Proxy Manager configuration.

This tutorial will cover crawlera-headless-proxy installation, usage, and configuration. We also provide examples of how to use some headless browsers with this tool.


crawlera-headless-proxy is a complementary proxy which is distributed as a statically linked binary. This tool was created to be a self-hosted service which you deploy alongside the grid of your headless browsers.

The main idea is to delegate the Smart Proxy Manager configuration to headless-proxy, which then exposes the simplest possible HTTP proxy interface to a headless browser, so users do not have to worry about propagating Smart Proxy Manager settings to the browser itself.

Also, crawlera-headless-proxy provides a number of common features which are usually required for web scraping.

Please be aware that crawlera-headless-proxy performs TLS man-in-the-middle (MITM) interception. Unfortunately, there is no way to avoid this: the proxy has to hijack secure requests in order to append headers to them or filter them by an adblock list. You can use the embedded TLS keys or provide your own.
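In practice this means any client sitting behind crawlera-headless-proxy must either trust its CA certificate or skip TLS verification. A minimal, hedged Python sketch (the localhost:3128 address is an assumption):

```python
import ssl
import urllib.request

# The MITM proxy re-signs certificates, so normal verification fails;
# here we disable it (alternatively, add the proxy's CA to your trust store).
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

proxy = urllib.request.ProxyHandler({
    "http": "http://localhost:3128",   # assumed headless-proxy address
    "https": "http://localhost:3128",
})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPSHandler(context=ctx))
# html = opener.open("https://example.com").read()  # network call omitted here
```

The headless browser examples later in this tutorial do the equivalent with options like ignoreHTTPSErrors or acceptSslCerts.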


Source code of this tool is available on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy

It is also available on Docker Hub: https://hub.docker.com/r/scrapinghub/crawlera-headless-proxy/

Pre-made binaries

crawlera-headless-proxy is distributed as a statically linked binary with no runtime dependencies. To obtain the latest version, please check the Releases page on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy/releases

If you use macOS and Homebrew, you can install it with

$ brew install https://raw.githubusercontent.com/scrapinghub/crawlera-headless-proxy/master/crawlera-headless-proxy.rb

Also, there is a Docker image with headless-proxy. To obtain it, please execute the following command:

$ docker pull scrapinghub/crawlera-headless-proxy

Installation from source code

To install from sources, please refer to the official README on https://github.com/scrapinghub/crawlera-headless-proxy.


crawlera-headless-proxy can be configured in a number of ways:

  1. Config file (in TOML)
  2. Command line parameters
  3. Environment variables

You can find a comprehensive example with all options here: https://github.com/scrapinghub/crawlera-headless-proxy/blob/master/config.toml

For all options, their meaning and configuration details please see official README: https://github.com/scrapinghub/crawlera-headless-proxy.
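As an illustration, a minimal config file might look like the sketch below. The key names follow the linked example config, but please double-check them against the current README before relying on them.

```toml
# Address the proxy listens on for connections from headless browsers
bind_ip = "0.0.0.0"
bind_port = 3128

# Smart Proxy Manager API key
api_key = "MYAPIKEY"
```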


crawlera-headless-proxy provides sensible defaults, so a minimal example is

$ crawlera-headless-proxy -a MYAPIKEY

where MYAPIKEY is the API key for Smart Proxy Manager, which you can find on your Smart Proxy Manager account page.

The same example with docker:

$ docker run --rm -it --name crawlera-headless-proxy scrapinghub/crawlera-headless-proxy -a MYAPIKEY

Or, if you want to propagate the API key with an environment variable:

$ docker run --rm -it --name crawlera-headless-proxy -e CRAWLERA_HEADLESS_APIKEY=MYAPIKEY scrapinghub/crawlera-headless-proxy

If you prefer to use only a configuration file, run the tool with the following command line:

$ crawlera-headless-proxy -c /path/to/my/config/file.toml

If you want to use Docker, please mount the file to /config.toml within the container. Example:

$ docker run --rm -it --name crawlera-headless-proxy -v /path/to/my/config/file.toml:/config.toml:ro scrapinghub/crawlera-headless-proxy

Headless browser options

There are several ways to use headless browsers. The most common are Selenium and Puppeteer; another great option is Splash. Choose whichever you prefer.

You can find examples of how to use headless browsers with Smart Proxy Manager and crawlera-headless-proxy in examples directory: https://github.com/scrapinghub/crawlera-headless-proxy/tree/master/examples

Let’s assume that you have crawlera-headless-proxy up and running. For the sake of simplicity, let’s assume it is accessible at localhost on port 3128.


Splash is an open-source project by Zyte which provides an HTTP API to WebKit. It can execute stateless browser automation scripts written in Lua.

If you want to render HTML using Splash with the render.html endpoint, just pass the proxy argument (proxy=) along with the other parameters. If you need to use Smart Proxy Manager conditionally, you need a custom Lua script. Please find the simplest example below:

function main(splash, args)
  if args.proxy_host ~= nil and args.proxy_port ~= nil then
    splash:on_request(function(request)
      request:set_proxy{host = args.proxy_host, port = args.proxy_port}
    end)
  end
  assert(splash:go(args.url))
  splash:set_result_content_type("text/html; charset=utf-8")
  return splash:html()
end

This will activate Smart Proxy Manager for the request if you propagate the proxy_host and proxy_port parameters to the execute endpoint.
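For example, the execute endpoint could be called from Python like this. This is a hedged sketch: the Splash address localhost:8050 and the headless-proxy address localhost:3128 are assumptions.

```python
import json
import urllib.request

lua_script = "function main(splash, args) ... end"  # the Lua script shown above

# Splash receives extra JSON fields as entries of its `args` table,
# so proxy_host/proxy_port end up available to the script.
payload = {
    "lua_source": lua_script,
    "url": "https://example.com",
    "proxy_host": "localhost",  # assumed headless-proxy address
    "proxy_port": 3128,
}
request = urllib.request.Request(
    "http://localhost:8050/execute",  # assumed Splash address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# html = urllib.request.urlopen(request).read()  # network call omitted here
```

Omitting proxy_host and proxy_port from the payload makes the same script fetch the page without Smart Proxy Manager, which is the conditional behaviour described above.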


Puppeteer is an official project which provides a Node.js API for headless Chrome. It is enough to pass the proxy address when launching the browser:

const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: ["--proxy-server=http://localhost:3128"]
});


Pyppeteer is an unofficial port of Puppeteer to Python. The API is quite similar to the JS one:

browser = await pyppeteer.launch(
    ignoreHTTPSErrors=True,
    args=["--proxy-server=http://localhost:3128"]
)


Selenium is a browser automation project which uses the WebDriver API. Almost all browsers implement this API, so you are not limited to a single one. A browser is configured through Selenium capabilities. Here is an example of how to configure Chrome with Selenium Grid:

from selenium import webdriver

profile = webdriver.DesiredCapabilities.CHROME.copy()
profile["proxy"] = {
  "httpProxy": "localhost:3128",
  "sslProxy": "localhost:3128",
  "noProxy": None,
  "proxyType": "MANUAL",
  "class": "org.openqa.selenium.Proxy",
  "autodetect": False
}
profile["acceptSslCerts"] = True
driver = webdriver.Remote("http://localhost:4444/wd/hub", profile)

As you can see, the configuration is quite similar to the other headless browsers. You just need to propagate HTTP proxy settings.

Integration with Scrapy

If you want to integrate Selenium with Scrapy, please use the scrapy-selenium plugin: https://github.com/scrapy-plugins/scrapy-selenium

To install it with pip, please run the following command:

$ pip install -e git+https://github.com/scrapy-plugins/scrapy-selenium.git#egg=scrapy-selenium

Update your settings.py with the following lines:

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
SELENIUM_GRID_URL = 'http://localhost:4444/wd/hub'  # Example for local grid with docker-compose
SELENIUM_NODES = 3  # Number of nodes(browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME  # Example for Chrome
SELENIUM_PROXY = 'http://proxy.url:port'
# You also need to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium.SeleniumDownloadHandler",
    "https": "scrapy_selenium.SeleniumDownloadHandler",
}

Example of a spider which uses scrapy_selenium:

from scrapy import Spider, Request
from scrapy_selenium import SeleniumRequest


class SomeSpider(Spider):
    def parse(self, response):
        # This will be handled just like any ordinary Scrapy request
        yield Request(url, callback=self.some_parser)

    def some_parser(self, response):
        # This will be handled by Selenium Grid
        yield SeleniumRequest(some_url, callback=self.other_parser,
                              driver_callback=self.process_webdriver)

    def process_webdriver(self, driver):
        # The live webdriver instance is available here (cookies, screenshots, ...)
        pass