Overview
Crawling with a headless browser is different from traditional approaches. Conventional spiders give you control over the requests and the sequences of requests. Browser-less spiders integrate the network/transport level into the framework: roughly speaking, they program the transport level to traverse a website in a specific way. You control how they access static assets (if you need them at all) and how they load resources that supplement the plain HTML response (please note that requests to static assets count towards the request quota of each Zyte Smart Proxy Manager (formerly Crawlera) plan, unless they are explicitly filtered out in the spider).
Browsers do not give you such flexibility. They can do much more than browser-less spiders (render DOM changes, execute JavaScript, establish WebSocket connections), but you usually cannot control how they load static assets, queue those requests, or apply additional logic specific to web scraping. For example, it is hard to force a browser to exclude files based on their extensions, ignore certain paths, provide HTTP Basic Auth credentials, modify the cookie jar on the fly, or append special headers. Browsers were created to show resources to end users, and their APIs are quite limited.
Smart Proxy Manager is an HTTP proxy that supports proxy authorization and is configured through special X-Headers. This works well for browser-less spiders, which usually offer a straightforward way of using the service, but it is really tricky to configure headless browsers to use Smart Proxy Manager.
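For comparison, here is a minimal sketch of the browser-less case with Python requests, where the spider itself sets proxy authorization and an X-Header. The endpoint (proxy.zyte.com:8011) and the X-Crawlera-Profile header are assumptions about a typical Smart Proxy Manager setup; check your account documentation for the exact values.

import requests

API_KEY = "MYAPIKEY"  # your Smart Proxy Manager API key
# Proxy authorization: the API key is used as the username, with an empty password.
proxy = f"http://{API_KEY}:@proxy.zyte.com:8011"  # assumed endpoint

response = requests.get(
    "https://example.com",
    proxies={"http": proxy, "https": proxy},
    headers={"X-Crawlera-Profile": "desktop"},  # example X-Header (assumption)
    verify=False,  # or point `verify` at the Smart Proxy Manager CA certificate
)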
Smart Proxy Manager knows how to work with browser workloads, but to simplify configuration we recommend using crawlera-headless-proxy, a self-hosted complementary tool that separates the proxy interface from Smart Proxy Manager configuration.
This tutorial covers crawlera-headless-proxy installation, usage, and configuration. We also provide examples of how to use some popular headless browsers with this tool.
crawlera-headless-proxy
crawlera-headless-proxy is a complementary proxy distributed as a statically linked binary. It was created as a self-hosted service that you run alongside your grid of headless browsers.
The main idea is to delegate Smart Proxy Manager configuration to crawlera-headless-proxy, which in turn exposes the simplest possible HTTP proxy interface to a headless browser, so you do not have to worry about propagating Smart Proxy Manager settings to the browser itself.
crawlera-headless-proxy also provides a set of common features that are usually required for web scraping.
Please note that crawlera-headless-proxy performs TLS man-in-the-middle (MITM) interception. Unfortunately, there is no way around this: the proxy has to hijack secure requests in order to append headers to them or filter them with an adblock list. You can use the embedded TLS keys or provide your own.
Installation
The source code of this tool is available on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy
It is also available on Docker Hub: https://hub.docker.com/r/scrapinghub/crawlera-headless-proxy/
Pre-made binaries
crawlera-headless-proxy is distributed as a statically linked binary with no runtime dependencies. To obtain the latest version, please check the Releases page on GitHub: https://github.com/scrapinghub/crawlera-headless-proxy/releases
If you use macOS and Homebrew, you can install it with:
$ brew install https://raw.githubusercontent.com/scrapinghub/crawlera-headless-proxy/master/crawlera-headless-proxy.rb
There is also a Docker image with crawlera-headless-proxy. To obtain it, please execute the following command:
$ docker pull scrapinghub/crawlera-headless-proxy
Installation from source code
To install from sources, please refer to the official README on https://github.com/scrapinghub/crawlera-headless-proxy.
Configuration
crawlera-headless-proxy can be configured in a number of ways:
- Config file (in TOML)
- Command line parameters
- Environment variables
You can find a comprehensive example with all options here: https://github.com/scrapinghub/crawlera-headless-proxy/blob/master/config.toml
For all options, their meaning, and configuration details, please see the official README: https://github.com/scrapinghub/crawlera-headless-proxy.
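As a quick illustration of the environment-variable approach, you can export the API key before starting the proxy (CRAWLERA_HEADLESS_APIKEY is the same variable used in the Docker example below; the remaining variables are listed in the README):

$ export CRAWLERA_HEADLESS_APIKEY=MYAPIKEY
$ crawlera-headless-proxy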
Usage
crawlera-headless-proxy provides sensible defaults, so the minimal example is:
$ crawlera-headless-proxy -a MYAPIKEY
where MYAPIKEY is an API key for Smart Proxy Manager, which you can find on the page of your Smart Proxy Manager user.
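At this point you can verify that the proxy works by sending a request through it. A quick sketch, assuming the proxy listens on localhost port 3128 (the port used throughout this tutorial); -k is needed because of the TLS MITM described above, unless you trust the proxy's certificate:

$ curl -x http://localhost:3128 -k https://example.com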
The same minimal example with Docker:
$ docker run --rm -it --name crawlera-headless-proxy scrapinghub/crawlera-headless-proxy -a MYAPIKEY
Or, if you want to propagate the API key with an environment variable:
$ docker run --rm -it --name crawlera-headless-proxy -e CRAWLERA_HEADLESS_APIKEY=MYAPIKEY scrapinghub/crawlera-headless-proxy
If you prefer to use a configuration file only, please run the tool with the following command line:
$ crawlera-headless-proxy -c /path/to/my/config/file.toml
If you want to use Docker, please mount the file to /config.toml within the container. Example:
$ docker run --rm -it --name crawlera-headless-proxy -v /path/to/my/config/file.toml:/config.toml:ro scrapinghub/crawlera-headless-proxy
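Note that if the proxy runs in Docker while your headless browsers run outside of that container, you also need to publish the proxy port. A sketch, again assuming port 3128:

$ docker run --rm -it --name crawlera-headless-proxy -p 3128:3128 scrapinghub/crawlera-headless-proxy -a MYAPIKEY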
Headless browser options
There are several ways to use headless browsers. The most common options are Selenium and Puppeteer. Another great option is Splash. Please choose whichever approach you prefer.
You can find examples of how to use headless browsers with Smart Proxy Manager and crawlera-headless-proxy in the examples directory: https://github.com/scrapinghub/crawlera-headless-proxy/tree/master/examples
Let’s assume that you have crawlera-headless-proxy up and running. For the sake of simplicity, let’s also assume it is accessible at IP 10.11.12.13 and port 3128.
Splash
Splash is an open-source project by Zyte that provides an HTTP API to WebKit. It can execute stateless browser automation scripts written in Lua.
If you want to render HTML using Splash's render.html endpoint, just pass the proxy parameter (proxy=http://10.11.12.13:3128) along with the other parameters. If you need to use Smart Proxy Manager conditionally, you need a custom Lua script. Please find the simplest example below:
function main(splash, args)
    if args.proxy_host ~= nil and args.proxy_port ~= nil then
        splash:on_request(function(request)
            request:set_proxy{
                host = args.proxy_host,
                port = args.proxy_port,
            }
        end)
    end

    splash:set_result_content_type("text/html; charset=utf-8")
    assert(splash:go(args.url))

    return splash:html()
end
This will activate Smart Proxy Manager for the request if you propagate the proxy_host and proxy_port parameters to the execute endpoint.
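To tie this together, here is a minimal sketch of calling the execute endpoint with the script above via Python requests. The Splash address (http://localhost:8050) and the script file name are assumptions; the extra arguments are exposed to the script as splash.args:

import requests

# The Lua script from the example above, saved to a file (hypothetical name).
lua_source = open("conditional_proxy.lua").read()

response = requests.post(
    "http://localhost:8050/execute",  # assumed Splash address
    json={
        "lua_source": lua_source,
        "url": "https://example.com",
        # Passing these two parameters activates Smart Proxy Manager
        # in the script above.
        "proxy_host": "10.11.12.13",
        "proxy_port": 3128,
    },
)
html = response.text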
Puppeteer
Puppeteer is an official project that provides a Node.js API for headless Chrome. To use it with crawlera-headless-proxy, pass the proxy address when launching the browser:
const browser = await puppeteer.launch({
    ignoreHTTPSErrors: true,
    args: ["--proxy-server=10.11.12.13:3128"]
});
Pyppeteer
Pyppeteer is an unofficial port of Puppeteer to Python. Its API is quite similar to the JavaScript one:
browser = await pyppeteer.launch(
    ignoreHTTPSErrors=True,
    args=["--proxy-server=10.11.12.13:3128"]
)
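For context, here is a minimal self-contained sketch of fetching a page through crawlera-headless-proxy with Pyppeteer; the target URL and the asyncio wrapper are illustrative, not taken from the official examples:

import asyncio

import pyppeteer


async def fetch(url):
    # Launch Chromium through crawlera-headless-proxy (the TLS MITM requires
    # ignoring HTTPS errors unless you install the proxy's certificate).
    browser = await pyppeteer.launch(
        ignoreHTTPSErrors=True,
        args=["--proxy-server=10.11.12.13:3128"],
    )
    try:
        page = await browser.newPage()
        await page.goto(url)
        return await page.content()
    finally:
        await browser.close()


html = asyncio.get_event_loop().run_until_complete(fetch("https://example.com"))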
Selenium
Selenium is a browser automation project that uses the WebDriver API. Almost all browsers implement this API, so you are not limited to a single project. A browser is configured through Selenium capabilities. Here is an example of how to configure Chrome with Selenium Grid:
from selenium import webdriver

CRAWLERA_HEADLESS_PROXY = "10.11.12.13:3128"

profile = webdriver.DesiredCapabilities.CHROME.copy()
profile["proxy"] = {
    "httpProxy": CRAWLERA_HEADLESS_PROXY,
    "ftpProxy": CRAWLERA_HEADLESS_PROXY,
    "sslProxy": CRAWLERA_HEADLESS_PROXY,
    "noProxy": None,
    "proxyType": "MANUAL",
    "class": "org.openqa.selenium.Proxy",
    "autodetect": False
}
profile["acceptSslCerts"] = True

driver = webdriver.Remote("http://localhost:4444/wd/hub", profile)
As you can see, the configuration is quite similar to that of the other headless browsers. You just need to propagate the HTTP proxy settings.
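From there, the remote driver behaves like any other WebDriver; as a quick illustrative usage (the URL is only a placeholder):

driver.get("https://example.com")  # the request goes through crawlera-headless-proxy
html = driver.page_source
driver.quit()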
Integration with Scrapy
If you want to integrate Selenium with Scrapy, please use the scrapy-selenium plugin: https://github.com/scrapy-plugins/scrapy-selenium
To install it with pip, please run the following command:
$ pip install -e git+https://github.com/scrapy-plugins/scrapy-selenium.git#egg=scrapy-selenium
Update your settings.py with the following lines:
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

SELENIUM_GRID_URL = 'http://localhost:4444/wd/hub'  # Example for a local grid with docker-compose
SELENIUM_NODES = 3  # Number of nodes (browsers) you are running on your grid
SELENIUM_CAPABILITIES = DesiredCapabilities.CHROME  # Example for Chrome
SELENIUM_PROXY = 'http://proxy.url:port'

# You also need to change the default download handlers, like so:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_selenium.SeleniumDownloadHandler",
    "https": "scrapy_selenium.SeleniumDownloadHandler",
}
Here is an example of a spider which uses scrapy_selenium:
from scrapy import Spider, Request
from scrapy_selenium import SeleniumRequest


class SomeSpider(Spider):
    ...

    def parse(self, response):
        ...
        # This will be handled just like any Scrapy request
        yield Request(url, callback=self.some_parser)

    def some_parser(self, response):
        ...
        # This will be handled by Selenium Grid
        yield SeleniumRequest(some_url, callback=self.other_parser,
                              driver_callback=self.process_webdriver)

    def process_webdriver(self, driver):
        ...