Using Zyte Smart Proxy Manager (formerly Crawlera) with Splash is possible, but you have to keep a few things in mind before integrating them.


Unlike a standard proxy, Smart Proxy Manager is designed for crawling: it throttles request speed to avoid users getting banned or imposing too much load on websites. This throttling means Splash will be slower when used together with Smart Proxy Manager.


When you render a web page in a browser (like Splash), you typically have to download many resources (images, CSS stylesheets, JavaScript code, etc.), and each resource is fetched by a separate request against the site. Smart Proxy Manager throttles each of these requests individually, so the load time of the page can increase dramatically.


To keep page loads from becoming too slow, you should avoid unnecessary requests. You can do so by:

  • Disabling images in Splash
  • Blocking requests to advertisement and tracking domains
  • Not using Smart Proxy Manager for subresource requests when not necessary (for example, you probably don't need Smart Proxy Manager to fetch jQuery from a static CDN)
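The first two optimizations can be done directly in a Splash Lua script. Here is a minimal sketch: it disables image loading via Splash's images_enabled property and aborts requests to tracking domains using splash:on_request (the domain patterns below are just illustrations, not a real blocklist):

```lua
function main(splash, args)
  -- don't download images at all
  splash.images_enabled = false

  -- Lua patterns for domains we want to block (illustrative examples)
  local blocked_domains = {'doubleclick%.net', 'google%-analytics%.com'}

  splash:on_request(function(request)
    -- abort requests to ad/tracking domains before they are sent
    for _, domain in ipairs(blocked_domains) do
      if string.find(request.url, domain) then
        request:abort()
        break
      end
    end
  end)

  assert(splash:go(args.url))
  return splash:html()
end
```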


How to integrate them

To make Splash and Smart Proxy Manager work together, you'll need to pass a Lua script similar to this example to Splash's /execute endpoint. This script will configure Splash to use Smart Proxy Manager as a proxy and will also perform some optimizations, such as disabling images and avoiding some sorts of requests. It will also make sure that the Splash requests go through the same IP address, by creating a Smart Proxy Manager session.
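In outline, that script does something like the following. This is a simplified sketch, not the full script: the host, port, and header name follow Smart Proxy Manager's proxy API, and session handling is reduced to a single header (the real script captures the session ID returned by Smart Proxy Manager and reuses it for subsequent requests):

```lua
function main(splash, args)
  local user = args.crawlera_user  -- your Smart Proxy Manager API key

  splash:on_request(function(request)
    -- route every request made during the page load through Smart Proxy Manager
    request:set_proxy{
      host = 'proxy.crawlera.com',
      port = 8010,
      username = user,
      password = ''
    }
    -- ask for a session so requests share the same outgoing IP address
    request:set_header('X-Crawlera-Session', 'create')
  end)

  assert(splash:go(args.url))
  return splash:html()
end
```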


In order to make it work, you have to provide your Smart Proxy Manager API key via the crawlera_user argument of your Splash requests. Or, if you prefer, you can hardcode your API key in the script.
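Outside of Scrapy, you can make the same call directly against Splash's HTTP API. A minimal sketch, assuming Splash runs at localhost:8050 (in a real project you would read the full Lua script from scripts/crawlera.lua; a placeholder string stands in for it here):

```python
import json
import urllib.request

# placeholder for the real Lua script loaded from scripts/crawlera.lua
lua_source = "function main(splash, args) ... end"

payload = {
    'url': 'http://quotes.toscrape.com/js',
    'lua_source': lua_source,
    'crawlera_user': '<YOUR_SMART_PROXY_MANAGER_APIKEY>',  # your API key
}

request = urllib.request.Request(
    'http://localhost:8050/execute',           # Splash's execute endpoint
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
)
# urllib.request.urlopen(request)  # uncomment to send; returns the rendered page
```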


Using Splash + Smart Proxy Manager with Scrapy via scrapy-splash

Let's dive into an example to see how to use Smart Proxy Manager and Splash in a Scrapy spider via scrapy-splash (for the full working example, check this repo).


This is the project structure:

├── scrapy.cfg
├── setup.py
└── splash_crawlera_example
    ├── __init__.py
    ├── settings.py
    ├── scripts
    │   └── crawlera.lua
    └── spiders
        ├── __init__.py
        └── quotes-js.py


A few details about the files listed above:

  • settings.py: contains the configuration for both Smart Proxy Manager and Splash, including the API keys required for authorization. Note that Smart Proxy Manager must be disabled in the settings, since routing requests through it is handled by the Lua script mentioned below.
  • scripts/crawlera.lua: the Lua script that integrates Splash and Smart Proxy Manager.
  • spiders/quotes-js.py: the spider that needs Splash and Smart Proxy Manager for its requests. It loads the Lua script into a string and sends it along with each request.
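For reference, the scrapy-splash part of settings.py looks roughly like this. The middleware entries and dupefilter come from the scrapy-splash documentation; the API key setting names are assumptions that match the spider code shown further down:

```python
# settings.py -- sketch of the scrapy-splash configuration

SPLASH_URL = 'http://localhost:8050'   # address of your Splash instance
SPLASH_APIKEY = ''                     # set if your Splash requires auth
CRAWLERA_APIKEY = ''                   # your Smart Proxy Manager API key

# Note: no Smart Proxy Manager downloader middleware is enabled here --
# proxying is handled inside the Lua script instead.

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```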


In our spider, we load the Lua script into a string in the __init__ method:


self.LUA_SOURCE = pkgutil.get_data(
    'splash_crawlera_example', 'scripts/crawlera.lua'
).decode('utf-8')


Note: to load the script from a file both locally and on Scrapy Cloud, you have to include the Lua script in your package's setup.py file, as shown below:


from setuptools import setup, find_packages

setup(
    name = 'project',
    version = '1.0',
    packages = find_packages(),
    package_data = {'splash_crawlera_example': ['scripts/*.lua',]},
    entry_points = {'scrapy': ['settings = splash_crawlera_example.settings']},
)


Once we have the Lua script loaded in our spider, we pass it as an argument to the SplashRequest objects, along with Smart Proxy Manager's and Splash's credentials (authorization with Splash can also be done via the http_user setting):


yield SplashRequest(
    url='http://quotes.toscrape.com/js',
    endpoint='execute',
    splash_headers={
        'Authorization': basic_auth_header(self.settings['SPLASH_APIKEY'], ''),
    },
    args={
        'lua_source': self.LUA_SOURCE,
        'crawlera_user': self.settings['CRAWLERA_APIKEY'],
    },
    # tell Splash to cache the lua script, to avoid sending it for every request
    cache_args=['lua_source'],
)


And that's it. Now, this request will go through your Splash instance, and Splash will use Smart Proxy Manager as its proxy to download the pages and resources you need.


Customizing the Lua Script

You can go further and customize the Lua script to fit your exact requirements. In the example provided here, we commented out some lines that filter out requests to useless resources or undesired domains. You can uncomment and customize them to your own needs. Check out Splash's official docs to learn more about scripting.


A working example for Scrapy Cloud

You can find a working example of a Scrapy project using Splash and Smart Proxy Manager in this repository. The example is ready to be executed on Scrapy Cloud.