
Scrapy + Crawlera concurrent requests

I would like to know if it is possible to crawl https pages using Scrapy + Crawlera. So far I have been using Python requests with the following settings:


    import requests

    # Crawlera proxy endpoint; the API key is used as the proxy username
    proxy_host = 'proxy.crawlera.com'
    proxy_port = '8010'
    proxy_auth = 'MY_KEY'

    proxies = {
        "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
        "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    }

    # Crawlera's CA certificate, used to verify the proxied https responses
    ca_cert = 'crawlera-ca.crt'

    res = requests.get(
        url='https://www.google.com/',
        proxies=proxies,
        verify=ca_cert,
    )


I want to move to asynchronous execution via Scrapy. I know there is the [scrapy-crawlera](https://github.com/scrapy-plugins/scrapy-crawlera) plugin, but I do not know how to configure it when I have the certificate.


Also, one thing bothers me: Crawlera comes with different pricing plans. The basic one is C10, which allows for 10 concurrent requests. What does that mean? Do I need to set `CONCURRENT_REQUESTS=10` in settings.py?



Best Answer

The only thing to configure is the scrapy-crawlera settings in your settings.py: https://scrapy-crawlera.readthedocs.io/en/v1.4.0/settings.html.

The certificate is not needed with Scrapy, because it doesn't employ the CONNECT method.
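
For example, a minimal settings.py sketch (the middleware path and setting names come from the scrapy-crawlera docs linked above; `MY_KEY` is a placeholder for your own API key):

    # settings.py -- enable the Crawlera middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }

    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = 'MY_KEY'

With that in place your spider's requests are routed through Crawlera automatically, with no `verify=`/certificate handling needed.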


And yes, the pricing plan means exactly that: set `CONCURRENT_REQUESTS` in settings.py to the number of concurrent requests your plan allows.
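
So on a C10 plan it would look something like this (treat the exact values as a sketch; disabling AutoThrottle reflects the common recommendation when routing through Crawlera):

    # settings.py -- match Scrapy's concurrency to a C10 plan
    CONCURRENT_REQUESTS = 10
    CONCURRENT_REQUESTS_PER_DOMAIN = 10

    # AutoThrottle adds its own delays on top of Crawlera's throttling,
    # so it is usually disabled when the middleware is enabled
    AUTOTHROTTLE_ENABLED = False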
