
Scrapy + Crawlera concurrent requests

I would like to know if it is possible to crawl https pages using Scrapy + Crawlera. So far I have been using Python requests with the following settings:


    import requests

    # Crawlera proxy endpoint; the API key is used as the proxy username
    proxy_host = 'proxy.crawlera.com'
    proxy_port = '8010'
    proxy_auth = 'MY_KEY'

    proxies = {
        "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
        "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    }

    # Crawlera's CA certificate, used to verify the proxied https responses
    ca_cert = 'crawlera-ca.crt'

    res = requests.get(
        url='https://www.google.com/',
        proxies=proxies,
        verify=ca_cert,
    )


I want to move to asynchronous execution via Scrapy. I know there is the [scrapy-crawlera](https://github.com/scrapy-plugins/scrapy-crawlera) plugin, but I do not know how to configure it when I have the certificate.


Also, one thing bothers me: Crawlera comes with different pricing plans. The basic one is C10, which allows for 10 concurrent requests. What does that mean? Do I need to set `CONCURRENT_REQUESTS=10` in settings.py?



Best Answer

The only thing to configure is the scrapy-crawlera settings in your settings.py: https://scrapy-crawlera.readthedocs.io/en/v1.4.0/settings.html.

The certificate is not needed with Scrapy, because it doesn't employ the CONNECT method.
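
For example, a minimal settings.py sketch (the middleware path and setting names come from the scrapy-crawlera docs linked above; `MY_KEY` is a placeholder for your own API key):

    # settings.py -- enable the Crawlera middleware
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }

    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = 'MY_KEY'

With that in place your spider's requests are routed through Crawlera automatically, with no `verify=`/certificate handling needed.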


And yes, the pricing plan means exactly that: set `CONCURRENT_REQUESTS` in settings.py to the number of concurrent requests your plan allows.
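
So on a C10 plan it would look something like this (treat the exact values as a sketch; disabling AutoThrottle reflects the common recommendation when routing through Crawlera):

    # settings.py -- match Scrapy's concurrency to a C10 plan
    CONCURRENT_REQUESTS = 10
    CONCURRENT_REQUESTS_PER_DOMAIN = 10

    # AutoThrottle adds its own delays on top of Crawlera's throttling,
    # so it is usually disabled when the middleware is enabled
    AUTOTHROTTLE_ENABLED = False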
