Hello nestor, I have checked the API with a friend: after changing ab9dexxxx to ad9dexxx, requests go through normally.
cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.143...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.143) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< access-control-allow-credentials: true
< access-control-allow-origin: *
< Connection: close
< content-length: 34
< content-type: application/json
< date: Thu, 24 Jan 2019 08:28:46 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< server: gunicorn/19.9.0
< X-Crawlera-Slave: 192.186.134.112:4444
< X-Crawlera-Version: 1.34.6-14c425
<
{
"origin": "192.186.134.112"
}
* Closing connection 0
Now the question is how to make HTTPS and POST requests work in Scrapy. I have installed crawlera-ca.crt on my machine.
Scrapy doesn't need the certificate at all.
HTTPS with Scrapy works as normal, and POST requests are no different:
curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "https://httpbin.org/post" -X POST -d "some=data"
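For reference, the Scrapy equivalent of that curl POST is a plain FormRequest; a minimal sketch (the spider name and the some=data payload are placeholders mirroring the curl command):

import scrapy


class HttpbinPostSpider(scrapy.Spider):
    # Hypothetical spider, just to mirror the curl example above
    name = 'httpbin_post'

    def start_requests(self):
        # FormRequest defaults to the POST method and form-encodes
        # the body, matching curl's -d "some=data"
        yield scrapy.FormRequest(
            'https://httpbin.org/post',
            formdata={'some': 'data'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)

With the Crawlera middleware enabled in the settings, this is routed through the proxy like any other request; no certificate configuration is needed.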
This is my spider code:
import re

import scrapy


class TwitterSpiderSpider(scrapy.Spider):
    name = 'twitter_spider'
    allowed_domains = ['twitter.com']
    start_urls = ['https://twitter.com']

    def start_requests(self):
        url = 'https://twitter.com/account/begin_password_reset?account_identifier=starksaya'
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        token = response.xpath("//input[@type='hidden']/@value").extract_first()
        print(token)
        print("&" * 100)
        re_name = re.match(r".*account_identifier=(.*)", response.url)
        if re_name:
            name = re_name.group(1)
            post_data = {
                "authenticity_token": token,
                "account_identifier": name,
            }
            yield scrapy.FormRequest(
                "https://twitter.com/account/begin_password_reset",
                formdata=post_data,
                callback=self.parse_detail,
                dont_filter=True,
            )

    def parse_detail(self, response):
        print(response.text)
My settings file:
DOWNLOAD_DELAY = 5
COOKIES_ENABLED = True
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'ad9defxxxxxxxxxxxx'
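If you want to confirm that a request actually went through Crawlera, one option is to check for the X-Crawlera-Version response header, which is visible in the curl output above; a sketch of such a check inside a callback:

def parse(self, response):
    # Assumption: if Crawlera handled the request, this header is set,
    # as seen in the curl output above; None suggests the middleware
    # was bypassed
    print(response.headers.get('X-Crawlera-Version'))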
When I start the crawler, nothing is crawled; the requests just hang until they time out.
But I can crawl normally with other proxy services.
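(For comparison, other proxy services are usually wired in through Scrapy's built-in HttpProxyMiddleware via the meta['proxy'] key; a sketch with placeholder credentials:)

import scrapy


class OtherProxySpider(scrapy.Spider):
    # Hypothetical spider showing the usual way a generic proxy
    # service is configured, via Scrapy's built-in HttpProxyMiddleware
    name = 'other_proxy'

    def start_requests(self):
        yield scrapy.Request(
            'https://httpbin.org/ip',
            # Placeholder proxy URL and credentials
            meta={'proxy': 'http://user:pass@other-proxy.example.com:8000'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)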
What is wrong with my configuration?
This problem has plagued me for several days.
I hope you can help me solve it.
Remove the X-Crawlera-Use-HTTPS header; that header is deprecated.
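If that header is being set somewhere in your project, removing it would look like this; a sketch assuming it was added through DEFAULT_REQUEST_HEADERS in settings.py (it could equally be set per-request via the headers argument of a Request):

# settings.py -- assumption: the deprecated header was added here
DEFAULT_REQUEST_HEADERS = {
    # 'X-Crawlera-Use-HTTPS': '1',  # deprecated; delete this entry
}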
ArjunPython