
When I send a request through the Crawlera proxy with the requests library, it fails with:

('Cannot connect to proxy.', ConnectionResetError(10054, 'xxxxxx', None, 10054, None))

This is my code:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
}
url = "https://twitter.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"  # the API key is the username; the trailing ':' means an empty password

proxies = {
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

r = requests.get(url, proxies=proxies, headers=headers)
print(r)

print("""
Requesting [{}]
through proxy [{}]

Request Headers:
{}

Response Time: {}
Response Code: {}
Response Headers:
{}

Response Body:
{}
""".format(url, proxy_host, r.request.headers, r.elapsed.total_seconds(),
           r.status_code, r.headers, r.text))
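(Aside: a variant worth trying is to point both the "http" and "https" keys at the proxy over plain HTTP, so the TLS tunnel to the target site is opened through the proxy with CONNECT, and to verify against the downloaded CA file. This is only a sketch; the crawlera-ca.crt path is an assumption.)

import requests

# Sketch: both schemes use the proxy over plain HTTP; requests/urllib3 then
# tunnels HTTPS traffic through it with CONNECT.
proxy_url = "http://{}@{}:{}/".format("<APIKEY>:", "proxy.crawlera.com", "8010")
proxies = {"http": proxy_url, "https": proxy_url}

# The CA bundle path is an assumption -- point it at your crawlera-ca.crt.
r = requests.get("https://twitter.com", proxies=proxies,
                 verify="/path/to/crawlera-ca.crt")
print(r.status_code)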

Hello nestor, I have checked the API key with friends: after changing ab9dexxxx to ad9dexxx, requests go through normally with curl:

cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.143...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.143) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< access-control-allow-credentials: true
< access-control-allow-origin: *
< Connection: close
< content-length: 34
< content-type: application/json
< date: Thu, 24 Jan 2019 08:28:46 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< server: gunicorn/19.9.0
< X-Crawlera-Slave: 192.186.134.112:4444
< X-Crawlera-Version: 1.34.6-14c425
<
{
  "origin": "192.186.134.112"
}
* Closing connection 0

Now the question is how to make HTTPS and POST requests work in Scrapy. I have installed crawlera-ca.crt on my machine.

Scrapy doesn't need the certificate at all.

HTTPS with Scrapy works as normal, and POST requests should be no different:


curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "https://httpbin.org/post" -X POST -d "some=data"
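
For completeness, the same HTTPS POST through the proxy with Python requests would look roughly like this (a sketch; the CA bundle path is an assumption):

import requests

proxy_url = "http://<APIKEY>:@proxy.crawlera.com:8010/"
proxies = {"http": proxy_url, "https": proxy_url}

# Equivalent of the curl command above; verify points at the downloaded
# crawlera-ca.crt (the path is an assumption).
r = requests.post("https://httpbin.org/post",
                  data={"some": "data"},
                  proxies=proxies,
                  verify="/path/to/crawlera-ca.crt")
print(r.status_code, r.text)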

This is my spider code:

import re
import scrapy

class TwitterSpiderSpider(scrapy.Spider):
    name = 'twitter_spider'
    allowed_domains = ['twitter.com']
    start_urls = ['https://twitter.com']

    def start_requests(self):
        url = 'https://twitter.com/account/begin_password_reset?account_identifier=starksaya'
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Grab the hidden CSRF token from the password-reset form.
        token = response.xpath("//input[@type='hidden']/@value").extract_first()
        print(token)
        print("&" * 100)
        re_name = re.match(r".*account_identifier=(.*)", response.url)
        if re_name:
            name = re_name.group(1)
            post_data = {
                "authenticity_token": token,
                "account_identifier": name,
            }
            # FormRequest sends a POST by default.
            yield scrapy.FormRequest(
                "https://twitter.com/account/begin_password_reset",
                formdata=post_data,
                callback=self.parse_detail,
                dont_filter=True,
            )

    def parse_detail(self, response):
        print(response.text)
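
(As a side note, FormRequest.from_response can fill in the hidden authenticity_token automatically; a sketch of that variant, assuming the reset page contains a single form -- spider name and structure here are hypothetical, not the original code:)

import scrapy

class ResetSketchSpider(scrapy.Spider):
    # Hypothetical variant of the spider above.
    name = 'reset_sketch'
    allowed_domains = ['twitter.com']

    def start_requests(self):
        url = ('https://twitter.com/account/begin_password_reset'
               '?account_identifier=starksaya')
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # from_response copies the form's hidden inputs (including
        # authenticity_token), so only the visible field is supplied here.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'account_identifier': 'starksaya'},
            callback=self.parse_detail,
            dont_filter=True,
        )

    def parse_detail(self, response):
        print(response.text)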

My settings file:

DOWNLOAD_DELAY = 5
COOKIES_ENABLED = True
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'ad9defxxxxxxxxxxxx'
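
For comparison, a settings sketch along the lines usually recommended for Crawlera when requests seem to hang. The exact numbers are assumptions, not values taken from my project:

# Sketch only: typical adjustments when running behind Crawlera.
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 610}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'ad9defxxxxxxxxxxxx'

# Crawlera responses can be slow, so a short DOWNLOAD_TIMEOUT makes every
# request look like it hangs until timeout. These numbers are assumptions.
DOWNLOAD_TIMEOUT = 600
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_DELAY = 0  # Crawlera does its own throttling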

Then I start the crawler:

[screenshot of the crawler log]


No pages get crawled; every request just hangs until the download timeout is reached.


But I can crawl normally with other proxy services.

Where is my configuration wrong?

This problem has plagued me for several days; I hope you can help me solve it.


Remove the use-https header (X-Crawlera-Use-HTTPS); that header is deprecated.
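
If you are not sure whether that header is still being sent, a quick check in a callback (a sketch, added to the existing spider):

def parse(self, response):
    # response.request.headers shows what Scrapy actually sent, so you can
    # confirm X-Crawlera-Use-HTTPS is no longer among the request headers.
    print(response.request.headers)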
