Cannot connect to proxy: ConnectionResetError(10054, 'xxxxxx', None, 10054, None)

Posted over 5 years ago by ArjunPython


import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
}
url = "https://twitter.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"  # API key as the username, empty password

proxies = {
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

r = requests.get(url, proxies=proxies, headers=headers)
print(r)

print("""
Requesting [{}]
through proxy [{}]

Request Headers:
{}

Response Time: {}
Response Code: {}
Response Headers:
{}

Response Body:
{}
""".format(url, proxy_host, r.request.headers, r.elapsed.total_seconds(),
           r.status_code, r.headers, r.text))
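
(For HTTPS URLs fetched this way, certificate handling comes up later in this thread: either point verify at the Crawlera CA certificate or disable verification while testing. A minimal sketch reusing the variables above; the certificate path is a placeholder:)

# Sketch only: verify the HTTPS response against the Crawlera CA certificate...
r = requests.get(url, proxies=proxies, headers=headers,
                 verify="/path/to/crawlera-ca.crt")   # placeholder path
# ...or, for a quick test only, skip certificate verification.
r = requests.get(url, proxies=proxies, headers=headers, verify=False)
print(r.status_code)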



18 Comments


nestor posted over 5 years ago Admin

Remove the use-https header; that header is deprecated.
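
(A minimal sketch of what that looks like in the requests code from the original post, assuming the header in question is X-Crawlera-Use-HTTPS; the API key is a placeholder:)

import requests

proxies = {"http": "http://<APIKEY>:@proxy.crawlera.com:8010/",
           "https": "https://<APIKEY>:@proxy.crawlera.com:8010/"}

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
    # 'X-Crawlera-Use-HTTPS': '1',   # deprecated -- drop this header entirely
}

# Request the https:// URL directly; the proxy handles HTTPS without the header.
r = requests.get("https://twitter.com", headers=headers, proxies=proxies)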


ArjunPython posted over 5 years ago

   This is my spider code:

import re
import scrapy

class TwitterSpiderSpider(scrapy.Spider):
    name = 'twitter_spider'
    allowed_domains = ['twitter.com']
    start_urls = ['https://twitter.com']
    def start_requests(self):
        url = 'https://twitter.com/account/begin_password_reset?account_identifier=starksaya'
        yield scrapy.Request(url,callback=self.parse,
                             dont_filter=True)
    def parse(self, response):
        token = response.xpath("//input[@type='hidden']/@value").extract_first()
        print(token)
        print("&"*100)
        re_name = re.match(r".*account_identifier=(.*)", response.url)
        if re_name:
            name = re_name.group(1)
            post_data = {
                "authenticity_token": token,
                "account_identifier": name
            }
            yield scrapy.FormRequest(
                "https://twitter.com/account/begin_password_reset",
                formdata=post_data,
                callback=self.parse_detail,
                dont_filter=True)
    def parse_detail(self,response):
        print(response.text)

    Settings:

DOWNLOAD_DELAY = 5
COOKIES_ENABLED = True
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'ad9defxxxxxxxxxxxx'

When I start the crawler, nothing is crawled until the request times out.


But I can crawl normally with other proxy services.

Where is my configuration wrong?

This problem has plagued me for several days.

I hope you can help me solve this problem.



nestor posted over 5 years ago Admin

Scrapy doesn't need the certificate at all.

HTTPS requests through Crawlera work as normal in Scrapy, and POST requests are no different. For example:


curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "https://httpbin.org/post" -X POST -d "some=data"
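
(Roughly the same POST through Crawlera with python-requests, for comparison: a minimal sketch, where the API key and the certificate path are placeholders.)

import requests

proxies = {"http": "http://<APIKEY>:@proxy.crawlera.com:8010/",
           "https": "https://<APIKEY>:@proxy.crawlera.com:8010/"}

# POST a small form body through the proxy, verifying against the Crawlera CA
# certificate (or pass verify=False while testing).
r = requests.post("https://httpbin.org/post",
                  data={"some": "data"},
                  proxies=proxies,
                  verify="/path/to/crawlera-ca.crt")
print(r.status_code, r.text)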


ArjunPython posted over 5 years ago

Hello nestor, I checked the API key with my friend; after changing ab9dexxxx to ad9dexxx, requests go through normally.

cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.143...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.143) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< access-control-allow-credentials: true
< access-control-allow-origin: *
< Connection: close
< content-length: 34
< content-type: application/json
< date: Thu, 24 Jan 2019 08:28:46 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< server: gunicorn/19.9.0
< X-Crawlera-Slave: 192.186.134.112:4444
< X-Crawlera-Version: 1.34.6-14c425
<
{
  "origin": "192.186.134.112"
}
* Closing connection 0

 

 

Now the question is how to get HTTPS and POST requests working in Scrapy; my machine has crawlera-ca.crt installed.


nestor posted over 5 years ago Admin

Log in to the account that has the subscription and go to Help > Contact Support to open a ticket.


ArjunPython posted over 5 years ago

Well, OK, thank you for the reminder. My friend and I will check the API key again. Rather than on this open forum, is there any messaging software through which I can contact you directly?


nestor posted over 5 years ago Admin

That key doesn't exist; your friend must have changed it, or you have a typo. And please be careful when posting the output of -v:

* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx' and Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxxxxx... this is a public forum.


ArjunPython posted over 5 years ago

cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.175...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.175) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 407 Proxy Authentication Required
< Connection: close
< Content-Length: 0
< Date: Wed, 23 Jan 2019 12:43:06 GMT
* Authentication problem. Ignoring this.
< Proxy-Authenticate: Basic realm="Crawlera"
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< X-Crawlera-Error: bad_proxy_auth
<
* Closing connection 0


nestor posted over 5 years ago Admin

I've deleted that IP, that's yours.

Your curl command is wrong: it's taking -vx as part of the username for authentication.


Try wrapping things separately in quotes:


curl -U "APIKEY:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"


ArjunPython posted over 5 years ago

HTTPS:

cao@cao-python:~$ curl -U ab9defXXXXXXXXXXXXXXXX:-vx proxy.crawlera.com:8010 "https://httpbin.org/ip"
Enter proxy password for user 'ab9defXXXXXXXXXXXXXXXX:-vx':
{
  "origin": "XXXXXXXX"
}



nestor posted over 5 years ago Admin

Can you do a simple curl request from command line?


curl -U APIKEY: -vx proxy.crawlera.com:8010 "http://httpbin.org/ip"


ArjunPython posted over 5 years ago


nestor posted over 5 years ago Admin

Does it happen for any website? Or just for those two?


ArjunPython posted over 5 years ago

Let me describe the problem I encountered. I am a user in China. My virtual machine runs Ubuntu 16.04 with Python 3.5, Scrapy 1.4.0, scrapy-crawlera 1.4.0 and requests 2.18.4. The firewall is off, and port 8010 is allowed on the machine.

Firewall settings:

8010 ALLOW IN Anywhere
8010/tcp ALLOW IN Anywhere
443 ALLOW IN Anywhere
443/tcp ALLOW IN Anywhere
3306 (v6) ALLOW IN Anywhere (v6)
6379 (v6) ALLOW IN Anywhere (v6)
5000 (v6) ALLOW IN Anywhere (v6)
8010 (v6) ALLOW IN Anywhere (v6)
8010/tcp (v6) ALLOW IN Anywhere (v6)
443 (v6) ALLOW IN Anywhere (v6)
443/tcp (v6) ALLOW IN Anywhere (v6)

 

I followed the documentation and made this request:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
}
url = "https://www.google.com/"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"

proxies = {
    "https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
    "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
}

r = requests.get(url, proxies=proxies, headers=headers, verify=False)
print(r)

 

Here is the error output from each request:

 

Traceback (most recent call last):
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connection.py", line 294, in connect
    self._tunnel()
  File "/usr/lib/python3.5/http/client.py", line 827, in _tunnel
    (version, code, message) = response._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cao/Desktop/wang_dai_spider/demo_1.py", line 20, in <module>
    r = requests.get(url, proxies=proxies,headers=headers,verify=False)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))

 

Here is the Scrapy configuration:

DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_USER = 'APIKEY'
CRAWLERA_PASS = 'xxxx'
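
(Aside: scrapy-crawlera also accepts the key directly via CRAWLERA_APIKEY, as in the earlier settings profile, instead of CRAWLERA_USER/CRAWLERA_PASS. A minimal settings.py sketch with a placeholder key:)

# settings.py -- minimal sketch (placeholder key)
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<APIKEY>'   # the API key alone; no separate user/password needed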


Here is the crawl log for twitter:

2019-01-23 11:37:05 [scrapy.core.engine] INFO: Spider opened
2019-01-23 11:37:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:37:05 [root] INFO: Using crawlera at http://proxy.crawlera.com:8010 (user: ad9dexxxxxx)
2019-01-23 11:37:05 [root] INFO: CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the crawl.
2019-01-23 11:37:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-23 11:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:39:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:40:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:40:15 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://twitter.com/> (failed 1 times): User timeout caused connection failure: Getting https://twitter.com/ took longer than 190 seconds..
2019-01-23 11:41:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:42:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:43:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:43:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://twitter.com/> (failed 2 times): User timeout caused connection failure: Getting https://twitter.com/ took longer than 190 seconds..



Scrapy eventually gives up with a timeout.

I hope for a reply as soon as possible; thank you very much.




nestor posted over 5 years ago Admin

Have you checked the port like I mentioned?


ArjunPython posted over 5 years ago

Yes. I haven't subscribed to Crawlera yet; I am currently testing with my friend's account to see whether I can crawl twitter, but why do I keep getting proxy errors?


nestor posted over 5 years ago Admin

I don't see a Crawlera subscription on your account.

In any case, it seems like you might need to open port 8010 on your machine.
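
(A quick way to check whether the machine can reach the proxy port at all; a minimal sketch, nothing Crawlera-specific:)

import socket

# Try a plain TCP connection to the proxy endpoint. If this raises
# (timeout, connection refused/reset), something between this machine and
# proxy.crawlera.com:8010 is blocking the traffic.
try:
    with socket.create_connection(("proxy.crawlera.com", 8010), timeout=10):
        print("TCP connection to proxy.crawlera.com:8010 OK")
except OSError as exc:
    print("Cannot reach proxy.crawlera.com:8010:", exc)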


ArjunPython posted over 5 years ago

python 3.6.3

requests 2.20
