18 Comments
nestor posted almost 6 years ago Admin
Remove the use-https header; that header is deprecated.
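For illustration, a minimal sketch of what this means on the Scrapy side, assuming "use-https" refers to the deprecated X-Crawlera-Use-HTTPS request header (the spider name and URL are placeholders, not from the posts above):

import scrapy

class UseHttpsSpider(scrapy.Spider):
    name = "use_https_example"  # placeholder name

    def start_requests(self):
        # Deprecated approach: forcing HTTPS through the header.
        # yield scrapy.Request("http://httpbin.org/ip",
        #                      headers={"X-Crawlera-Use-HTTPS": 1})
        # Current approach: simply request the https:// URL directly.
        yield scrapy.Request("https://httpbin.org/ip", callback=self.parse)

    def parse(self, response):
        self.logger.info(response.text)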
ArjunPython posted almost 6 years ago
This is my spider code:
Settings file:
I start the crawler, but no requests are crawled before the request times out.
However, I can crawl normally with other proxy services.
Where is my configuration wrong?
This problem has plagued me for several days.
I hope you can help me solve this problem.
nestor posted almost 6 years ago Admin
Scrapy doesn't need the certificate at all.
HTTPS with Scrapy works as normal, and POST requests should be no different.
curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "https://httpbin.org/post" -X POST -d "some=data"
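For illustration, a rough Scrapy equivalent of that curl POST, as a minimal sketch assuming the scrapy-crawlera middleware is enabled with the same settings shown later in this thread (the spider name and the API key placeholder are illustrative only):

import scrapy

class CrawleraPostSpider(scrapy.Spider):
    name = "crawlera_post_example"  # placeholder name
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {"scrapy_crawlera.CrawleraMiddleware": 300},
        "CRAWLERA_ENABLED": True,
        "CRAWLERA_USER": "<APIKEY>",  # placeholder, as in the settings shown below
        "CRAWLERA_PASS": "",
    }

    def start_requests(self):
        # An HTTPS POST through Crawlera looks like any other Scrapy request;
        # no CA certificate is needed on the Scrapy side.
        yield scrapy.FormRequest(
            "https://httpbin.org/post",
            formdata={"some": "data"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info(response.text)

Run with scrapy runspider, this should return the same httpbin.org/post echo as the curl command above.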
ArjunPython posted almost 6 years ago
Hello nestor, I have checked the API key with my friend; after changing ab9dexxxx to ad9dexxx, requests work normally.
cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.143...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.143) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 OK
< access-control-allow-credentials: true
< access-control-allow-origin: *
< Connection: close
< content-length: 34
< content-type: application/json
< date: Thu, 24 Jan 2019 08:28:46 GMT
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< server: gunicorn/19.9.0
< X-Crawlera-Slave: 192.186.134.112:4444
< X-Crawlera-Version: 1.34.6-14c425
<
{
"origin": "192.186.134.112"
}
* Closing connection 0
Now the question is how to make HTTPS and POST requests work with Scrapy; my machine has crawlera-ca.crt installed.
nestor posted almost 6 years ago Admin
Log in to the account that has the subscription, then go to Help > Contact Support to open a ticket.
ArjunPython posted almost 6 years ago
Well, OK, thank you for the reminder. My friend and I will check the API key again. Rather than in the open forum, is there any messaging software through which I can contact you directly?
nestor posted almost 6 years ago Admin
That key doesn't exist; your friend must have changed it, or you have a typo. And please be careful when posting the output of -v:
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx' and Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxxxxx... this is a public forum.
ArjunPython posted almost 6 years ago
cao@cao-python:~$ curl -U "xxxxxxxxxxxxxxxxxxxxxxxxx:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
* Trying 64.58.117.175...
* TCP_NODELAY set
* Connected to proxy.crawlera.com (64.58.117.175) port 8010 (#0)
* Proxy auth using Basic with user 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
> GET http://httpbin.org/ip HTTP/1.1
> Host: httpbin.org
> Proxy-Authorization: Basic xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> User-Agent: curl/7.55.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 407 Proxy Authentication Required
< Connection: close
< Content-Length: 0
< Date: Wed, 23 Jan 2019 12:43:06 GMT
* Authentication problem. Ignoring this.
< Proxy-Authenticate: Basic realm="Crawlera"
* HTTP/1.1 proxy connection set close!
< Proxy-Connection: close
< X-Crawlera-Error: bad_proxy_auth
<
* Closing connection 0
nestor posted almost 6 years ago Admin
I've deleted that IP; it's yours.
Your curl command is wrong: it's taking '-vx' as part of the username for authentication.
Try wrapping things separately in quotes:
curl -U "APIKEY:" -vx "proxy.crawlera.com:8010" "http://httpbin.org/ip"
ArjunPython posted almost 6 years ago
HTTPS
cao@cao-python:~$ curl -U ab9defXXXXXXXXXXXXXXXX:-vx proxy.crawlera.com:8010 "https://httpbin.org/ip"
Enter proxy password for user 'ab9defXXXXXXXXXXXXXXXX:-vx':
{
"origin": "XXXXXXXX"
}
nestor posted almost 6 years ago Admin
Can you do a simple curl request from command line?
curl -U APIKEY: -vx proxy.crawlera.com:8010 "http://httpbin.org/ip"
ArjunPython posted almost 6 years ago
I have tried https://www.facebook.com/, https://twitter.com/, https://www.youtube.com/, https://www.google.com/
nestor posted almost 6 years ago Admin
Does it happen for any website? Or just for those two?
ArjunPython posted almost 6 years ago
Let me describe the problem I have encountered. I am a user in China. My virtual machine runs Ubuntu 16.04 with Python 3.5, Scrapy 1.4.0, scrapy-crawlera 1.4.0, and requests 2.18.4. The firewall is off, and port 8010 is allowed on the machine.
Firewall rule settings:
8010 ALLOW IN Anywhere
8010/tcp ALLOW IN Anywhere
443 ALLOW IN Anywhere
443/tcp ALLOW IN Anywhere
3306 (v6) ALLOW IN Anywhere (v6)
6379 (v6) ALLOW IN Anywhere (v6)
5000 (v6) ALLOW IN Anywhere (v6)
8010 (v6) ALLOW IN Anywhere (v6)
8010/tcp (v6) ALLOW IN Anywhere (v6)
443 (v6) ALLOW IN Anywhere (v6)
443/tcp (v6) ALLOW IN Anywhere (v6)
I followed the documentation and made this request:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36',
}
url = "https://www.google.com/"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = "<APIKEY>:"
proxies = {"https": "https://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port),
           "http": "http://{}@{}:{}/".format(proxy_auth, proxy_host, proxy_port)}
r = requests.get(url, proxies=proxies, headers=headers, verify=False)
print(r)
Here is the error output for each request:
Traceback (most recent call last):
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 595, in urlopen
    self._prepare_proxy(conn)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 816, in _prepare_proxy
    conn.connect()
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connection.py", line 294, in connect
    self._tunnel()
  File "/usr/lib/python3.5/http/client.py", line 827, in _tunnel
    (version, code, message) = response._read_status()
  File "/usr/lib/python3.5/http/client.py", line 258, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.5/socket.py", line 575, in readinto
    return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/urllib3/util/retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cao/Desktop/wang_dai_spider/demo_1.py", line 20, in <module>
    r = requests.get(url, proxies=proxies, headers=headers, verify=False)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/home/cao/.virtualenvs/spider/lib/python3.5/site-packages/requests/adapters.py", line 502, in send
    raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', ConnectionResetError(104, 'Connection reset by peer')))
Here is the Scrapy configuration:
DOWNLOADER_MIDDLEWARES = {'scrapy_crawlera.CrawleraMiddleware': 300}
CRAWLERA_ENABLED = True
CRAWLERA_USER = 'APIKEY'
CRAWLERA_PASS = 'xxxx'
Crawl log for Twitter:
2019-01-23 11:37:05 [scrapy.core.engine] INFO: Spider opened
2019-01-23 11:37:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:37:05 [root] INFO: Using crawlera at http://proxy.crawlera.com:8010 (user: ad9dexxxxxx)
2019-01-23 11:37:05 [root] INFO: CrawleraMiddleware: disabling download delays on Scrapy side to optimize delays introduced by Crawlera. To avoid this behaviour you can use the CRAWLERA_PRESERVE_DELAY setting but keep in mind that this may slow down the Crawling
2019-01-23 11:37:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-01-23 11:38:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:39:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:40:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:40:15 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://twitter.com/> (failed 1 times): User timeout caused connection failure: Getting https://twitter.com/ took longer than 190 seconds..
2019-01-23 11:41:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:42:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:43:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-23 11:43:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://twitter.com/> (failed 2 times): User timeout caused connection failure: Getting https://twitter.com/ took longer than 190 seconds..
The Scrapy run eventually ends after timing out.
I hope you can reply as soon as possible; thank you very much.
nestor posted almost 6 years ago Admin
Have you checked the port like I mentioned?
ArjunPython posted almost 6 years ago
Yes. I haven't subscribed to Crawlera yet; I am currently testing with my friend's account to see whether I can crawl Twitter. But why do I keep getting proxy errors?
nestor posted almost 6 years ago Admin
I don't see a Crawlera subscription on your account.
In any case, it seems like you might need to open port 8010 on your machine.
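As a quick sanity check, a minimal sketch of one way to confirm that outbound connections to the proxy port are possible from the machine (the host and port are the ones used throughout this thread):

import socket

# Attempt a plain TCP connection to the Crawlera proxy endpoint.
# Success only shows the port is reachable; authentication is a separate step.
try:
    with socket.create_connection(("proxy.crawlera.com", 8010), timeout=10):
        print("TCP connection to proxy.crawlera.com:8010 succeeded")
except OSError as exc:
    print("Could not reach proxy.crawlera.com:8010:", exc)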
ArjunPython posted almost 6 years ago
python 3.6.3
requests 2.20