Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot bootstrap with SSL using 1.4.5 on python 3.7 #1741

Closed
gingercookiemage opened this issue Mar 15, 2019 · 19 comments
Closed

Cannot bootstrap with SSL using 1.4.5 on python 3.7 #1741

gingercookiemage opened this issue Mar 15, 2019 · 19 comments

Comments

@gingercookiemage
Copy link

gingercookiemage commented Mar 15, 2019

After switching to new kafka-python 1.4.5 and upgrading to python 3.7 I keep getting these logs in infinite loop:

Give up sending metadata request since no node is available

I have noticed that broker list is always empty and node-id=bootstrap . Running the same code with 1.4.4 and python 3.6 worked perfectly.
KafkaProducer( bootstrap_servers='{}:{}'.format( config.get('tracing_host'), config.get('tracing_port') ), api_version=(0, 10, 1), security_protocol='SSL', ssl_check_hostname=True, ssl_cafile=config.get('ssl_ca_file'), ssl_certfile=config.get('ssl_cert_file'), ssl_keyfile=config.get('ssl_key_file') )
Any ideas what to try next?

@jeffwidman
Copy link
Collaborator

jeffwidman commented Mar 15, 2019

I suspect this is a small bug that crept in somewhere through this change: #1736

Can you post DEBUG-level logs?

Does this happen immediately upon start, or does it work fine for a while and then you hit this? (Since the purpose of the change was to infinitely retry)

Do you also hit this with KafkaConsumer? Or is it only KafkaProducer that fails for you?

If there's a bug, it will be a KafkaClient level issue, so I expect will manifest in both consumers and producers. However, I just tried with KafkaConsumer in a non-SSL environment but cannot reproduce...

@dpkp
Copy link
Owner

dpkp commented Mar 15, 2019 via email

@dpkp
Copy link
Owner

dpkp commented Mar 16, 2019

Can you try 1.4.5 w/ python 3.6 and see if you get the same results? My hunch is this is related to ssl changes in python 3.7.

@gingercookiemage
Copy link
Author

gingercookiemage commented Mar 16, 2019

This happens immediately after KafkaProducer tries to initialize. I haven't tried this with the Consumer as I only need to post some messages.

DEBUG       - Starting the Kafka producer [kafka.producer.kafka:336]
DEBUG       - Added sensor with name connections-closed [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name connections-created [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name select-time [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name io-time [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name bufferpool-wait-time [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name batch-size [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name compression-rate [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name queue-time [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name produce-throttle-time [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name records-per-request [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name bytes [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name record-retries [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name errors [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name record-size-max [kafka.metrics.metrics:156]
DEBUG       - Starting Kafka producer I/O thread. [kafka.producer.sender:55]
DEBUG       - Initializing connection to node bootstrap for metadata request [kafka.client:814]
DEBUG       - Kafka producer started [kafka.producer.kafka:404]
DEBUG       - Requesting metadata update for topic zipkin [kafka.producer.kafka:672]
DEBUG       - Initiating connection to node bootstrap at xx.xx.xxx.xxx:9093 [kafka.client:366]
DEBUG       - Added sensor with name bytes-sent-received [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name bytes-sent [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name bytes-received [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name request-latency [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name node-bootstrap.bytes-sent [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name node-bootstrap.bytes-received [kafka.metrics.metrics:156]
DEBUG       - Added sensor with name node-bootstrap.latency [kafka.metrics.metrics:156]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <disconnected> [IPv4 None]>: creating new socket [kafka.conn:349]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <disconnected> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: setting socket option (6, 1, 1) [kafka.conn:354]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: configuring default SSL Context [kafka.conn:446]
INFO        - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: Loading SSL CA from ca.pem [kafka.conn:454]
INFO        - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: Loading SSL Cert from cert.pem [kafka.conn:458]
INFO        - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: Loading SSL Key from rapid.key [kafka.conn:459]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: wrapping socket in ssl context [kafka.conn:471]
INFO        - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: connecting to xx.xx.xxx.xxx:9093 [('xx.xx.xxx.xxx', 9093) IPv4] [kafka.conn:366]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: established TCP connection [kafka.conn:386]
DEBUG       - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: initiating SSL handshake [kafka.conn:388]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]
DEBUG       - Give up sending metadata request since no node is available [kafka.client:783]

This last log keeps going indefinitely until it crashes my app.

@gingercookiemage
Copy link
Author

gingercookiemage commented Mar 16, 2019

Can you try 1.4.5 w/ python 3.6 and see if you get the same results? My hunch is this is related to ssl changes in python 3.7.

With this configuration this error keeps getting logged:

ERROR       - Uncaught error in kafka producer I/O thread [kafka.producer.sender:62]
Traceback (most recent call last):
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/producer/sender.py", line 60, in run
    self.run_once()
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/producer/sender.py", line 159, in run_once
    self._client.poll(poll_timeout_ms)
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/client_async.py", line 567, in poll
    self._maybe_connect(node_id)
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/client_async.py", line 383, in _maybe_connect
    conn.connect()
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/conn.py", line 413, in connect
    if self._try_handshake():
  File "/Users/test/envs/service/lib/python3.6/site-packages/kafka/conn.py", line 484, in _try_handshake
    self._sock.do_handshake()
  File "/Users/test/.pyenv/versions/3.6.2/lib/python3.6/ssl.py", line 1061, in do_handshake
    self._sslobj.do_handshake()
  File "/Users/test/.pyenv/versions/3.6.2/lib/python3.6/ssl.py", line 688, in do_handshake
    match_hostname(self.getpeercert(), self.server_hostname)
  File "/Users/test/.pyenv/versions/3.6.2/lib/python3.6/ssl.py", line 321, in match_hostname
    % (hostname, ', '.join(map(repr, dnsnames))))
ssl.CertificateError: hostname 'xx.xx.xxx.xxx' doesn't match either of '*.xxxxxx.com', 'xxxxxx.com'
ERROR       - Uncaught error in kafka producer I/O thread [kafka.producer.sender:62]

For now only combination that works is python 3.6/ kafka-python 1.4.4

@dpkp
Copy link
Owner

dpkp commented Mar 16, 2019

Thanks for the logs! It looks like the ssl.CertificateError is the root cause of the problem. Does your actual data suggest that this should not be an error here? I don't have a theory right now on why you would be able to get around this error w/ 1.4.4 but not 1.4.5. If there is a certificate error, I would expect that neither 1.4.4 or 1.4.5 would allow you to connect.

Re: the difference between python 3.6 and 3.7 in kafka-python, PR #1694 looks like it covers ssl.CertificateError in 3.7 but not 3.6, which means this error is silently ignored on 3.7 when really it should be raised.

@dpkp
Copy link
Owner

dpkp commented Mar 16, 2019

Also, you might try setting ssl_check_hostname=False -- I think that would get your 3.7 / 1.4.5 setup working.

@gingercookiemage
Copy link
Author

Also, you might try setting ssl_check_hostname=False -- I think that would get your 3.7 / 1.4.5 setup working.

Adding this option works for 3.6/ 1.4.5 resolving above ssl.CertificateError. But with python 3.7 I keep getting the same log from the #1741 (comment) . The certificate files are valid as I succeeded sending messages to kafka topic with 3.6/1.4.4/ and 1.4.5(with ssl_check_hostname=False option).
I will try to dig something up about this, because I need it to work in docker py3.7 images.

@dpkp
Copy link
Owner

dpkp commented Mar 16, 2019

Interesting... Are you able to hotpatch kafka/conn.py ? If so, try removing these lines from 1.4.5 and run w/ py 3.7 (i expect that it will still fail, but we should get more insight into the underlying exception):

kafka/conn.py

 489         # python 3.7 throws OSError
 490         except OSError:
 491             pass

We probably should not have accepted this PR. I'm sorry about that. But again I want to emphasize that SSL connections are currently untested and will remain so until some kind soul such as yourself helps us out by submitting a PR to setup SSL broker testing via our test/fixtures.py ...

@dpkp
Copy link
Owner

dpkp commented Mar 16, 2019

Looking deeper into cpython ssl implementation, it looks like a change in 3.7 may have simply broken non-blocking ssl connections where the first connect_ex attempt does not succeed: python/cpython#5252

sigh. If I'm reading ssl.py correctly, I don't think there will be an easy workaround for this. The best I can think of is to somehow switch to blocking connects when using ssl.

@gingercookiemage
Copy link
Author

gingercookiemage commented Mar 16, 2019

Interesting... Are you able to hotpatch kafka/conn.py ? If so, try removing these lines from 1.4.5 and run w/ py 3.7 (i expect that it will still fail, but we should get more insight into the underlying exception):

kafka/conn.py

 489         # python 3.7 throws OSError
 490         except OSError:
 491             pass

We probably should not have accepted this PR. I'm sorry about that. But again I want to emphasize that SSL connections are currently untested and will remain so until some kind soul such as yourself helps us out by submitting a PR to setup SSL broker testing via our test/fixtures.py ...

The error excepted in this place was:

ERROR       - Uncaught error in kafka producer I/O thread [kafka.producer.sender:62]
Traceback (most recent call last):
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/producer/sender.py", line 60, in run
    self.run_once()
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/producer/sender.py", line 159, in run_once
    self._client.poll(poll_timeout_ms)
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/client_async.py", line 567, in poll
    self._maybe_connect(node_id)
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/client_async.py", line 383, in _maybe_connect
    conn.connect()
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/conn.py", line 413, in connect
    if self._try_handshake():
  File "/Users/test/envs/test37/lib/python3.7/site-packages/kafka/conn.py", line 484, in _try_handshake
    self._sock.do_handshake()
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1117, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: UNKNOWN_STATE] unknown state (_ssl.c:1056)

@gingercookiemage
Copy link
Author

gingercookiemage commented Mar 17, 2019

I've managed to patch v1.4.4 to work with py3.7 by adding ValueError except around connect_ex in kafka.conn.py:

       try:
            ret = self._sock.connect_ex(self._sock_addr)
        except ValueError as e:
            log.debug('Socket connection error: %s', e)

and OSError execept around do_handshake:

  try:
       self._sock.do_handshake()
       return True
  except OSError:
       pass

This fixed all problems for me. The new release 1.4.5 already has these excepts, so I suppose some new bugs were introduced through PR. I will try to localize them.

@dpkp
Copy link
Owner

dpkp commented Mar 17, 2019 via email

@gingercookiemage
Copy link
Author

Hey, I resolved my problem with #1745 . The problem was in the _generate_bootstrap_brokers function.

@dpkp
Copy link
Owner

dpkp commented Mar 20, 2019

Right on. Unfortunately your changes to _generate_bootstrap_brokers would break in other contexts, so it is not a general solution.

@dpkp dpkp changed the title Give up sending metadata request since no node is available infinite loop Cannot bootstrap with SSL using 1.4.5 on python 3.7 Mar 20, 2019
@gingercookiemage
Copy link
Author

After some debugging I think I understand the underlying problem here. From version 3.7 python uses openssl to match hostnames during SSLSocket.do_handshake() (https://docs.python.org/3/library/ssl.html.) You can not provide ip_address as server_hostname to wrap_socket() it must be domain name, though this worked up until python3.6. (https://bugs.python.org/issue34440).
When I logged all the errors around do_handshake() in kafka.conn.py I got:

DEBUG - <BrokerConnection node_id=bootstrap host=xx.xx.xxx.xxx:9093 <connecting> [IPv4 ('xx.xx.xxx.xxx', 9093)]>: initiating SSL handshake [kafka.conn:389] ERROR - The operation did not complete (read) (_ssl.c:1056) [kafka.conn:495] ERROR - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for 'xx.xx.xxx.xxx'. (_ssl.c:1056) [kafka.conn:499] ERROR - [SSL: UNKNOWN_STATE] unknown state (_ssl.c:1056) [kafka.conn:499]
By calling dns_lookup in __generate_bootstrap_brokers and yielding BrokerMetadata(...host=ip...) you implicitly set up BrokerConnection with host=ip (kafka.client_async._maybe_connect) and this creates problems later on during ssl handshake if ssl_check_hostname option is set to True which is by default.

P.S. By setting ssl_check_hostname=False option things will work out with py3.7 but this is bypassing the problem.(I made a mistake in earlier post by saying this works only on versions less than 3.7)

@dpkp
Copy link
Owner

dpkp commented Mar 20, 2019 via email

@dpkp
Copy link
Owner

dpkp commented Mar 22, 2019

Should be fixed on master. You may need to configure ssl_ciphers if the default ciphers on your python installation dont match your kafka brokers (ssl_ciphers='ALL' should work, but may enable / use insecure ciphers)

@dpkp dpkp closed this as completed Mar 22, 2019
@gingercookiemage
Copy link
Author

Hey, thanks a lot man. When are you planning to release this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants