Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle OSError to properly recycle SSL connection, fix infinite loop #2100

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

sibiryakov
Copy link
Contributor

@sibiryakov sibiryakov commented Aug 7, 2020

Here's a stack trace we had our logs flooded with.

[07/15/2020 08:51:14.799: ERROR/kafka.producer.sender] Uncaught error in kafka producer I/O thread
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/kafka/producer/sender.py", line 60, in run
self.run_once()
File "/usr/local/lib/python3.6/site-packages/kafka/producer/sender.py", line 160, in run_once
self._client.poll(timeout_ms=poll_timeout_ms)
File "/usr/local/lib/python3.6/site-packages/kafka/client_async.py", line 580, in poll
self._maybe_connect(node_id)
File "/usr/local/lib/python3.6/site-packages/kafka/client_async.py", line 390, in _maybe_connect
conn.connect()
File "/usr/local/lib/python3.6/site-packages/kafka/conn.py", line 426, in connect
if self._try_handshake():
File "/usr/local/lib/python3.6/site-packages/kafka/conn.py", line 505, in _try_handshake
self._sock.do_handshake()
File "/usr/local/lib/python3.6/ssl.py", line 1077, in do_handshake
self._sslobj.do_handshake()
File "/usr/local/lib/python3.6/ssl.py", line 689, in do_handshake
self._sslobj.do_handshake()
OSError: [Errno 0] Error

The problem is Python 3.6 is returning OSError, which is not expected. Such exception is propagated to the caller and code making recycling of such connection is not executed. Therefore, Producer is guaranteed to get the same exception on a next call to poll().

Throwing of OSError doesn't seem to be documented even in latest Python docs. See 3.8 docs, but there are signs of it in 3.8 source code.


This change is Reviewable

@sibiryakov
Copy link
Contributor Author

anything else needed here @dpkp and @jeffwidman ?

@dpkp
Copy link
Owner

dpkp commented Sep 7, 2020

OSError is a very broad exception and has different coverage in different python versions. Can we narrow this down at all?

@sibiryakov
Copy link
Contributor Author

@dpkp we can't. I have seen it (not its subclass) thrown in my application and Python source code also suggests that it can be thrown in certain situation. I'm afraid we would have to apply this due to way Python is developed.

@dpkp
Copy link
Owner

dpkp commented Sep 30, 2020

It looks like this is a python bug that has been resolved in recent versions of python: https://bugs.python.org/issue31122

I'd prefer to try to limit this to cases where known buggy versions of python are running, and also to limit it to errno=0 (note the stacktrace OSError: [Errno 0] Error)

@ErikBrewster
Copy link

Could we just add a case to the exception handler in conn.py (_try_handshake.py:513)? I think the self.close here is the key to handling this case properly. Otherwise, it appears that the connection error gets recycled, not the connection itself, leaving us with an endless loop of logging the error.

        except OSError as ex:
            if ex.errno == 0:
                log.warning('SSL connection closed by server during handshake: OSError 0.')
                self.close(Errors.KafkaConnectionError('SSL connection closed by server during handshake'))
           else:
               raise

I've been trying to fix the OSError issue by moving my project from Python 3.6.x to 3.9.6 and it seems to fix it. I'm seeing this trio of log entries in my log where previously it would get into the endless loop OSError issue. I see many of these trios consecutively, separated by typ. < 100 mS. I take this to mean the network got unhealthy for some reason.

INFO:2021-09-03 07:14:05,210:kafka.conn:380:<BrokerConnection node_id=2 host=myKafkaUrl:443 <connecting> [IPv4 ('x.x.x.x', 443)]>: connecting to myKafkaUrl:443 [('x.x.x.x', 443) IPv4]
WARNING:2021-09-03 07:14:05,225:kafka.conn:514:SSL connection closed by server during handshake.
INFO:2021-09-03 07:14:05,225:kafka.conn:919:<BrokerConnection node_id=2 host=myKafkaUrl:443 <handshake> [IPv4 ('x.x.x.x', 443)]>: Closing connection. KafkaConnectionError: SSL connection closed by server during handshake
WARNING:2021-09-03 07:14:05,226:kafka.client:331:Node 2 connection failed -- refreshing metadata

I think this means that I'm seeing an SSLEOFError (instead of OSError 0) -- good, but I wish we could make the logs a bit more descriptive

So maybe something like this (new error case and more descriptive logging)?

# old ssl in python2.6 will swallow all SSLErrors here...
        except (SSLWantReadError, SSLWantWriteError):
            pass
        except (SSLZeroReturnError, ConnectionError, TimeoutError, SSLEOFError) as ex:
            log.warning('SSL connection closed by server during handshake: %s', type(ex).__name__)
            self.close(Errors.KafkaConnectionError('SSL connection closed by server during handshake'))
        except OSError as ex:
            if ex.errno == 0:
                log.warning('SSL connection closed by server during handshake: OSError 0.')
                self.close(Errors.KafkaConnectionError('SSL connection closed by server during handshake'))
           else:
               raise
        # Other SSLErrors will be raised to user

@wbarnha
Copy link
Collaborator

wbarnha commented Aug 3, 2023

https://bugs.python.org/msg375481 notes that the aforementioned bug had a patch propagated into Python 3.8+. Assuming we resume supporting the library for Python 3.8+, we could revisit this if it's still encountered in newer Python versions. Otherwise, we can close this PR.

@wbarnha wbarnha added the wontfix label Aug 9, 2023
wbarnha and others added 16 commits March 7, 2024 10:31
…terations for Kafka 0.8.2 and Python 3.12 (dpkp#159)

* skip failing tests for PyPy since they work locally

* Reconfigure tests for PyPy and 3.12

* Skip partitioner tests in test_partitioner.py if 3.12 and 0.8.2

* Update test_partitioner.py

* Update test_producer.py

* Timeout tests after ten minutes

* Set 0.8.2.2 to be experimental from hereon

* Formally support PyPy 3.9
* Test Kafka 0.8.2.2 using Python 3.11 in the meantime

* Override PYTHON_LATEST conditionally in python-package.yml

* Update python-package.yml

* add python annotation to kafka version test matrix

* Update python-package.yml

* try python 3.10
* Remove support for EOL'ed versions of Python

* Update setup.py
Too many MRs to review... so little time.
After stop/start kafka service, kafka-python may use 100% CPU caused by
busy-retry while the socket was closed. This fix the issue by unregister
the socket if the fd is negative.

Co-authored-by: Orange Kao <[email protected]>
The former has been deprecated since setuptools 56

Co-authored-by: micwoj92 <[email protected]>
* docs: Update syntax in README.rst

* docs: Update code block syntax in docs/index.rst

---------

Co-authored-by: HalfSweet <[email protected]>
* Fix crc32c's __main__ for Python 3

* Remove TODO from _crc32c.py

---------

Co-authored-by: Yonatan Goldschmidt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants