EG becomes unusable after receiving FD added twice error due to a race condition! #1051

Closed
rahul26goyal opened this issue Mar 24, 2022 · 3 comments

rahul26goyal commented Mar 24, 2022

Description

This issue is related to #1047

We are seeing a ZMQStream FD leak on the Jupyter Enterprise Gateway (EG) server while running native Spark kernels on Kubernetes.

Based on the analysis below, the leak happens at the Jupyter application layer, which integrates with the Tornado IOLoop to manage the ZMQ socket streams (see add_handler).

At a high level, the leak happens when a kernel restart races with a shutdown of the same kernel, and the resulting FD leak lasts for about 1 minute.

This issue is seen only on remote kernels, not on local kernels, because of various differences that come with remote kernels.

Sample exception trace from one such occurrence:

[I 2022-02-23 04:23:06.531 EnterpriseGatewayApp] Kernel started: 4160c20e-c2ee-4df5-8c28-3716d135de5d
[I 220223 04:23:06 web:2243] 201 POST /api/kernels (127.0.0.1) 8313.27ms
[I 2022-02-23 04:23:07.566 EnterpriseGatewayApp] successfully validated request: /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels
[W 2022-02-23 04:23:07.567 EnterpriseGatewayApp] No session ID specified
[E 220223 04:23:08 web:1793] Uncaught exception GET /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels (172.18.96.19)
    HTTPServerRequest(protocol='https', host='localhost:18888', method='GET', uri='/api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/websocket.py", line 954, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 275, in open
        self.create_stream()
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 128, in create_stream
        self.channels[channel] = stream = meth(self.kernel_id, identity=identity)
      File "/rhr/notebook-env/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 34, in wrapped
        r = method(*args, **kwargs)
      File "/rhr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/rhr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 113, in __init__
        self._init_io_state()
      File "/rhr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 540, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 28 added twice

More on this below.
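
The failure mode in the trace can be modeled with the standard library alone. The sketch below is not Tornado's code: `HandlerRegistry` is a hypothetical stand-in for the IOLoop's fd-to-handler table, and the only behavior it borrows is the "added twice" check visible in the traceback. It registers a socket, closes it, and then registers a freshly opened socket, which on POSIX systems typically receives the same FD number.

```python
import socket


class HandlerRegistry:
    """Hypothetical stand-in for an event loop's fd -> handler table."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, fd, handler):
        # Same kind of guard as the one that raises in the traceback.
        if fd in self.handlers:
            raise ValueError("fd %s added twice" % fd)
        self.handlers[fd] = handler

    def remove_handler(self, fd):
        self.handlers.pop(fd, None)


def reuse_fd_after_close(unregister_first):
    """Close a registered socket, then register a freshly opened one.

    Returns (old_fd, new_fd, outcome), where outcome is "registered" or
    the text of the ValueError raised by the registry.
    """
    registry = HandlerRegistry()

    first = socket.socket()
    old_fd = first.fileno()
    registry.add_handler(old_fd, "first stream")

    if unregister_first:
        registry.remove_handler(old_fd)  # the clean shutdown path
    first.close()  # the OS may now hand this FD number out again

    second = socket.socket()
    new_fd = second.fileno()
    try:
        registry.add_handler(new_fd, "second stream")
        outcome = "registered"
    except ValueError as exc:
        outcome = str(exc)
    finally:
        second.close()
    return old_fd, new_fd, outcome
```

Calling `reuse_fd_after_close(unregister_first=False)` fails with the "added twice" message whenever the OS reuses the FD number, while `unregister_first=True` always registers cleanly. That mirrors the advice in the pyzmq warning later in the logs to close the stream rather than the socket.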

Reproduce

Since there is a race condition involved here, the scenario is not easy to reproduce. But we have been able to reproduce this issue multiple times by doing the following steps:

  1. Start a new kernel.
  2. Trigger Restart Kernel from the Notebook UI.
  3. Immediately after step 2, trigger Shutdown Kernel from the Notebook UI.
  4. Repeat steps 2 and 3 until the leak error appears in the JEG logs.
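
The steps above can also be approximated against the REST API instead of the Notebook UI. This is only a sketch under assumptions: the endpoints are the ones that appear in the logs (`POST /api/kernels`, `POST /api/kernels/<id>/restart`, `DELETE /api/kernels/<id>`), while the base URL, the kernel name, the `delay` value, and the injectable `send` parameter are hypothetical. Authentication and TLS settings are not handled.

```python
import json
import threading
import time
import urllib.request


def kernel_request(base_url, method, path="", body=None):
    """Build a request for the /api/kernels endpoints seen in the logs."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/kernels" + path, data=data, method=method
    )
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def race_restart_and_shutdown(base_url, kernel_name, delay=0.0,
                              send=urllib.request.urlopen):
    """Start a kernel, then issue restart and shutdown back to back."""
    # Step 1: start a new kernel and read its ID from the response.
    with send(kernel_request(base_url, "POST", body={"name": kernel_name})) as resp:
        kernel_id = json.load(resp)["id"]

    def restart():
        try:
            send(kernel_request(base_url, "POST", "/%s/restart" % kernel_id)).close()
        except OSError:
            pass  # HTTP errors are expected once the race is hit

    # Step 2: trigger the restart in the background.
    worker = threading.Thread(target=restart)
    worker.start()
    time.sleep(delay)  # assumption: tune until the shutdown lands mid-restart

    # Step 3: shut the same kernel down while the restart is in flight.
    try:
        send(kernel_request(base_url, "DELETE", "/" + kernel_id)).close()
    except OSError:
        pass
    worker.join()
    return kernel_id
```

Running `race_restart_and_shutdown(...)` in a loop and watching the JEG logs for "added twice" corresponds to step 4. Because this is a race, a single run may well not trigger it.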

Diagnosis of the issue

Below are the log lines from one such scenario, which we captured and analyzed in depth. To do that, we had to add new log lines in multiple places across different packages, so you will see log lines that may not look familiar! 😜
Comments for each event are inline in the logs below.
I have also removed some log lines that were not relevant to the issue.
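
The "fd which got closed / added / remained" lines in the logs below come from comparing the FDs registered with the IOLoop before and after shutdown. A minimal sketch of that comparison follows. The function names are hypothetical, and `stale_fds` assumes the snapshot is a dict mapping each fd to a `(file object, handler)` pair, which is how the logged entries appear to be structured.

```python
def diff_fd_snapshots(before, after):
    """Compare two collections of registered FDs."""
    before, after = set(before), set(after)
    return {
        "closed": before - after,    # registered before, gone afterwards
        "added": after - before,     # newly registered
        "remained": before & after,  # still registered
    }


def stale_fds(handlers):
    """Return FDs still registered whose file object reports itself closed."""
    return sorted(
        fd
        for fd, (fileobj, _handler) in handlers.items()
        if getattr(fileobj, "closed", False)
    )
```

Fed with the FD lists printed before and after shutdown in the logs, `diff_fd_snapshots` yields `{19, 21, 23, 25, 26}` as closed, nothing added, and FD 24 among those that remained. FD 24 is the one whose socket shows as closed while still registered.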

# Leaked FD scenario: JEG logs
[I 2022-03-23 11:21:57.192 EnterpriseGatewayApp] validating incoming request: GET: /api/kernelspecs
[I 220323 11:21:57 web:2243] 200 GET /api/kernelspecs (127.0.0.1) 2.29ms

# Starting a new Kernel 

[I 2022-03-23 11:21:58.581 EnterpriseGatewayApp] validating incoming request: POST: /api/kernels
[I 2022-03-23 11:21:58.584 EnterpriseGatewayApp] KERNEL_NAMESPACE provided by client: xxxx-kube-namespace
[I 2022-03-23 11:21:58.588 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: ...
Starting IPython kernel for Spark in Kubernetes mode on behalf of user xxx-notebook
.....
# a new socket is opened to the kernel's iopub_channel for activity monitoring
[W 220323 11:22:06 zmqstream:114] creaed a new ZMQ socket with fd: 17: <zmq.Socket(zmq.SUB) at 0x7f269c0c22f0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:06.778 EnterpriseGatewayApp] Kernel started: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[I 220323 11:22:06 web:2243] 201 POST /api/kernels (127.0.0.1) 8197.82ms
# completion of kernel start

# Base FD (4,5, 17) => 17 is coming from kernel start (iopub_channel)
# notebook validating kernel exists
[I 2022-03-23 11:22:08.162 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a
# new logging added to print the current list of FDs known to the Tornado io_loop
[I 2022-03-23 11:22:08.163 EnterpriseGatewayApp] GET Kernel call.. IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:08.163 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:22:08.163 EnterpriseGatewayApp] FD: 17: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c0c22f0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f26ad4770d0>>
[I 2022-03-23 11:22:08.163 EnterpriseGatewayApp] FD: 5: file obj: <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39018)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f26ae008290>>
[I 2022-03-23 11:22:08.163 EnterpriseGatewayApp] COMPLETED pringing FD =============
[I 220323 11:22:09 web:2243] 200 GET /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a (127.0.0.1) 1.54ms

# web socket request received for 41c5fd50-bc44-4f79-b7fc-0f7ee400639a/channels
[I 2022-03-23 11:22:10.942 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/channels
[W 2022-03-23 11:22:10.942 EnterpriseGatewayApp] No session ID specified


# created a socket to read kernel info, then closed it
[W 220323 11:22:10 zmqstream:114] creaed a new ZMQ socket with fd: 19: <zmq.Socket(zmq.XREQ) at 0x7f269c0ed9f0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 220323 11:22:10 web:2243] 101 GET /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/channels (127.0.0.1) 9.19ms
# opening 4 new sockets to the kernel (19, 21, 23, 25) => ZMQ
[I 2022-03-23 11:22:10.950 EnterpriseGatewayApp] creating channel for shell
[W 220323 11:22:10 zmqstream:114] creaed a new ZMQ socket with fd: 19: <zmq.Socket(zmq.XREQ) at 0x7f269c0ed9f0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:10.950 EnterpriseGatewayApp] FD used for channel: shell: fd: 19
[I 2022-03-23 11:22:10.951 EnterpriseGatewayApp] creating channel for control
[W 220323 11:22:10 zmqstream:114] creaed a new ZMQ socket with fd: 21: <zmq.Socket(zmq.XREQ) at 0x7f26aa940980>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:10.951 EnterpriseGatewayApp] FD used for channel: control: fd: 21
[I 2022-03-23 11:22:10.951 EnterpriseGatewayApp] creating channel for iopub
[W 220323 11:22:10 zmqstream:114] creaed a new ZMQ socket with fd: 23: <zmq.Socket(zmq.SUB) at 0x7f269c100f30>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:10.951 EnterpriseGatewayApp] FD used for channel: iopub: fd: 23
[I 2022-03-23 11:22:10.951 EnterpriseGatewayApp] creating channel for stdin
[W 220323 11:22:10 zmqstream:114] creaed a new ZMQ socket with fd: 25: <zmq.Socket(zmq.XREQ) at 0x7f269d583600>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:10.952 EnterpriseGatewayApp] FD used for channel: stdin: fd: 25

# Current FDs (4,5,17) + (19, 21,23,25) as below and all sockets are active
[I 2022-03-23 11:22:22.754 EnterpriseGatewayApp] Starting buffering for 41c5fd50-bc44-4f79-b7fc-0f7ee400639a:05dcdfac-bf00525aebe62619a0b166be
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] GET Kernel call..Webapp IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] FD: 17: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c0c22f0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f26ad4770d0>>
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] FD: 19: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269c0ed9f0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0410>>
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] FD: 21: file obj: <zmq.Socket(zmq.XREQ) at 0x7f26aa940980>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0c50>>
[I 2022-03-23 11:22:23.860 EnterpriseGatewayApp] FD: 23: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c100f30>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0d50>>
[I 2022-03-23 11:22:23.861 EnterpriseGatewayApp] FD: 25: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269d583600>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0810>>
[I 2022-03-23 11:22:23.861 EnterpriseGatewayApp] FD: 5: file obj: <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39212)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697f95d10>>
[I 2022-03-23 11:22:23.861 EnterpriseGatewayApp] COMPLETED pringing FD =============
[I 220323 11:22:25 web:2243] 200 GET /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a (127.0.0.1) 1.89ms

# Kernel restart request received 
[I 2022-03-23 11:22:25.431 EnterpriseGatewayApp] validating incoming request: POST: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart
# successfully terminated.
[W 2022-03-23 11:22:31.591 EnterpriseGatewayApp] KubernetesProcessProxy.terminate_container_resources, pod: rhr-kube-namespace.k41c5fd50-bc44-4f79-b7fc-0f7ee400639a-c7e2cd7fb6831c3a-driver, kernel ID: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a has been terminated.
[I 2022-03-23 11:22:31.624 EnterpriseGatewayApp] KERNEL_NAMESPACE provided by client: xxx-kube-namespace
[I 2022-03-23 11:22:31.629 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: ....
Starting IPython kernel for Spark in Kubernetes mode on behalf of user rhr-notebook
# new activity socket opened for iopub_channel...
[W 220323 11:22:39 zmqstream:114] creaed a new ZMQ socket with fd: 26: <zmq.Socket(zmq.SUB) at 0x7f269c04cb40>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:39.133 EnterpriseGatewayApp] Kernel restarted: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
# What is this FD opened for? Answer: for the kernel_info_request sent over the shell_channel
[W 220323 11:22:39 zmqstream:114] created a new ZMQ socket with fd: 24: <zmq.Socket(zmq.XREQ) at 0x7f269d540bb0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
# kernel restarted successfully, but the API has not responded yet because we are still waiting for kernel info

# Current FDs (4,5,17) + (19, 21,23,25) + (24 + 26)
# JEG started to process shutdown request as the loop freed up
[I 2022-03-23 11:22:39.690 EnterpriseGatewayApp] validating incoming request: DELETE: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a

[I 2022-03-23 11:22:39.690 EnterpriseGatewayApp] FD before shutdown...
[I 2022-03-23 11:22:39.690 EnterpriseGatewayApp] Webapp IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:39.690 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 19: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269c0ed9f0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0410>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 21: file obj: <zmq.Socket(zmq.XREQ) at 0x7f26aa940980>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0c50>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 23: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c100f30>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0d50>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 25: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269d583600>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0d0810>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 27: file obj: <ssl.SSLSocket fd=27, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39218)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697f95510>>

[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 26: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c04cb40>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269d581350>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 24: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269d540bb0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269d581850>>

[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 28: file obj: <ssl.SSLSocket fd=28, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39236)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f269d5aead0>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 29: file obj: <ssl.SSLSocket fd=29, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39278)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4110>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 30: file obj: <ssl.SSLSocket fd=30, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39558)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4a10>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 31: file obj: <ssl.SSLSocket fd=31, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39772)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4e90>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] FD: 35: file obj: <ssl.SSLSocket fd=35, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39930)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff5950>>
[I 2022-03-23 11:22:39.691 EnterpriseGatewayApp] COMPLETED printing FD =============
[I 2022-03-23 11:22:39.692 EnterpriseGatewayApp] Discarding 3 buffered messages for 41c5fd50-bc44-4f79-b7fc-0f7ee400639a:05dcdfac-bf00525aebe62619a0b166be
[I 2022-03-23 11:22:39.692 EnterpriseGatewayApp] Kernel shutdown: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[W 2022-03-23 11:22:45.968 EnterpriseGatewayApp] Unable to delete pod: {'api_version': 'v1',

[W 2022-03-23 11:22:45.969 EnterpriseGatewayApp] KubernetesProcessProxy.terminate_container_resources, pod: xxx-kube-namespace.k41c5fd50-bc44-4f79-b7fc-0f7ee400639a-3a8fae7fb6839d52-driver, kernel ID: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a has been terminated.
[I 2022-03-23 11:22:46.001 EnterpriseGatewayApp] FD after shutdown...
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] Webapp IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 27: file obj: <ssl.SSLSocket fd=27, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39218)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697f95510>>

# Leaked FD detected: FD 24 is still registered although its socket is closed
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 24: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269d540bb0 closed>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269d581850>>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 28: file obj: <ssl.SSLSocket fd=28, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39236)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f269d5aead0>>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 29: file obj: <ssl.SSLSocket fd=29, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39278)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4110>>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 30: file obj: <ssl.SSLSocket fd=30, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39558)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4a10>>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 31: file obj: <ssl.SSLSocket fd=31, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39772)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff4e90>>
[I 2022-03-23 11:22:46.002 EnterpriseGatewayApp] FD: 35: file obj: <ssl.SSLSocket fd=35, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39930)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697ff5950>>
[I 2022-03-23 11:22:46.003 EnterpriseGatewayApp] COMPLETED pringing FD =============
[I 2022-03-23 11:22:46.003 EnterpriseGatewayApp] fd which got closed: {19, 21, 23, 25, 26}
[I 2022-03-23 11:22:46.003 EnterpriseGatewayApp] fd which got added: set()
[I 2022-03-23 11:22:46.003 EnterpriseGatewayApp] fd which remained: { 35, 4, 24, 27, 28, 29, 30, 31 }
[I 220323 11:22:46 web:2243] 204 DELETE /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a (127.0.0.1) 6314.11ms
# shutdown successfully

[I 2022-03-23 11:22:46.006 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a

[I 2022-03-23 11:22:46.006 EnterpriseGatewayApp] successfully validated request: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[W 220323 11:22:46 web:1787] 404 GET /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a (127.0.0.1): Kernel does not exist: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[W 220323 11:22:46 web:2243] 404 GET /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a (127.0.0.1) 2.61ms

# restart requested for the non-existent kernel 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
[I 2022-03-23 11:22:46.007 EnterpriseGatewayApp] validating incoming request: POST: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart
[I 2022-03-23 11:22:46.007 EnterpriseGatewayApp] successfully validated request: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart
[E 2022-03-23 11:22:46.008 EnterpriseGatewayApp] Exception restarting kernel
    Traceback (most recent call last):
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 108, in post
    .....
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a)
[E 220323 11:22:46 web:2243] 500 POST /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart (127.0.0.1) 3.59ms

..... all subsequent restarts for the same kernel ID got a 404 response
----

# start kernel request received...
[I 2022-03-23 11:22:47.646 EnterpriseGatewayApp] validating incoming request: POST: /api/kernels
[I 2022-03-23 11:22:47.648 EnterpriseGatewayApp] KERNEL_NAMESPACE provided by client: xxx-kube-namespace
[I 2022-03-23 11:22:47.653 EnterpriseGatewayApp] KubernetesProcessProxy: kernel launched. Kernel image: 895885662937.dkr.ecr.us-west-2.amazonaws.com/notebook-spark/rhr-6.2.0:latest, KernelID: 86f022e9-5383-4c0e-8d9f-3fa807d8996b, cmd: '['/usr/local/share/jupyter/kernels/spark_python_kubernetes/bin/run.sh', '--RemoteProcessProxy.kernel-id', '86f022e9-5383-4c0e-8d9f-3fa807d8996b', '--RemoteProcessProxy.response-address', '192.168.3.230:46429', '--RemoteProcessProxy.spark-context-initialization-mode', 'lazy']'

Starting IPython kernel for Spark in Kubernetes mode on behalf of user rhr-notebook

+
22/03/23 11:22:51 INFO ShutdownHookManager: Deleting directory /tmp/spark-f5cb7762-89a0-44b0-b7f9-3044d35bb484
[W 220323 11:22:55 zmqstream:114] creaed a new ZMQ socket with fd: 19: <zmq.Socket(zmq.SUB) at 0x7f269d583de0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:55.144 EnterpriseGatewayApp] Kernel started: 86f022e9-5383-4c0e-8d9f-3fa807d8996b
[I 220323 11:22:55 web:2243] 201 POST /api/kernels (127.0.0.1) 7500.26ms
# new kernel 86f022e9-5383-4c0e-8d9f-3fa807d8996b started successfully..



[I 2022-03-23 11:22:55.707 EnterpriseGatewayApp] validating incoming request: POST: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart

[I 2022-03-23 11:22:55.707 EnterpriseGatewayApp] successfully validated request: /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart
[E 2022-03-23 11:22:55.708 EnterpriseGatewayApp] Exception restarting kernel
    Traceback (most recent call last):
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 108, in post
        yield maybe_future(km.restart_kernel(kernel_id))
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 307, in restart_kernel
        self._check_kernel_id(kernel_id)
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 387, in _check_kernel_id
        raise web.HTTPError(404, u'Kernel does not exist: %s' % kernel_id)
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a)
[E 220323 11:22:55 web:2243] 500 POST /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart (127.0.0.1) 2.35ms

# New base for FD (4, 9, 12, 19, 24, 27)  => 19 from kernel start, 24 is the leaked FD from the previous kernel.

[I 2022-03-23 11:22:56.528 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] successfully validated request: /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] GET Kernel call..Webapp IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 27: file obj: <ssl.SSLSocket fd=27, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 39218)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f2697f95510>>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 24: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269d540bb0 closed>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269d581850>>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 19: file obj: <zmq.Socket(zmq.SUB) at 0x7f269d583de0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f2697fe9f50>>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 9: file obj: <ssl.SSLSocket fd=9, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 40392)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f26ae1639d0>>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] FD: 12: file obj: <ssl.SSLSocket fd=12, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 40394)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f26ae101790>>
[I 2022-03-23 11:22:56.529 EnterpriseGatewayApp] COMPLETED pringing FD =============
[I 220323 11:22:56 web:2243] 200 GET /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b (127.0.0.1) 1.87ms


# received web socket request for 86f022e9-5383-4c0e-8d9f-3fa807d8996b/channels

[I 2022-03-23 11:22:59.306 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b/channels
[I 2022-03-23 11:22:59.306 EnterpriseGatewayApp] successfully validated request: /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b/channels
[W 2022-03-23 11:22:59.307 EnterpriseGatewayApp] No session ID specified

# creating new socket streams (20, 22, 24); FD 24 is the leaked FD, so this fails

[W 220323 11:22:59 zmqstream:114] creaed a new ZMQ socket with fd: 20: <zmq.Socket(zmq.XREQ) at 0x7f269c0c2910>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 220323 11:22:59 web:2243] 101 GET /api/kernels/86f022e9-5383-4c0e-8d9f-3fa807d8996b/channels (127.0.0.1) 11.18ms
[I 2022-03-23 11:22:59.316 EnterpriseGatewayApp] creating channel for shell
[W 220323 11:22:59 zmqstream:114] creaed a new ZMQ socket with fd: 20: <zmq.Socket(zmq.XREQ) at 0x7f269c0c2d00>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:59.316 EnterpriseGatewayApp] FD used for channel: shell: fd: 20
[I 2022-03-23 11:22:59.316 EnterpriseGatewayApp] creating channel for control
[W 220323 11:22:59 zmqstream:114] creaed a new ZMQ socket with fd: 22: <zmq.Socket(zmq.XREQ) at 0x7f269d5dbec0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:22:59.317 EnterpriseGatewayApp] FD used for channel: control: fd: 22
# trying to reuse the leaked FD, handed out again by the OS
[I 2022-03-23 11:22:59.317 EnterpriseGatewayApp] creating channel for iopub
[W 220323 11:22:59 zmqstream:114] creaed a new ZMQ socket with fd: 24: <zmq.Socket(zmq.SUB) at 0x7f269c0acb40>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[E 2022-03-23 11:22:59.317 EnterpriseGatewayApp] Error opening stream: fd 24 added twice <zmq.Socket(zmq.SUB) at 0x7f269c0acb40>
    Traceback (most recent call last):
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 305, in open
        self.create_stream()
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 154, in create_stream
        self.channels[channel] = stream = meth(self.kernel_id, identity=identity)
      File "/rhr/notebook-env/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 34, in wrapped
        r = method(*args, **kwargs)
      File "/rhr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/rhr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 115, in __init__
        self._init_io_state()
      File "/rhr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 541, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError(f"fd {fd} added twice {fileobj}")
    ValueError: fd 24 added twice <zmq.Socket(zmq.SUB) at 0x7f269c0acb40>
# Issue detected.
# For the next 1 minute, every request that tried to use FD 24 failed because the FD was still held by the restart hook.
# While the socket is held at the application layer, at the OS layer the FD no longer exists in the process's FD list!

[I 220323 11:23:32 web:2243] 200 GET /api/swagger.json (192.168.178.112) 0.77ms
[I 220323 11:23:32 web:2243] 200 GET /api/swagger.json (192.168.119.68) 0.74ms
[I 220323 11:23:32 web:2243] 200 GET /api/swagger.json (192.168.146.106) 0.74ms

# After 1 minute, the kernel_info timeout fires in the restart handler and the socket gets cleaned up.

[W 2022-03-23 11:23:39.134 EnterpriseGatewayApp] Timeout waiting for kernel_info_reply: 41c5fd50-bc44-4f79-b7fc-0f7ee400639a
/rhr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py:423: UserWarning: Unregistering FD 24 after closing socket. This could result in unregistering handlers for the wrong socket. Please use stream.close() instead of closing the socket directly.
  self.close()
[E 2022-03-23 11:23:39.134 EnterpriseGatewayApp] Exception restarting kernel
    Traceback (most recent call last):
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 108, in post
        yield maybe_future(km.restart_kernel(kernel_id))
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 769, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/rhr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 345, in restart_kernel
        yield future
      File "/rhr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
    tornado.util.TimeoutError: Timeout waiting for restart
[E 220323 11:23:39 web:2243] 500 POST /api/kernels/41c5fd50-bc44-4f79-b7fc-0f7ee400639a/restart (127.0.0.1) 73703.92ms


# FD 24 becomes available for reuse by future requests.
[I 2022-03-23 11:24:02.581 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/34687cc1-57bf-4a41-a8f6-b7eb23886075/channels
[W 2022-03-23 11:24:02.582 EnterpriseGatewayApp] No session ID specified
[W 220323 11:24:02 zmqstream:114] creaed a new ZMQ socket with fd: 19: <zmq.Socket(zmq.XREQ) at 0x7f269c055f30>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 220323 11:24:02 web:2243] 101 GET /api/kernels/34687cc1-57bf-4a41-a8f6-b7eb23886075/channels (127.0.0.1) 9.78ms
[I 2022-03-23 11:24:02.590 EnterpriseGatewayApp] creating channel for shell
[W 220323 11:24:02 zmqstream:114] creaed a new ZMQ socket with fd: 21: <zmq.Socket(zmq.XREQ) at 0x7f269c043750>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:24:02.591 EnterpriseGatewayApp] FD used for channel: shell: fd: 21
[I 2022-03-23 11:24:02.591 EnterpriseGatewayApp] creating channel for control
[W 220323 11:24:02 zmqstream:114] creaed a new ZMQ socket with fd: 22: <zmq.Socket(zmq.XREQ) at 0x7f269c043c20>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:24:02.591 EnterpriseGatewayApp] FD used for channel: control: fd: 22
[I 2022-03-23 11:24:02.591 EnterpriseGatewayApp] creating channel for iopub
[W 220323 11:24:02 zmqstream:114] creaed a new ZMQ socket with fd: 23: <zmq.Socket(zmq.SUB) at 0x7f269c043fa0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:24:02.592 EnterpriseGatewayApp] FD used for channel: iopub: fd: 23
[I 2022-03-23 11:24:02.592 EnterpriseGatewayApp] creating channel for stdin
[W 220323 11:24:02 zmqstream:114] creaed a new ZMQ socket with fd: 24: <zmq.Socket(zmq.XREQ) at 0x7f269c0439f0>: loop: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:24:02.592 EnterpriseGatewayApp] FD used for channel: stdin: fd: 24
[I 2022-03-23 11:24:03.684 EnterpriseGatewayApp] Starting buffering for 34687cc1-57bf-4a41-a8f6-b7eb23886075:74c24ec2-e31c531cd35968a547e1f68b

[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] successfully validated request: /api/kernels/34687cc1-57bf-4a41-a8f6-b7eb23886075
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] GET Kernel call..Webapp IOLOOP: <class 'tornado.platform.asyncio.AsyncIOMainLoop'>: 8737756610873
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] FD: 4: file obj: <socket.socket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('0.0.0.0', 9547)>: handler: accept_handler: <function add_accept_handler.<locals>.accept_handler at 0x7f26ae0edb90>
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] FD: 17: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c0ac7c0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f2697f40910>>
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] FD: 21: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269c043750>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f2697fe1a10>>
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] FD: 22: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269c043c20>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c0610d0>>
[I 2022-03-23 11:24:04.849 EnterpriseGatewayApp] FD: 23: file obj: <zmq.Socket(zmq.SUB) at 0x7f269c043fa0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c061f50>>
[I 2022-03-23 11:24:04.850 EnterpriseGatewayApp] FD: 24: file obj: <zmq.Socket(zmq.XREQ) at 0x7f269c0439f0>: handler: _handle_events: <bound method ZMQStream._handle_events of <zmq.eventloop.zmqstream.ZMQStream object at 0x7f269c061750>>
[I 2022-03-23 11:24:04.850 EnterpriseGatewayApp] FD: 5: file obj: <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 9547), raddr=('127.0.0.1', 41444)>: handler: _handle_events: <bound method BaseIOStream._handle_events of <tornado.iostream.SSLIOStream object at 0x7f269c061bd0>>
[I 2022-03-23 11:24:04.850 EnterpriseGatewayApp] COMPLETED pringing FD =============

Expected behavior

There should not be any FD leak, but one occurs on EG because the kernel is remote and fetching kernel_info from it takes extra time. On local kernels this issue is rare, probably because the response arrives immediately and the race condition does not occur.
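To make the failure mode concrete, here is a minimal sketch of how a stale handler entry collides with a reused FD number. `ToyLoop` is a hypothetical stand-in for an event loop's FD-to-handler table, not Tornado's actual class; only the "added twice" error message is modeled on the traceback above.

```python
class ToyLoop:
    """Toy FD-to-handler table (hypothetical; not Tornado's real IOLoop)."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, fd, handler):
        # Mirrors the check seen in the traceback: one handler per FD.
        if fd in self.handlers:
            raise ValueError(f"fd {fd} added twice")
        self.handlers[fd] = handler

    def remove_handler(self, fd):
        self.handlers.pop(fd, None)


loop = ToyLoop()

# Restart path: registers a stream on fd 24 and waits for kernel_info_reply.
loop.add_handler(24, "restart kernel_info stream")

# Shutdown path: the socket is closed underneath the stream, but the loop
# entry is never removed, so the OS is free to hand out fd 24 again.

# A new channel request receives fd 24 from the OS and tries to register it.
try:
    loop.add_handler(24, "new iopub stream")
    collided = False
except ValueError as exc:
    collided = True
    message = str(exc)

# Once the restart times out and the stale entry is removed, fd 24 works again.
loop.remove_handler(24)
loop.add_handler(24, "new stdin stream")
```

In this toy model the collision lasts exactly as long as the stale entry stays registered, which matches the roughly one-minute window seen in the logs.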

We need to handle this race condition between kernel restart and shutdown requests on EG.
A few high-level thoughts:

  1. Should we even allow a kernel shutdown to be processed while a restart has not yet completed on EG? We can make use of the kernel.restarting field, which is already set during kernel restart on EG. Code:
  2. Can we preserve the kernel_info_request connection socket within the kernel object and close it when executing the shutdown request? This would probably require a change in notebook / jupyter_server, where MappingKernelManager lives.
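The first idea could look roughly like the sketch below: shutdown waits while a restart is in flight. `ToyKernel`, `ToyKernelManager`, and the `poll` and `timeout` parameters are hypothetical names used only for illustration; only the `restarting` flag comes from the description above, and the real EG code paths may differ.

```python
import asyncio


class ToyKernel:
    """Hypothetical kernel record; only `restarting` mirrors the issue text."""

    def __init__(self):
        self.restarting = False
        self.alive = True


class ToyKernelManager:
    """Hypothetical manager, not the real MappingKernelManager."""

    def __init__(self):
        self.kernels = {}

    async def restart_kernel(self, kernel_id, delay=0.05):
        kernel = self.kernels[kernel_id]
        kernel.restarting = True
        try:
            # Stand-in for waiting on kernel_info_reply from a remote kernel.
            await asyncio.sleep(delay)
        finally:
            kernel.restarting = False

    async def shutdown_kernel(self, kernel_id, poll=0.01, timeout=1.0):
        kernel = self.kernels[kernel_id]
        waited = 0.0
        # Defer shutdown until the in-flight restart has finished.
        while kernel.restarting and waited < timeout:
            await asyncio.sleep(poll)
            waited += poll
        if kernel.restarting:
            raise RuntimeError(f"kernel {kernel_id} is still restarting")
        kernel.alive = False
        del self.kernels[kernel_id]


async def demo():
    manager = ToyKernelManager()
    manager.kernels["k1"] = ToyKernel()
    events = []

    async def restart():
        await manager.restart_kernel("k1")
        events.append("restart done")

    async def shutdown():
        await asyncio.sleep(0)  # let the restart set its flag first
        await manager.shutdown_kernel("k1")
        events.append("shutdown done")

    await asyncio.gather(restart(), shutdown())
    return events, manager.kernels


events, remaining = asyncio.run(demo())
```

Polling keeps the sketch short; rejecting the shutdown outright or awaiting a per-kernel lock would be alternative designs with different trade-offs for the caller.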

I am open to other suggestions as well, and I am interested in contributing the fix back.

Context

Jupyter Enterprise Gateway : 2.1.0
Jupyter Notebook : 6.0.3
Jupyter Client : 6.1.3
Tornado : 6.0.4
PyZMQ : 19.0.2

Thanks!

@rahul26goyal rahul26goyal changed the title EG becomes unusable after receiving FD added twice error due to a race condition EG becomes unusable after receiving FD added twice error due to a race condition [WIP] Mar 24, 2022
@rahul26goyal rahul26goyal changed the title EG becomes unusable after receiving FD added twice error due to a race condition [WIP] EG becomes unusable after receiving FD added twice error due to a race condition! Mar 24, 2022
rahul26goyal added a commit to rahul26goyal/enterprise_gateway that referenced this issue Mar 25, 2022
@kevin-bates
Member

Hi @rahul26goyal - I am not able to reproduce this issue and would like for you to try to reproduce this on EG 2.6, if that's possible. I will proceed with the PR's review anyway and we can decide if its merge is appropriate at that time. Thanks for your understanding.

@rahul26goyal
Contributor Author

rahul26goyal commented Mar 29, 2022

Thanks for trying this out, @kevin-bates. I will try to reproduce this on our setup with EG 2.6 and get back to you.
Please continue to review the changes in the meantime.

rahul26goyal added a commit to rahul26goyal/enterprise_gateway that referenced this issue Apr 29, 2022
rahul26goyal added a commit to rahul26goyal/enterprise_gateway that referenced this issue Apr 29, 2022
kevin-bates pushed a commit that referenced this issue May 16, 2022
* Attempt to fix issue #1051: FD leak due to race condition
* handled the changes for kernel auto restarter
* review comments addressed
@kevin-bates
Member

Resolved via #1054.
