Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enterprise Gateway server stops responding after receiving FD added twice error #1047

Closed
Vishwajeet0510 opened this issue Mar 14, 2022 · 8 comments
Labels

Comments

@Vishwajeet0510
Copy link

Description

We are using Jupyter Enterprise Gateway (v2.1.0) which runs on a Kuberetes pod. After a kernel is started, UI tries to establish Websocket connections and ZMQ streams are created for handling the I/O with kernel.
These use File descriptors(FDs) for the management of the socket connections.

However, when we shutdown a kernel and then restart it from Notebook UI, in case the kernel is assigned a FD which is currently being used by some other kernel or was not unregistered due to some reason when the previous kernel was shutdown, a raise ValueError("fd %s added twice" % fd) exception is thrown.

Further, we see that notebook UI server keeps on bombarding with websocket open request again and again to create the connection and every time the request fails with the same FD added twice error. Since, the JEG gets bombed with this request continuously, it leaves the JEG completely unresponsive.

Jupyter Enterprise Gateway Logs (Current Behavior)

# sendign kerenl shutdown request
[I 02-23  04:22:47.479 EnterpriseGatewayApp] successfully validated request: /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a
[I 220223 04:22:48 web:2243] 101 GET /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a/channels (127.0.0.1) 1.94ms
[I 220223 04:22:51 web:2243] 200 GET /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a (127.0.0.1) 1.08ms
[I 220223 04:22:51 web:2243] 200 GET /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a (127.0.0.1) 0.59ms
[I 2022-02-23 04:22:51.121 EnterpriseGatewayApp] Kernel shutdown: 2562c820-1a9e-4f00-80fa-c0496139c62a
[I 220223 04:22:57 web:2243] 204 DELETE /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a (127.0.0.1) 6351.44ms
[I 2022-02-23 04:22:57.479 EnterpriseGatewayApp] successfully validated request: /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a/restart
[E 2022-02-23 04:22:57.479 EnterpriseGatewayApp] Exception restarting kernel
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 83, in post
        yield maybe_future(km.restart_kernel(kernel_id))
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 234, in wrapper
        yielded = ctx_run(next, result)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 307, in restart_kernel
        self._check_kernel_id(kernel_id)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 387, in _check_kernel_id
        raise web.HTTPError(404, u'Kernel does not exist: %s' % kernel_id)
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: 2562c820-1a9e-4f00-80fa-c0496139c62a)
    
#Restart failed because the kerel was already shutdown
[E 220223 04:22:57 web:2243] 500 POST /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a/restart (127.0.0.1) 1.44ms

[I 2022-02-23 04:23:06.531 EnterpriseGatewayApp] Kernel started: 4160c20e-c2ee-4df5-8c28-3716d135de5d
[I 220223 04:23:06 web:2243] 201 POST /api/kernels (127.0.0.1) 8313.27ms
[I 2022-02-23 04:23:07.566 EnterpriseGatewayApp] successfully validated request: /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels
[W 2022-02-23 04:23:07.567 EnterpriseGatewayApp] No session ID specified
[E 220223 04:23:08 web:1793] Uncaught exception GET /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels (172.18.96.19)
    HTTPServerRequest(protocol='https', host='localhost:18888', method='GET', uri='/api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/websocket.py", line 954, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 275, in open
        self.create_stream()
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 128, in create_stream
        self.channels[channel] = stream = meth(self.kernel_id, identity=identity)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 34, in wrapped
        r = method(*args, **kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 113, in __init__
        self._init_io_state()
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 540, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError("fd %s added twice" % fd)
 *ValueError: fd 28 added twice*
[W 2022-02-23 04:23:10.623 EnterpriseGatewayApp] No session ID specified
[I 220223 04:23:10 web:2243] 101 GET /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels (127.0.0.1) 1.90ms
[E 220223 04:23:10 web:1793] Uncaught exception GET /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels (127.0.0.1)
    HTTPServerRequest(protocol='https', host='localhost:18888', method='GET', uri='/api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/websocket.py", line 954, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 275, in open
        self.create_stream()
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 128, in create_stream
        self.channels[channel] = stream = meth(self.kernel_id, identity=identity)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 34, in wrapped
        r = method(*args, **kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 113, in __init__
        self._init_io_state()
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 540, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 29 added twice
[I 2022-02-23 04:23:12.439 EnterpriseGatewayApp] successfully validated request: /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels
.
.
E 220223 04:56:21 web:1793] Uncaught exception GET /api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels (127.0.0.1)
    HTTPServerRequest(protocol='https', host='localhost:18888', method='GET', uri='/api/kernels/4160c20e-c2ee-4df5-8c28-3716d135de5d/channels', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/websocket.py", line 954, in _accept_connection
        open_result = handler.open(*handler.open_args, **handler.open_kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 275, in open
        self.create_stream()
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 128, in create_stream
        self.channels[channel] = stream = meth(self.kernel_id, identity=identity)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/multikernelmanager.py", line 34, in wrapped
        r = method(*args, **kwargs)
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 113, in __init__
        self._init_io_state()
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 540, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 33 added twice
[I 220223 04:56:22 web:2243] 200 GET /api/kernelspecs (127.0.0.1) 2.01ms
[E 220223 04:56:23 tcpserver:334] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 319, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1378, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1058, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 147, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 33 added twice
.
.

Even if the exception is handled properly, we still observe the same behavior.

[I 2022-03-02 13:29:37.668 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/836745cd-c803-45a5-979d-70f4ef803a96/channels
[W 2022-03-02 13:29:37.668 EnterpriseGatewayApp] No session ID specified
[I 220302 13:29:37 web:2243] 101 GET /api/kernels/836745cd-c803-45a5-979d-70f4ef803a96/channels (127.0.0.1) 2.44ms
[E 2022-03-02 13:29:37.670 EnterpriseGatewayApp] Error opening stream: fd 33 added twice
[I 2022-03-02 13:29:38.882 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/836745cd-c803-45a5-979d-70f4ef803a96/channels
.
.
[I 220302 13:29:40 web:2243] 101 GET /api/kernels/836745cd-c803-45a5-979d-70f4ef803a96/channels (127.0.0.1) 2.37ms
[E 2022-03-02 13:29:40.026 EnterpriseGatewayApp] Error opening stream: fd 33 added twice
[I 2022-03-02 13:29:41.210 EnterpriseGatewayApp] validating incoming request: GET: /api/kernels/836745cd-c803-45a5-979d-70f4ef803a96/channels
[I 2022-03-02 13:29:41.210 EnterpriseGatewayApp] executing validation for token: 1d2d650e-c77a-4f7b-be55-75f12c8eb027

.
.
.

Moreover, we see that sometimes the FD is present in the PID directory and sometimes it isn't i.e. when ValueError: fd xx added twice is observed

// FD is present
[E 220314 07:07:36 tcpserver:336] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 319, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1378, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1058, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 152, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 17 added twice

[xxx-notebook@jeg-sp6gyyuh2d7x0-5fc79fb459-6rg76 ~]$ ls /proc/8/fd
0  10   101  103  11  13  15  17  19  20  22  24  26  28  3   31  33  35  37  39  40  42  44  46  48  5   51  53  55  57  59  61  63  65  67  69  70  72  77  79  80  82  84  89  90  92  94  96  98


// FD is not present
[E 220314 07:46:38 tcpserver:330] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 315, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1404, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1076, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 10 added twice

[emr-notebook@jeg-sp6gyyuh2d7x0-679bc5584c-pphvm ~]$ ls /proc/8/fd
0 1 11 2 22 29 3 32 33 34 35 37 45 47 48 5 57 58 59 6 60 61 62 63 64 65 7 71 8 9

When stopping the Notebook server some FDs get unregistered

Notebook Server Logs

^C[I 16:35:50.723 NotebookApp] interrupted
Serving notebooks from local directory: /Users/vervishw/Documents/jeg_2.1.0_env
1 active kernel
The Jupyter Notebook is running at:
http://localhost:8888/?token=95f0ebe651783628985892946dafac32a9fc50aa999b1d12
 or http://127.0.0.1:8888/?token=95f0ebe651783628985892946dafac32a9fc50aa999b1d12
Kernels will be managed by the Gateway server running at:
https://localhost:18888/
Shutdown this notebook server (y/[n])? ^C[C 16:35:51.173 NotebookApp] received signal 2, stopping
[I 16:35:51.177 NotebookApp] Shutting down 1 kernel
[D 16:35:51.178 NotebookApp] Request delete kernel at: https://localhost:18888/api/kernels/390155fb-5eb7-4160-91e4-4faac78c6a3a

JEG Logs

.
.
[E 220314 09:40:53 tcpserver:330] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 315, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1404, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1076, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 22 added twice
/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py:435: UserWarning: Unregistering FD 12 after closing socket. This could result in unregistering handlers for the wrong socket. Please use stream.close() instead of closing the socket directly.
  self.close()
/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py:435: UserWarning: Unregistering FD 19 after closing socket. This could result in unregistering handlers for the wrong socket. Please use stream.close() instead of closing the socket directly.
  self.close()
/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py:435: UserWarning: Unregistering FD 22 after closing socket. This could result in unregistering handlers for the wrong socket. Please use stream.close() instead of closing the socket directly.
  self.close()

// FD list after (12 and 19 are still present)
[emr-notebook@jeg-sp6gyyuh2d7x0-679bc5584c-pphvm fd]$ ls
0 1 11 12 13 14 15 16 17 18 19 2 20 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 5 50 51 52 53 54 55 6 7 8 9

Since we see that FD causing error has been unregistered, JEG server become usable again. However, if the same FD which is giving the error does not gets unregistered at the time of Notebook server restart, the JEG server remains in unresponsive state.

Reproduction Steps

  1. Start a Jupyter Notebook server and port-forward the requests where the JEG server is running.
  2. Start a new Spark notebook.
  3. Do multiple iterations of shutdown and restart (from the Notebook UI) until this error is observed (This error is intermittent).
  4. Once the error is seen, observe the JEG server become unresponsive.
  5. Restart the Jupyter Notebook server. It may or may not help the JEG server to return to normal state.

Expected behavior

  1. A FD which is currently unavailable should not be assigned to a new kernel.
  2. In case, it gets assigned and the exception is thrown, same requests for creating the connection should not be made to the JEG server.
  3. FDs should be unregistered properly. We see the warnings as above occasionally.

Workaround

The only workaround is to restart the JEG server manually.

Context

Jupyter Enterprise Gateway : 2.1.0
Jupyter Notebook : 6.0.3
Jupyter Client : 6.1.3
Tornado : 6.0.4
PyZMQ : 19.0.2

CC : @kevin-bates @rahul26goyal

@welcome
Copy link

welcome bot commented Mar 14, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other other community members to contribute more effectively.
welcome
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@kevin-bates
Copy link
Member

We need to narrow this down, there is too much going here to adequately comment, but let me try.

  1. The initial log entries are contradicting. The restart SHOULD fail with a 404 on the kernel id because that kernel was shutdown just prior to the restart request. This is working as designed.
[I 220223 04:22:57 web:2243] 204 DELETE /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a (127.0.0.1) 6351.44ms
...
[E 220223 04:22:57 web:2243] 500 POST /api/kernels/2562c820-1a9e-4f00-80fa-c0496139c62a/restart (127.0.0.1) 1.44ms

I'm unable to comment on the FD-related stuff (nor have I ever seen these kinds of errors) but would ask that you capture logs with DEBUG enabled since it appears that area of the code does log some debug statements.

Start a Jupyter Notebook server and port-forward the requests where the JEG server is running.

Regarding this entry, by port-forward the requests are you merely referring to the fact that you've configured your notebook server to forward kernel requests to the EG using --gateway-url (or NB2KG equivalent) or have you configured some kind of actual port-forwarding mechanisms?

Do multiple iterations of shutdown and restart (from the Notebook UI) until this error is observed (This error is intermittent).

Is this practical!? Obviously, the restart is going to fail if the shutdown completes so are you trying to introduce a race condition here? What is the purpose or use-case going on??

I don't have a 2.1.0 environment handy, can you try this with EG 2.6.0?

@Vishwajeet0510
Copy link
Author

Hello Kevin

  1. Yes, we agree, the kernel shutdown and restart are working as expected.

  2. After enabling DEBUG logs on JEG.

[I 220315 15:17:15 web:2250] 204 DELETE /api/kernels/b45dce14-567e-4544-979b-ae74873e4c31 (127.0.0.1) 10231.16ms
[E 2022-03-15 15:17:23.668 EnterpriseGatewayApp] Exception restarting kernel
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 83, in post
        yield maybe_future(km.restart_kernel(kernel_id))
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
        value = future.result()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 209, in wrapper
        yielded = next(result)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 307, in restart_kernel
        self._check_kernel_id(kernel_id)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 387, in _check_kernel_id
        raise web.HTTPError(404, u'Kernel does not exist: %s' % kernel_id)
    tornado.web.HTTPError: HTTP 404: Not Found (Kernel does not exist: b45dce14-567e-4544-979b-ae74873e4c31)
[E 220315 15:17:23 web:2250] 500 POST /api/kernels/b45dce14-567e-4544-979b-ae74873e4c31/restart (127.0.0.1) 4.84ms
[D 2022-03-15 15:17:24.767 EnterpriseGatewayApp] RemoteMappingKernelManager.start_kernel: spark_python_kubernetes, kernel_username: emr-notebook
[D 2022-03-15 15:17:24.769 EnterpriseGatewayApp] Instantiating kernel 'PySpark (Kubernetes)' with process proxy: enterprise_gateway.services.processproxies.k8s.KubernetesProcessProxy
.
.
[D 2022-03-15 15:17:30.814 EnterpriseGatewayApp] 11: Waiting to connect to k8s pod in namespace 'kube-system'. Name: 'kf111d43b-197f-4dbd-8e46-90215a31dac9-1e57617f8e27c89d-driver', Status: 'Running', Pod IP: '192.168.163.201', KernelID: 'f111d43b-197f-4dbd-8e46-90215a31dac9'
[D 2022-03-15 15:17:32.119 EnterpriseGatewayApp] Received Payload 'uAopLOi1pPQMTA5JC6ll9LL7h5f/jQem3gmXqrGDULmP004Vkn5h4MQUsKgpiyemNSx0oZRSrRqIcDgzbadDMknzpzetVmLbyh8tN/CVGSgNtXRTr1sloclrPdDzEOsmYBeTFvKUbISzUH/Zre1JvyKNsHgXs1TD/vnJqdozn2ej7v2WwBIS78TE1UpYBw/2n5MrKB89aBZxHjzv1J+iPBbgXldMVoGyM22S28cboilGu5/gKhL1iNozeXBUu7CvKTmn0OXvSPeUx5Veb4EUxkshONpIq01riBdd2MKmD4MCMb4piW1I76yVD8xTgyjR0/1R/2n8tmR9s44QMkU7fQuMbIsfbVLN4k8wY/4J+hfxXvWzNmhqNxdyZHcCU/WLfqBuAp7IMr38/IUsVgaomA=='
[D 2022-03-15 15:17:32.119 EnterpriseGatewayApp] Decrypted Payload '{"shell_port": 40857, "iopub_port": 53397, "stdin_port": 54583, "control_port": 35843, "hb_port": 36597, "ip": "0.0.0.0", "key": "554faef1-4b70-46ea-9d72-278a5877252a", "transport": "tcp", "signature_scheme": "hmac-sha256", "kernel_name": "", "pid": "74", "pgid": "24", "comm_port": 40441}'
[D 2022-03-15 15:17:32.119 EnterpriseGatewayApp] Connect Info received from the launcher is as follows '{'shell_port': 40857, 'iopub_port': 53397, 'stdin_port': 54583, 'control_port': 35843, 'hb_port': 36597, 'ip': '0.0.0.0', 'key': '554faef1-4b70-46ea-9d72-278a5877252a', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'pid': '74', 'pgid': '24', 'comm_port': 40441}'
[D 2022-03-15 15:17:32.120 EnterpriseGatewayApp] Host assigned to the Kernel is: 'kf111d43b-197f-4dbd-8e46-90215a31dac9-1e57617f8e27c89d-driver' '192.168.163.201'
[D 2022-03-15 15:17:32.120 EnterpriseGatewayApp] Established gateway communication to: 192.168.163.201:40441 for KernelID 'f111d43b-197f-4dbd-8e46-90215a31dac9'
[D 2022-03-15 15:17:32.120 EnterpriseGatewayApp] Updated pid to: 74
[D 2022-03-15 15:17:32.120 EnterpriseGatewayApp] Updated pgid to: 24
[D 2022-03-15 15:17:32.122 EnterpriseGatewayApp] Received connection info for KernelID 'f111d43b-197f-4dbd-8e46-90215a31dac9' from host 'kf111d43b-197f-4dbd-8e46-90215a31dac9-1e57617f8e27c89d-driver': {'shell_port': 40857, 'iopub_port': 53397, 'stdin_port': 54583, 'control_port': 35843, 'hb_port': 36597, 'ip': '192.168.163.201', 'key': '554faef1-4b70-46ea-9d72-278a5877252a', 'transport': 'tcp', 'signature_scheme': 'hmac-sha256', 'kernel_name': '', 'comm_port': 40441}...
[D 2022-03-15 15:17:32.123 EnterpriseGatewayApp] Connecting to: tcp://192.168.163.201:35843
[D 2022-03-15 15:17:32.123 EnterpriseGatewayApp] Connecting to: tcp://192.168.163.201:53397
[E 220315 15:17:32 web:1792] Uncaught exception POST /api/kernels (127.0.0.1)
    HTTPServerRequest(protocol='https', host='localhost:18888', method='POST', uri='/api/kernels', version='HTTP/1.1', remote_ip='127.0.0.1')
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/web.py", line 1703, in _execute
        result = await result
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/emr/notebook-env/lib/python3.7/site-packages/enterprise_gateway/services/kernels/handlers.py", line 72, in post
        yield super(MainKernelHandler, self).post()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
        value = future.result()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/handlers.py", line 46, in post
        kernel_id = yield maybe_future(km.start_kernel(kernel_name=model['name']))
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
        value = future.result()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 742, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/emr/notebook-env/lib/python3.7/site-packages/enterprise_gateway/services/kernels/remotemanager.py", line 77, in start_kernel
        kernel_id = yield gen.maybe_future(super(RemoteMappingKernelManager, self).start_kernel(*args, **kwargs))
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
        value = future.result()
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/gen.py", line 748, in run
        yielded = self.gen.send(value)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 171, in start_kernel
        self.start_watching_activity(kernel_id)
      File "/emr/notebook-env/lib/python3.7/site-packages/notebook/services/kernels/kernelmanager.py", line 401, in start_watching_activity
        kernel._activity_stream = kernel.connect_iopub()
      File "/emr/notebook-env/lib/python3.7/site-packages/jupyter_client/ioloop/manager.py", line 21, in wrapped
        return ZMQStream(socket, self.loop)
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 127, in __init__
        self._init_io_state()
      File "/emr/notebook-env/lib/python3.7/site-packages/zmq/eventloop/zmqstream.py", line 552, in _init_io_state
        self.io_loop.add_handler(self.socket, self._handle_events, self.io_loop.READ)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 32 added twice
[E 220315 15:17:32 web:2250] 500 POST /api/kernels (127.0.0.1) 7360.36ms
[E 220315 15:17:37 tcpserver:330] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 315, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1404, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1076, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 32 added twice
[E 220315 15:17:37 tcpserver:330] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 315, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1404, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1076, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 32 added twice
[E 220315 15:17:37 tcpserver:330] Error in connection callback
    Traceback (most recent call last):
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/tcpserver.py", line 315, in _handle_connection
        read_chunk_size=self.read_chunk_size,
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1404, in __init__
        self._add_io_state(self.io_loop.WRITE)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/iostream.py", line 1076, in _add_io_state
        self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
      File "/emr/notebook-env/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 97, in add_handler
        raise ValueError("fd %s added twice" % fd)
    ValueError: fd 32 added twice
.
.
  1. Start a Jupyter Notebook server and port-forward the requests where the JEG server is running.
    Regarding this, we've configured our notebook server to forward kernel requests to the EG similar to --gateway-url. Like kubectl port-forward <pod-name> <port>:<port>

  2. Do multiple iterations of shutdown and restart (from the Notebook UI) until this error is observed (This error is intermittent).
    As we already mentioned this error is intermittent but it still happens sometimes which leaves the JEG unresponsive. We are trying to create a race condition here so that this error can be reproduced.

  3. We have a EG 2.1.0 up and running. Will try to check this on latest EG version and get back to you with updates. Meanwhile, can you please continue looking into this issue?

Thanks

@kevin-bates
Copy link
Member

Thanks for the updated information and DEBUG output. Unfortunately, I don't really have ideas simply because I haven't seen this before and don't have the ability to try to reproduce this.

Regarding this, we've configured our notebook server to forward kernel requests to the EG similar to --gateway-url. Like kubectl port-forward :

I'm still a little confused here. Are you using --gateway-url in your notebook server instance? If not, how are your kernel lifecycle management requests getting to your EG K8s pod? Or are you setting --gateway-url and port-forwarding the port specified in the --gateway-url option to the port at which your EG K8s service is running? You used the term similar to which implies that --gateway-url is not in use. Can you give an example of the kubectl command?

If you find moving to 2.6 is too much of a change, it might be good to first move to 2.2.0 instead. This is the first release in which async kernel management is available. It seems like this could be a race condition with port termination, but I'm not certain.

Prior to shutdown/restart, are your kernels fully starting?

I still think this exercise is impractical. What outcome do you want - the kernel is shut down or the kernel is restarted? Can this be reproduced using only restart?

Thanks.

@Vishwajeet0510
Copy link
Author

Vishwajeet0510 commented Mar 16, 2022

Hello Kevin

The notebook server is using the same as you said. Just like it is mentioned here. Sorry if there has been a confusion.

We believe that kernel should get started always. The issue can be observed in both kernel starts and restarts. Let's say while starting/restarting a kernel we received ValueError: fd xx added twice. Once we receive this error, the same request should not be made again to EG but the current behavior is EG gets bombed with the same requests (which fails again with the same error) which renders the EG unresponsive. The expectations are as below:

  1. A FD which is currently unavailable should not be assigned to a new kernel.
  2. In case, it gets assigned and the exception is thrown, same requests for creating the connection should not be made to the JEG server.
  3. FDs should be unregistered properly. We see the warnings as mentioned earlier.

The reproduction steps are just to reproduce the error as we have seen this issue intermittently(no fixed way of reproducing the scenario).

@kevin-bates
Copy link
Member

I understand.

Where I'm having difficulty is why you are seeing this and others are not. Are you suggesting this is only because you're pounding the shutdown/restart of the same kernel?

Based on the stack traces, the FD is probably associated with one of the 5 ZMQ ports but, in EG, we always generate a new set of ports and the (remote) kernel always provides those ports on restarts, so there shouldn't be any recycling going on.

Can you share your kubectl port-forward command, along with where its invoked relative to the Lab and EG servers?

@Vishwajeet0510
Copy link
Author

Hello Kevin

The EG is running on a Kubernetes pod. I've started the Notebook server by setting the gateway URL as here and port forwarding the requests from local to EG. Reference here.

So, the kernel management is being handled by the EG server running on the Kubernetes pod and the requests to start/restart a kernel are being made from the Notebook server which is running locally.

## Setting the ENV for EG URL
JUPYTER_GATEWAY_URL=https://localhost:8888/

## Starting the Notebook server locally
(jeg_2.1_env) vervishw$ jupyter notebook --debug
.
[I 11:25:06.392 NotebookApp] Kernels will be managed by the Gateway server running at:
[I 11:25:06.392 NotebookApp] https://localhost:8888/
.

## Pod where EG is running
jeg-xxxx-xxxx-xxxx                               1/1     Running     0          10d

## Forwarding the requests to the above pod for kernel management
(jeg_2.1_env) vervishw$ kubectl port-forward jeg-xxxx-xxxx-xxxx 8888:<EG port>
Forwarding from 127.0.0.1:8888 -> <EG port>
Forwarding from [::1]:8888 -> <EG port>
Handling connection for 8888
.

@kevin-bates
Copy link
Member

Closing in favor of the analysis in #1051.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants