Enterprise Gateway server stops responding after receiving FD added twice error #1047
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗 |
We need to narrow this down. There is too much going on here to adequately comment, but let me try.
I'm unable to comment on the FD-related stuff (nor have I ever seen these kinds of errors) but would ask that you capture logs with DEBUG enabled since it appears that area of the code does log some debug statements.
Regarding this entry, by port-forward the requests are you merely referring to the fact that you've configured your notebook server to forward kernel requests to the EG using
Is this practical!? Obviously, the restart is going to fail if the shutdown completes so are you trying to introduce a race condition here? What is the purpose or use-case going on?? I don't have a 2.1.0 environment handy, can you try this with EG 2.6.0? |
Hello Kevin
Thanks |
Thanks for the updated information and DEBUG output. Unfortunately, I don't really have ideas simply because I haven't seen this before and don't have the ability to try to reproduce this.
I'm still a little confused here. Are you using If you find moving to 2.6 is too much of a change, it might be good to first move to 2.2.0 instead. This is the first release in which async kernel management is available. It seems like this could be a race condition with port termination, but I'm not certain. Prior to shutdown/restart, are your kernels fully starting? I still think this exercise is impractical. What outcome do you want - the kernel is shut down or the kernel is restarted? Can this be reproduced using only restart? Thanks. |
Hello Kevin. The notebook server is using the same as you said, just like it is mentioned here. Sorry if there has been any confusion. We believe that the kernel should always get started. The issue can be observed in both kernel starts and restarts. Let's say while starting/restarting a kernel we received
The reproduction steps are only meant to reproduce the error, as we have seen this issue intermittently (there is no fixed way of reproducing the scenario). |
I understand. Where I'm having difficulty is why you are seeing this and others are not. Are you suggesting this is only because you're pounding the shutdown/restart of the same kernel? Based on the stack traces, the FD is probably associated with one of the 5 ZMQ ports but, in EG, we always generate a new set of ports and the (remote) kernel always provides those ports on restarts, so there shouldn't be any recycling going on. Can you share your |
Hello Kevin The EG is running on a Kubernetes pod. I've started the Notebook server by setting the gateway URL as here and port forwarding the requests from local to EG. Reference here. So, the kernel management is being handled by the EG server running on the Kubernetes pod and the requests to start/restart a kernel are being made from the Notebook server which is running locally.
Closing in favor of the analysis in #1051. |
Description
We are using Jupyter Enterprise Gateway (v2.1.0), which runs on a Kubernetes pod. After a kernel is started, the UI tries to establish WebSocket connections, and ZMQ streams are created to handle I/O with the kernel.
These use file descriptors (FDs) to manage the socket connections.
However, when we shut down a kernel and then restart it from the Notebook UI, the kernel may be assigned an FD that is still in use by another kernel, or that was not unregistered for some reason when the previous kernel was shut down. In that case, a
raise ValueError("fd %s added twice" % fd)
exception is thrown. Further, we see that the Notebook UI server keeps sending WebSocket open requests again and again to create the connection, and every request fails with the same "FD added twice" error. Because JEG is flooded with these requests continuously, it becomes completely unresponsive.
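The failure mode described above can be illustrated with a minimal sketch. The `HandlerRegistry` class and the FD number 42 are hypothetical and only stand in for whatever component keeps the FD-to-handler table; this is not JEG code, and the missed `remove_handler` call is an assumption about the cause.

```python
class HandlerRegistry:
    """Hypothetical stand-in for an event loop's fd -> handler table."""

    def __init__(self):
        self.handlers = {}

    def add_handler(self, fd, handler):
        # Refuse to register an fd that already has a handler.
        if fd in self.handlers:
            raise ValueError("fd %s added twice" % fd)
        self.handlers[fd] = handler

    def remove_handler(self, fd):
        # Unregister the fd; a no-op if it was never registered.
        self.handlers.pop(fd, None)


registry = HandlerRegistry()
registry.add_handler(42, "old kernel stream")

# Assumed scenario: the old kernel is shut down and the OS frees fd 42,
# but remove_handler(42) is never called, so the stale entry remains.
try:
    registry.add_handler(42, "new kernel stream")
    message = None
except ValueError as exc:
    message = str(exc)
print(message)

# Once the stale entry is removed, the same fd number registers cleanly.
registry.remove_handler(42)
registry.add_handler(42, "new kernel stream")
```

In this sketch, every retry that lands on the same FD number fails in the same way until the stale entry is removed, which would match the repeated failures we see on each WebSocket open request.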
Jupyter Enterprise Gateway Logs (Current Behavior)
Even if the exception is handled properly, we still observe the same behavior.
Moreover, we see that sometimes the FD is present in the PID directory and sometimes it isn't, i.e. when
ValueError: fd xx added twice
is observed.
When stopping the Notebook server, some FDs get unregistered
Notebook Server Logs
JEG Logs
Since the FD causing the error has been unregistered, the JEG server becomes usable again. However, if the FD causing the error does not get unregistered when the Notebook server is restarted, the JEG server remains unresponsive.
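For anyone wanting to repeat the check of whether a given FD is still present for the JEG process, here is a rough sketch. It assumes a Linux host where open FDs are listed under `/proc/<pid>/fd`; the helper names `open_fds` and `fd_is_open` are made up for illustration.

```python
import os


def open_fds(pid):
    """Return the sorted FD numbers listed for process `pid`.

    Assumes Linux's /proc/<pid>/fd layout; returns [] if the directory
    is missing or unreadable (other OS, no such pid, no permission).
    """
    fd_dir = "/proc/%d/fd" % pid
    try:
        names = os.listdir(fd_dir)
    except OSError:
        return []
    # When inspecting our own process, the listing may also include the
    # temporary fd used to read the directory itself.
    return sorted(int(name) for name in names if name.isdigit())


def fd_is_open(pid, fd):
    """True if `fd` currently appears in the process's FD listing."""
    return fd in open_fds(pid)


if __name__ == "__main__":
    print(open_fds(os.getpid()))
```

Running `fd_is_open(jeg_pid, xx)` right after the error appears, with `jeg_pid` and `xx` taken from the logs, would show whether the FD named in the exception is still open at the OS level or only left over in the handler table.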
Reproduction Steps
Expected behavior
Workaround
The only workaround is to restart the JEG server manually.
Context
CC : @kevin-bates @rahul26goyal