-
Notifications
You must be signed in to change notification settings - Fork 283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SUB sockets take time to subscribe to the IOPub channel and miss important messages #593
Comments
A reliable solution to this problem is to replace the The frontend could support both behaviors and select the right one based on a a new optional field in the kernelspec ( |
The kernel should not care if there are subscribers, so I don't think the XPUB change is appropriate (talking to a kernel with only a shell socket is totally valid, and having 0-many subscribers is also valid). I think we addressed this in nb classic (and nbconvert, I believe) by probing with a kernel_info_request and waiting for status messages. Only after IOPub connection is verified are requests actually sent. I'll try to find where it happens, and maybe some change resulted in losing that logic. If there are changes to make, they are likely in notebook/Jupyter_server/JupyterLab. It could be a performance regression in libzmq causing subscriptions to take a bit longer to register, if this is happening more in recent weeks than earlier. jupyter_client could also provide a wait_for_IOPub utility function. |
I don't think the XPUB change invalidates that behavior, it just helps the client knowing the IOPub connection is established, in case the client would be waiting for that before sending requests to the kernel.
The problem is that the idle status message (after the kernel_info_request has been responded) can be sent before the IOPub connection is established, and in this case the message is lost. The client will wait forever the status message (which is exactly the problem we have here).
Maybe there's a mechanism that I am nor aware of, but for me there is no reliable way to ensure the IOPub connection has been established other than receiving a message. In that case, the kernel should regularly send idle messages for clients connecting late. |
After an in-person meeting, the solution suggested by @minrk is to ensure the IOPub connection is established with the following procedure:
The advantage of this over the XPUB (and welcome message) is that it does not require changing all the kernels. Besides, implementing it in jupyter_server and making it not accepting websocket connections until IOPub connection is established will guarantee the fix for most of javascript clients. |
Btw, the classic notebook has exactly the same issue with "restart and run all". |
After a quick glance at the code, I think these are the two relevant places in jupyter_server:
(and similar places in the current |
I'm not sure if this will be found helpful, but I wanted to share my findings/observations (and I apologize for the length).
With Since this is not what the web client will receive anyway, it may not reflect the actual results - so I added the same logging to Here's the snippet I added to the channels handler in notebook and jupyter_server just prior to the call to the superclass. if msg_type == 'status':
execution_state = msg['content']['execution_state']
if 'msg_type' in msg['parent_header']:
request = msg['parent_header']['msg_type']
self.log.debug("handler - activity on %s: from %s - %s (%s)", self.kernel_id, request, msg_type,
execution_state)
else:
self.log.debug("handler - activity on %s: %s (%s)", self.kernel_id, msg_type, execution_state)
else:
self.log.debug("handler - activity on %s: %s", self.kernel_id, msg_type) FWIW - here's the applicable snippet I added to the activity tracking - in case you wanted to display the if msg_type == 'status':
kernel.execution_state = msg['content']['execution_state']
if 'msg_type' in msg['parent_header']:
request = msg['parent_header']['msg_type']
self.log.debug("activity on %s: from %s - %s (%s)", kernel_id, request, msg_type, kernel.execution_state)
else:
self.log.debug("activity on %s: %s (%s)", kernel_id, msg_type, kernel.execution_state)
else:
self.log.debug("activity on %s: %s", kernel_id, msg_type) If I'm understanding the issue, the issue is that the IOPub channel is not started/initialized and that we should ensure an At any rate, I'm hoping you might find the additional logging easier to troubleshoot with. |
Thanks for sharing your findings @kevin-bates. I think we understand the issue well now. We will be implementing a workaround in jupyter-client and jupyter-server shortly, and proposing a change in the protocol to address the issue in a more robust way. |
I have opened a couple of PRs to notebook and jupyter_server to nudge the kernel with kernel info requests until it starts receiving IOPub notifications. |
I was able to provoke this behavior more severely and reliably by artificially dropping the first N iopub messages from the kernel: minrk/ipykernel@3fdb063 then it's very easy to see what happens if it takes a ridiculous amount of time for subscriptions to propagate, and jupyter/notebook#5908 handles it gracefully! |
Just to follow up on this for people reading this discussion for the first time: please read the continued discussion on jupyter/enhancement-proposals#65, in which Min and others approve using XPUB in a JEP. |
Should we close this issue? The JEP has been approved, and the nudge mechanism has been implemented in jupyter_client. |
Fixed by #592 |
This occurs randomly in e.g.
JupyterLab when hitting the button to "restart the kernel, and re-run the whole notebook"
(On my setup roughly one time out of three). In the screenshot above, we see that only the IOPub stream messages from Cell 4 are received. Most typically, either all or non will be shown, but an intermediate result like that one also happens.
Inspecting the messages on the websocket shows that the idle status message with the parent header corresponding to the
shutdown_request
is the first IOPub message, but it is sent after the first three stream messages corresponding to the first cells. (This has been reported here: Jupyter Lab stalls when choosing "Restart Kernel and Run All Cells..." jupyterlab/jupyterlab#9008)Voilà, when running a dashboard
In that case, it is a more rare occurrence. Basically, the idle status message for the kernel info request is never received because it is sent before
jupyter_client
has subscribed to the IOPub channel! Since the connect method in JupyterLab waits for that idle message to resolve the promise, it hangs indefinitely...I am surprised that we did not hit this before and I wonder if this is a regression in
jupyter_client
or pyzmq, or if some changes in the relative timing exposed this race condition only recently.cc @minrk @jtpio @JohanMabille @afshin
EDIT:
Independent fixes wherever we subscribe to channels. The approach is to nudge the kernel with info request until we get something on IOPub.
The text was updated successfully, but these errors were encountered: