-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix buffered message replay #4110
Conversation
Use the kernel manager's session key as the replay "key" since this value is persistent across connections. The current approach was to base the replay off the connection's session key, but that is updated on each connection. As a result, buffered message replays never occurred. Note that this change only permits the replay of buffered messages. Their display back to the user is a function of the how the front-end application determines what should be displayed. For Notebook front-ends, replayed messages still will not be displayed, although they're now available for display. Fixes jupyter#4105
I think it must be the session key of the frontend, because message buffering has to be tied to the frontend that lost its connection. Another frontend cannot consume a buffer for the frontend that initiated the buffering by disconnecting, so before replaying the buffer, we have to check that the reconnecting frontend is actually the same frontend, not a new one. If it's a new one, we should discard the buffer. Do you have an example where the wrong thing is happening? |
Thanks for the response - it's good to be able to talk about this. I've provided an example with log outputs at the end of this comment, but first I think we need to be careful about how the frontend is determined. For example, in the cases of the gateway servers (Kernel and Enterprise), the frontend is typically a remote notebook server (with its own idea of session management). However, the gateways support kernel management directly via REST whose clients aren't necessarily running Jupyter code. As a result, I believe its important to determine whether a given buffered message should be replayed based on server-side context. Now whether that context comes in some form from the client or is inferred by something from the original connection (that exists in the context of subsequent connections) is the key (sorry, pun intended). FYI, session information in the REST api used by the Gateway's is currently not required, but we could perhaps require it for replays and applications would need to be made aware of that. Then the "servers" extract it from the request and use that value as the replay key. Question. Is it possible for the same kernel instance to have multiple client connections within the same Notebook server instance (i.e., where the buffering is occurring)? As best as I can tell from my adventures in the kernel management stack, that is not the case, yet I get the impression that is the case or at least a goal (which seems reasonable). It seems that "session" is commonly used throughout the code so that might be adding to the confusion. There are session instances on the kernel manager and session instances in the websocket handler. Seems like the latter's member could be renamed to imply a physical connection "session". At any rate, since the entire idea of this feature is to replay messages from one physical websocket connection to another, then using an element unique to each of those connections to determine replay is not sufficient. In this case, that element was the ZMQChannelsHandler's session's session member. I chose to use the kernel manager's session's key value (and probably could have chosen the kernel manager's session's session member) since the kernel manager instance survives physical connection changes. Here are log outputs with DEBUG enabled. I've removed the timestamp and application indicator to reduce horizontal scrolling. I've also interspersed comments.
I hope this helps - thanks for looking. |
Yes. This is why the key must come from the client, to distinguish between clients.
Yes, this value is set by the client at connection time, and is unique per connection (Handler instance). "Resuming" connections means a reconnect (new Handler) with the same session id. I think the right thing to do is for kernel gateway to set session id on its connections so that it can resume playback. |
Thank you for the response. Sorry, but I need to persist a bit more here. Please note this isn't just a kernel_gateway way issue. All of the tracing and information I have provided is relative to Notebook - and others have confirmed. The Kernel Gateway item I brought up is the fact that REST clients of Kernel Gateway (those not using NB2KG) do not currently set any session information - so I was open to requiring that for those types of clients. However, KG/EG clients using NB2KG have the same exact problem - and I was trying to point out that we should keep physically detached scenarios in mind. Please enable DEBUG and try this. The code you point out will produce a new value for each time a tab is opened for the same notebook "session". (Could you please provide the location where that value is set on the client-side?) I'm not sure if the There are two I think there needs to be a way to better identify an actual notebook session that spans tab invocations within the same browser (although ideally it would be nice to not require the same browser session for this). |
I'm not finding anything in the runtime classes that can be used which spans the connection sessions but isn't scoped to the kernel. I don't know the front-end enough to determine what kind of things persist across connections (i.e., "tab invocations"), but reflects that particular client. Once something is identified that makes replays work in the standard notebook env, I will likely need to make it work relative to the gateway projects. (Hint: solution should use a method that can be overridden which produces the 'replay key') In the meantime, I'm inclined to recommend we change the default value of |
The session id is set to a UUID on instantiation of the Kernel object, and used here. It is purely page-local state.
Yes, and indeed it must do this, because the session id and replay buffer must be different for each connected tab, because each tab is a different client. This cannot be shared across client connections, even within the same browser. It is important to clarify that the buffer is per client connection, so a new client (a new tab is a new client, a page reload is a new client) must not get the replay buffer from a previous session.
That's correct. Because closing and reopening tabs must not resume the replay buffer, it doesn't test this. It tests the network drop case, which is the only case that is meant to be addressed by this logic, at least with the current notebook javascript. A new tab might resume it if were a more stateful application that knew how to handle messages from the previous connection (e.g. document/message state stored in localStorage or the client application), then it should also include the session id so that it can resume playback and identify itself as the same client reconnecting instead of a new client. But the way the notebook application currently works, opening a new tab doesn't have the information necessary to process resumed messages (it needs the msg_id->cell mapping, among other things), it always gets a new session.
From the logs and annotations, everything is working correctly:
Not for the default notebook server because it solves a specific and longstanding problem in the most common case: browsers whose connection is interrupted, but maybe this makes sense for the different use cases served by KG. WiFi problems, sleep/wake, etc. can all result in this issue and be helped by the current behavior with the default setup of the notebook server. This does point to a change we probably should make: if session_id is not specified by the client, we shouldn't bother buffering because there's no way to reconnect to that session.
I'm not sure what "physically detached scenarios" means here, can you elaborate? To summarize:
Can you summarize the cases where KG wants replay to behave differently (i.e. steal replay from other clients, reconnect without setting session id)? |
@minrk - thank you very much for your detailed response. It definitely helps my understanding. We want KG/EG to behave the same way - and they do - its just that my understanding of 'client session' was skewed. Knowing that a different tab instance is considered a different client helps tremendously. However, what I'm still struggling with, is how to get the buffered messages to be replayed? If I merely disconnect the network, messages that occurred during the disconnect are indeed sent back to that notebook tab upon the network's reconnection, but NONE of the buffering logic (or replay) is performed. Its like that buffering was performed (and "replayed") at a lower level - and since its the same websocket, none of the buffering logic should come into play anyway. Can you please describe what steps to take to cause this replay logic to be triggered? Are there any plans to equate a user session to something akin to the token or something that spans tab invocations, but you know its the same "user"? |
Not at the moment, but only because there isn't anything (yet) for which that information would be used. With JupyterHub, the cookie (or token) identifies the user, but this spans browsers and doesn't distinguish between the same account being used in multiple browsers. It would not be appropriate to use
When testing this functionality, I used these nginx and apache configs run in docker containers. The docker container proxies to my local notebook server. I can stop the proxy and restart it to drop and restore connections between the browser and the notebook server. The test process:
I ran this test just now (using nginx) with current master, and it worked as expected, dropping no messages. Note that testing via disconnecting wifi on a remote machine can have mixed results depending on the networking infrastructure, as the buffering relies on the server-side socket close event, which may take some time to register, resulting in some missed messages in the intervening window between the client connection being dropped and the server being told it happened. On my office wifi, this can take up to 60 seconds to register, so messages delivered in that window are lost, but messages after the server has been informed of the disconnect are properly buffered. The way to do rigorous buffering is for the server to always maintain a lookback buffer for every connected websocket and for reconnects to be made with a "last seen message" id to know where to replay from. But the memory consequences of this are too severe to be done in-memory, so it would have to use sqlite or some such out-of-memory storage to be feasible. |
@minrk - thank you for your response. And you confirmed via the Notebook server logging that the buffering logic actually took place? I definitely see replay occurring (with the delay you describe) when dropping and reinstating network connections but it is NOT exercising the code implemented in kernelmanager.py. If you could provide the notebook logging that shows the messages were replayed, that would be greatly appreciated. Thank you for your time on this. I know my persistence is probably frustrating. I agree about the in-memory buffering and is an area of concern for the gateway projects since they service kernel management ops across multiple notebook instances. |
yes. Default logs:
Debug logs:
No worries! I'm happy to figure this out and find what works best. |
Awesome - thank you for the update! I guess I'm not reproducing the correct behavior appropriately, but its great to see it in action. I will close this PR and corresponding issue. To summarize: the code that currently exists for buffering and replay of messages corresponding to the same session works as designed. The common misunderstanding being that closing a browser tab and activating another against the same notebook of the same browser instance is viewed as a different session against which buffered messages will not be replayed, but instead discarded. And that's exactly the behavior I see. Thank you for the clarification. |
Use the kernel manager's session key as the replay "key" since this value is
persistent across connections. The current approach was to base the replay
off the connection's session key, but that is updated on each connection. As
a result, buffered message replays never occurred.
Note that this change only permits the replay of buffered messages. Their
display back to the user is a function of the how the front-end application
determines what should be displayed. For Notebook front-ends, replayed
messages still will not be displayed, although they're now available for
display.
Fixes #4105