buffer messages when websocket connection is interrupted #2871
Conversation
D'oh. Of course. This is per-connection, and it goes away when the connection is severed.
I'm really unsure of how I could work around this to buffer messages to replay to clients when they reconnect.
Yeah, it would need to go somewhere else. Right now, we create a zmq socket per websocket connection. That zmq socket is destroyed when the websocket connection is lost. What we would need to do is:
I think this is going to be tricky without moving the document state to the server, but may still be worth exploring.
Yeah, I still want that in the long term; it's hard to keep promising that in the short term, though.
Yikes, alright. Do we just keep one extra zmq socket around then, in that case? I see now how I was spinning some wheels hopelessly -- the zmq connection is created here when the websocket connection is created.
For the lost connection case (not new tab, new browser, etc.), replay should be fine, and all we need to track is the session_id, which is what identifies a browser session. This should work:
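To make that concrete, here is a minimal sketch of what the handler side could look like, assuming hypothetical `session_id`, `kernel_id`, and `start_buffering` names:

```python
# Minimal sketch (assumed names): key the buffer by the browser session so
# that only a reconnect from the same session gets the replay.
class ZMQChannelsHandlerSketch(object):

    @property
    def session_key(self):
        # The frontend supplies a session_id when it opens the websocket;
        # pairing it with the kernel_id identifies "this notebook, this browser".
        return '%s:%s' % (self.kernel_id, self.session_id)

    def on_close(self):
        # Websocket dropped: hand the live ZMQ streams to the kernel manager
        # so it can keep collecting messages for this session.
        self.kernel_manager.start_buffering(
            self.kernel_id, self.session_key, self.channels,
        )
```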
Saving the hairy cell:msg_id problem for another task makes sense to me. We can focus on the lost connection case, which will improve several real use cases.
Based on conversations with Min, I'm going to take a new stab at this within the
As you implement this particular mechanism, you may also want to think about concurrent multi-user notebooks in this form:
We're going to tackle multi-user concurrent notebooks with server-side state of the notebook, which is not done today. This is only a band-aid. Check out https://github.com/jupyterlab/jupyterlab-google-drive for the current alpha for realtime.
I've pushed an implementation that works. It's still an unbounded list.
Things we could do:
- buffer is per-kernel
- session_key is stored because only a single session can resume the buffer and we can't be sure
- on any new connection to a kernel, the buffer is flushed. If session_key matches, it is replayed. Otherwise, it is discarded.
- buffer is an unbounded list for now
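A simplified sketch of the buffering behavior those bullets describe, assuming it lives on the kernel manager; the names and structure here are illustrative guesses, not the actual diff:

```python
from functools import partial

# Illustrative sketch (assumed names) of the per-kernel buffer described above.
class BufferingKernelManagerSketch(object):

    def __init__(self):
        # kernel_id -> {'session_key': str, 'buffer': [(channel, msg_list), ...]}
        self._kernel_buffers = {}

    def start_buffering(self, kernel_id, session_key, channels):
        """Start collecting messages for a kernel whose websocket went away."""
        buffer_info = self._kernel_buffers[kernel_id] = {
            'session_key': session_key,
            'buffer': [],  # unbounded list for now
        }
        # Reuse the existing ZMQ streams instead of opening new connections,
        # redirecting their callbacks into the buffer.
        for channel, stream in channels.items():
            stream.on_recv(partial(self._buffer_msg, buffer_info, channel))

    def _buffer_msg(self, buffer_info, channel, msg_list):
        buffer_info['buffer'].append((channel, msg_list))

    def stop_buffering(self, kernel_id, session_key):
        """Flush the buffer on a new connection to this kernel.

        Replay only if the same session reconnects; otherwise discard.
        """
        buffer_info = self._kernel_buffers.pop(kernel_id, None)
        if buffer_info is None or buffer_info['session_key'] != session_key:
            return []
        return buffer_info['buffer']
```

On reconnect, the handler would call `stop_buffering` with its own session key and replay whatever comes back, which is what the diff excerpt below does.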
rather than establishing new connections fixes failure to resume shell channel
```python
def open(self, kernel_id):
    super(ZMQChannelsHandler, self).open()
    self.kernel_manager.notify_connect(kernel_id)

    # on new connections, flush the message buffer
    replay_buffer = self.kernel_manager.stop_buffering(kernel_id, self.session_key)
```
Ok, cool, you made the kernel manager dictate stopping the buffering.
```python
self.log.info("Replaying %s buffered messages", len(replay_buffer))
for channel, msg_list in replay_buffer:
    stream = self.channels[channel]
    self._on_zmq_reply(stream, msg_list)
```
What should we do if we fail during the replay?
Should we skip this test for now?
No, I think it's a real bug.
instead of in `create_stream`, which is not called on reconnect
- dismiss 'connection lost' dialog on reconnect
- set busy status on reconnect (if not busy, idle will come soon after via kernel_ready)
Bug fixed. I also improved the dialog/status handling when the connection is lost.
This is working pretty well. Even for a flaky VPN + wifi dropoff, I'm only losing one message within a one-second gap.
What if this were implemented as a response to a message from the frontend that said "hey, I was disconnected but I'm back, here's the last msg_id I got, please send the rest"?
Oooooh, I like that. I mean, it will be a bit strange as we'll be putting a new API on top of the current API over the websockets though.
@blink1073 that's definitely possible and would be more robust. To do that, we would need to maintain a cache of messages on all channels all the time, in case of dropped connections in the future, rather than only when connections are lost. Doing that points to a different mechanism, because it makes it inevitable that the cache gets huge, rather than possible. To do that, we need to spill to disk with an efficient culling mechanism. To me, that sounds like a job for sqlite. If I were doing that, I would make last_msg_id part of the websocket connect request to trigger the replay.
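Purely as an illustration of that direction (nothing like this exists in the PR or the notebook server), a sqlite-backed cache could record every message with a sequence number, cull old rows, and replay everything after the client's `last_msg_id`:

```python
import json
import sqlite3

# Hypothetical sketch of a spill-to-disk message cache with culling.
class MessageCacheSketch(object):

    def __init__(self, path, max_rows=10000):
        self.db = sqlite3.connect(path)
        self.max_rows = max_rows
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " kernel_id TEXT, channel TEXT, msg_id TEXT, msg TEXT)"
        )

    def record(self, kernel_id, channel, msg):
        """Store every message as it passes through, culling old rows."""
        self.db.execute(
            "INSERT INTO messages (kernel_id, channel, msg_id, msg)"
            " VALUES (?, ?, ?, ?)",
            (kernel_id, channel, msg['header']['msg_id'], json.dumps(msg)),
        )
        self.db.execute(
            "DELETE FROM messages WHERE seq <="
            " (SELECT MAX(seq) FROM messages) - ?", (self.max_rows,)
        )
        self.db.commit()

    def replay_since(self, kernel_id, last_msg_id):
        """Return messages newer than the client's last seen msg_id."""
        row = self.db.execute(
            "SELECT seq FROM messages WHERE kernel_id = ? AND msg_id = ?",
            (kernel_id, last_msg_id),
        ).fetchone()
        since = row[0] if row else 0
        return [
            (channel, json.loads(msg)) for channel, msg in self.db.execute(
                "SELECT channel, msg FROM messages"
                " WHERE kernel_id = ? AND seq > ? ORDER BY seq",
                (kernel_id, since),
            )
        ]
```

An in-memory variant would just be `sqlite3.connect(':memory:')`, though that gives up the spill-to-disk point discussed above.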
If you use sqlite for messages, I would suggest the default sqlite database be an in-memory database.
I think this is probably good to go ahead with for the time being, as further enhancements are going to require a lot more coordination between the frontends and the server.
@alexgarel I think it can't be in-memory by default because the main point of using sqlite would be to limit how much memory the message cache will take by spilling to disk.
@rgbkrk & @minrk - First, sorry for posting this to a closed PR, but it has all the necessary context. Second, I apologize for my long-windedness.

TL;DR: In looking at the buffering code, it seems like buffer replay never comes into play (at this time).

I needed to look into buffering support relative to a recent NB2KG issue. Since NB2KG proxies the notebook's kernel management to a gateway server (Kernel Gateway or Enterprise Gateway), I was concerned that buffering replay wouldn't work. However, after looking into this more deeply, I find these behaviors between traditional notebooks and nb2kg-enabled notebooks to be the same. I also find that when the network connection between the browser and notebook is dropped, the "replay" as indicated by the video above is not taking place from the buffering code, but rather from ZMQ directly (at least that's my assumption). I say this because I don't see any debug entries about buffering messages and replaying or discarding them. Instead, those messages are produced only after one closes the notebook browser tab, then opens that same active notebook again (which Kyle mentions above as well in describing when the PR changes take place). As a result, I don't understand when the buffering replay will actually occur, since it seems like re-opening an active notebook triggers the creation of a new session, which indirectly is part of the key (along with kernel_id) into the replayability of buffered messages.

Do I have the correct understanding with respect to the state of buffering and its replay? If not, under what circumstances does the implemented buffering get replayed? Could you explain why session is part of the key for determining whether the buffered messages are to be replayed or not? I'm guessing it represents the "channel instances", since different connections to the same kernel instance will use different channels - is that right? If so, would it be possible to equate the channel "sets" from the previous (buffered) session to the new session and drop the use of session_key? Thank you for your time and increasing my understanding.
They are attached to a session because otherwise the frontend has no way of associating output messages (by parent msg_id) with the cells that produced them.
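For reference, this is roughly how that association appears in a Jupyter protocol message (values below are made up): outputs are matched to the cells that produced them via `parent_header.msg_id`, and the session fields identify who sent the originating request.

```python
# Abridged Jupyter protocol message with made-up values: outputs are routed
# back to cells via parent_header.msg_id, and 'session' identifies the sender.
iopub_output = {
    'header': {
        'msg_id': 'out-123',            # id of this output message
        'msg_type': 'execute_result',
        'session': 'kernel-session',
    },
    'parent_header': {
        'msg_id': 'req-456',            # the execute_request this answers
        'msg_type': 'execute_request',
        'session': 'browser-session',   # the browser session that sent it
    },
    'content': {'data': {'text/plain': '42'}, 'execution_count': 3},
}
```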
Thanks Kyle. So I think buffered message usage from NB2KG is doubly screwed (once replay is figured out) because the session_key generated in the Gateway server will be different from that generated in the (client) Notebook server where the messages are derived. Is this session key the same value as the session entry in each message (or can it be derived from them)? If so, it would be nice if the message buffer initialized the session key in that manner. Any hints on how to trigger buffer replays (and not discards)? If not, I'll post in gitter. Thanks.
I'm never in gitter, hopefully you'll find someone to help. I'd have to sit down again with the raw session messages to know for sure. Happy to meet about this at some point in the coming weeks (I'm out of town at the moment).
Could it be associated with a notebook session id (instead of a connection) that would be generated on the client side (as a document DOM attribute) and given at websocket opening?
If the session id you're referring to is the one in the

Seems like there needs to be some tolerance where the received messages can 'switch' to a different "channel set" in cases where the previous 'set' is no longer around.
Does this sound related to jupyter/jupyter#83? |
Yes, I believe the issue you reference is the portion of this PR that is not addressed. That is, when a notebook is closed and re-opened, buffered messages received in the interim are discarded on the re-open due to it being a different WS connection. I took a run at this via #4105/#4110 but later learned the current implementation is working as designed. (I hadn't been able to get any messages to replay correctly, but Min's evidence on the corresponding PR showed it's possible, so I dropped "my case".)
The only way you can get messages to replay is if you lose connection, primarily on a remote server. |
If I put my laptop in hibernate with an open jupyterlab tab with a running notebook inside and then wake up the laptop again, no messages are replayed at all, and it doesn't receive new messages either before I actively tell it to run a new cell. In effect this means that if I am, e.g., running a cell that takes two days to complete, then I cannot see any results produced after I put the computer in hibernate. So the only solution is to try and stay connected for as long as possible, or the silly way: opening a remote desktop to the jupyter server, opening a browser on the remote, and running the notebooks like that. Any solution expected soon?
This adds an (unbounded) queue to the `ZMQChannelsHandler` to replay messages when a user reconnects. This assists those who lose a connection and keep their tab open. It does not help with the general problem of wanting a long-running notebook to be saved in the background.