Skip to content
This repository has been archived by the owner on Dec 8, 2022. It is now read-only.

Attempt to re-establish websocket connection to KG #42

Merged
merged 1 commit into from
Jul 22, 2019

Conversation

esevan
Copy link
Contributor

@esevan esevan commented Jul 18, 2019

When nb2kg lost the connection to KG, nb2kg didn't connect to KG again
although the websocket connection from the client was still alive.

This change recovers the connection to KG to prevent above anomaly.

Signed-off-by: Eunsoo Park [email protected]

@esevan
Copy link
Contributor Author

esevan commented Jul 18, 2019

This may be related to #39

Sorry for no idea about how to make the test for this.
I attach manual test result instead.
Test Env
Cluster: kubernetes
Gateway: EG on Kubernetes
KG_REQUEST_TIMEOUT = KG_CONNECT_TIMEOUT = 10
Test Scenario

  1. Connect Python3 kernel
  2. Run the following code in a notebook cell.
from time import sleep
a = 0
for i in range(100):
    a = i
    print(a)
    sleep(1)
  1. Shutdown EG while running the above code.

Test Result
Session is recovered in around 1minutes, and kernel session restarts writing the output at the output area after recovery (See the sceenshot output jumping from 34 to 95)
image

@esevan esevan force-pushed the recover_session branch 3 times, most recently from d9dc86a to 1639002 Compare July 18, 2019 12:38
@meeseeksmachine
Copy link

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/why-does-nb2kg-connect-to-the-kernel-via-eg-rather-than-directly/1619/18

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good Evan - thank you!

@IMAM9AIS - are you able to take this for a spin in your env?

try:
message = yield self.ws.read_message()
except Exception as e:
self.log.error("Exception reading message from websocket: {}".format(e)) # , exc_info=True)
if message is None:
if not self.disconnected:
self.log.warning("Lost connection to KG: {}".format(self.kernel_id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change KG to Gateway to match others.

break
callback(message) # pass back to notebook client (see self.on_open and WebSocketChannelsHandler.open)
else: # ws cancelled - stop reading
break

if not self.disconnected: # NOTE(esevan): if websocket is not disconnected by client, try to reconnect.
self.log.info("Try to re-establish the connection to KG: {}".format(self.kernel_id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try -> Attempting, KG -> Gateway.

if self.ws is not None:
# Close connection
self.ws.close()
elif not self.ws_future.done():
# Cancel pending connection. Since future.cancel() is a noop on tornado, we'll track cancellation locally
self.ws_future.cancel()
self.ws_future_cancelled = True
self.log.debug("_disconnect: ws_future_cancelled: {}".format(self.ws_future_cancelled))
self.log.debug("_disconnect: ws_future_cancelled: {}".format(self.disconnected))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update the content here to be more like:
_disconnect: future cancelled, disconnected: {}

@IMAM9AIS
Copy link

@esevan @kevin-bates I am sorry i could not get back on this quickly.
We have been trying to debug around various points of nb2kg's code as well as EG's end.
Although we have not been able to figure out the reason why connection is lost after some time,
we do realise certain points :-

  1. When the connection is lost between NB2KG and EG, there is no on_connection_close() event being initiated by this close event(probably there is no close event being generated at all) which still keeps the state of connection unknown to NB2KG.

https://github.com/tornadoweb/tornado/blob/c92b883f806d99b35af52464980722c9c9a36a13/tornado/iostream.py#L536

This check still returns a false signal, due to which a write is finally triggered but since the stream is not available the final write operation does not work.

This is during the first shell message that you try to execute after the connection has been lost.

  1. Luckily after this happens, the state of stream gets updated, and the next time you try to execute any cell (trigger a shell message), the code finally hits this a Stream Closed Exception which is cached here :-

self.log.error("Exception writing message to websocket: {}".format(e)) # , exc_info=True)

But since the exception is gracefully handled nothing happens and the client is still unaware of the loss.

@IMAM9AIS
Copy link

@esevan I will try the changes you have done, but i am not sure that on_close() event here :-

def on_close(self):

will be called at all.

Also, from the changes you have done (from what i can understand), let's say even if the on_close() event is being generated, it still won't reconnect the sockets but just print the log to do it, which might not be user-friendly.

@IMAM9AIS
Copy link

Rather than doing this, I have another suggestion :-

What if while creating the connection between NB2KG and JEG, we also start a ping mechanism in the IOLoop, that pings every 1 min say. Now in scenario where we somehow lose a connection, the first ping will obviously do nothing but will help websocket realise that the stream is closed.

The second ping will raise an exception which will propagate back here :-

self.log.error("Exception writing message to websocket: {}".format(e)) # , exc_info=True)

But rather than handling this exception, if you raise it this actually triggers a websocket close event back to jupyter lab.

If we look into jupyter lab's websocket close event here :-

https://github.com/jupyterlab/jupyterlab/blob/c3d6801290fd5cdb527d1a9a198f97c19f890a1f/packages/services/src/kernel/default.ts#L1269

Rather the immediately closing the websocket, it tries to do reconnect attempts with increasing timeouts which will immediately fix the connection if it can be fixed with the network state.

When nb2kg lost the connection to KG, nb2kg didn't connect to KG again
although the websocket connection from the client was still alive.

This change recovers the connection to KG to prevent above anomaly.

Signed-off-by: Eunsoo Park <[email protected]>
@esevan esevan changed the title Try to re-establish websocket connection to KG Attempt to re-establish websocket connection to KG Jul 18, 2019
@esevan
Copy link
Contributor Author

esevan commented Jul 18, 2019

@kevin-bates Thanks for your review, kevin!

@IMAM9AIS Websocket connection closing event is also listened by read_message. It returns None if the connection between nb2kg and EG is closed. See more detail in https://www.tornadoweb.org/en/stable/websocket.html#client-side-support

The connection supports two styles of operation. In the coroutine style, the application typically calls read_message in a loop:

conn = yield websocket_connect(url)
while True:
    msg = yield conn.read_message()
    if msg is None: break
    # Do something with msg

In both styles, a message of None indicates that the connection has been closed.

One more thing for what it's worth, on_close callback is server-side websocket callback, which is the callback for browser-nb2kg websocket connection.

@IMAM9AIS
Copy link

@kevin-bates Thanks for your review, kevin!

@IMAM9AIS Websocket connection closing event is also listened by read_message. It returns None if the connection between nb2kg and EG is closed. See more detail in https://www.tornadoweb.org/en/stable/websocket.html#client-side-support

The connection supports two styles of operation. In the coroutine style, the application typically calls read_message in a loop:

conn = yield websocket_connect(url)
while True:
    msg = yield conn.read_message()
    if msg is None: break
    # Do something with msg

In both styles, a message of None indicates that the connection has been closed.

One more thing for what it's worth, on_close callback is server-side websocket callback, which is the callback for browser-nb2kg websocket connection.

@esevan Thanks a lot for clearing this. I will try out changes and will let you know.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants