Protocol alignment #657

Merged (10 commits) on Feb 4, 2022

Conversation

davidbrochart
Contributor

@davidbrochart commented Jan 12, 2022

Towards a better format for the kernel wire protocol over websocket

Context

Jupyter web clients (e.g. JupyterLab) communicate with Jupyter kernels over websockets, but kernels natively use Zero-MQ sockets as their transport layer. The Jupyter server acts as a bridge between the two, forwarding messages between kernels (over Zero-MQ sockets) and web clients (over websocket):

kernel <-----> Jupyter server <-----> web client
          ^                      ^
          |                      |
   Zero-MQ sockets           websocket

Current situation

Kernel wire protocol over Zero-MQ sockets

The kernel wire protocol over Zero-MQ sockets is well specified. It takes advantage of multipart messages, allowing a message to be decomposed into parts that are sent and received unmerged. The following table shows the message format (the beginning has been omitted for clarity):

part:     ...  0       1              2         3        4         5         ...
content:  ...  header  parent_header  metadata  content  buffer_0  buffer_1  ...

Format of a kernel message over Zero-MQ sockets (indices refer to parts, not bytes).

The message can be deserialized simply by iterating over the parts:

{
    "header": header,
    "parent_header": parent_header,
    "metadata": metadata,
    "content": content,
    "buffers": [
        buffer_0,
        buffer_1,
        ...
    ]
}

Serializing is equally easy.
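As a minimal sketch of that iteration (hypothetical helper names; the signature and delimiter frames that precede the header are assumed to be already stripped):

```python
import json

def deserialize_zmq_parts(parts):
    """Assemble a kernel message from ZMQ multipart frames.

    `parts` holds the frames header, parent_header, metadata, content,
    followed by any binary buffers.
    """
    return {
        "header": json.loads(parts[0]),
        "parent_header": json.loads(parts[1]),
        "metadata": json.loads(parts[2]),
        "content": json.loads(parts[3]),
        "buffers": list(parts[4:]),
    }

# Example: a reply carrying two binary buffers.
frames = [
    b'{"msg_type": "execute_reply"}',
    b"{}",
    b"{}",
    b'{"status": "ok"}',
    b"\x00\x01",
    b"\x02\x03",
]
msg = deserialize_zmq_parts(frames)
```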

Kernel wire protocol over websocket

A kernel message is serialized over websocket as follows:

byte:     0         4         8         ...  offset_0  offset_1  offset_2  ...
content:  offset_0  offset_1  offset_2  ...  msg       buffer_0  buffer_1  ...

Format of a kernel message over websocket (indices refer to bytes).

  • offset_0: position of the kernel message (msg) from the beginning of this message, in bytes.
  • offset_1: position of the first binary buffer (buffer_0) from the beginning of this message, in bytes (optional).
  • offset_2: position of the second binary buffer (buffer_1) from the beginning of this message, in bytes (optional).
  • msg: the kernel message, excluding binary buffers, as UTF-8-encoded, stringified JSON.
  • buffer_0: first binary buffer (optional).
  • buffer_1: second binary buffer (optional).

The message can be deserialized by parsing msg as a JSON object (after decoding it to a string):

msg = {
    "channel": channel,
    "header": header,
    "parent_header": parent_header,
    "metadata": metadata,
    "content": content
}

The channel name is then retrieved, and the message is updated with the buffers, if any:

{
    "buffers": [
        buffer_0,
        buffer_1,
        ...
    ]
}
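A minimal deserialization sketch following the layout above (hypothetical helper; little-endian uint32 offsets are an assumption, since the table above does not pin down the integer width or endianness):

```python
import json
import struct

def deserialize_legacy_ws(data):
    """Split a legacy binary websocket frame into the JSON msg and buffers.

    offset_0 both locates msg and, assuming 4-byte offsets, tells us how
    many offsets there are (offset_0 // 4).
    """
    (offset_0,) = struct.unpack_from("<I", data, 0)
    n_offsets = offset_0 // 4
    offsets = list(struct.unpack_from(f"<{n_offsets}I", data, 0))
    offsets.append(len(data))  # sentinel, to slice the last buffer
    msg = json.loads(data[offsets[0]:offsets[1]].decode("utf-8"))
    msg["buffers"] = [data[offsets[i]:offsets[i + 1]] for i in range(1, n_offsets)]
    return msg

# Example: one JSON msg plus a single one-byte buffer.
msg_bytes = b'{"channel": "iopub", "header": {}}'
frame = struct.pack("<2I", 8, 8 + len(msg_bytes)) + msg_bytes + b"\xff"
decoded = deserialize_legacy_ws(frame)
```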

Problem

The problem is that msg is serialized/deserialized with a JSON parser, which can be very costly, since a lot of data may be embedded in e.g. the content field. For instance, widgets not using binary buffers might pass data through this field.
Instead, the kernel wire protocol over websocket should remain as close as possible to the kernel wire protocol over Zero-MQ sockets, so that going from one to the other essentially consists of moving blocks of raw memory, without any parsing involved.

Proposal

Messages transiting over websockets cannot be decomposed into parts, as is the case for Zero-MQ sockets. They are received as a whole, so the message must include its own layout description. This layout is still JSON-encoded, but it is so small and simple that we don't expect any performance loss. The new format of the kernel wire protocol over websocket is the following:

byte 0:                                  layout_length
byte 2:                                  layout
byte 2+layout_length:                    header
byte 2+layout_length+layout.offsets[0]:  parent_header
byte 2+layout_length+layout.offsets[1]:  metadata
byte 2+layout_length+layout.offsets[2]:  content
byte 2+layout_length+layout.offsets[3]:  buffer_0
byte 2+layout_length+layout.offsets[4]:  buffer_1
...

Format of a kernel message over websocket (indices refer to bytes).

Where:

layout = {
    "channel": channel,
    "offsets": [offset_0, offset_1, offset_2, ...]
}
  • channel is a string indicating the name of the Zero-MQ socket ("shell", "control" or "iopub").
  • offsets consists of a list of at least 3 values, corresponding to the offsets of parent_header, metadata and content, respectively (header implicitly has offset 0). If more values are provided, they correspond to the offsets of the optional binary buffers.
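A round-trip sketch of the proposed format (hypothetical helper names; a little-endian uint16 layout_length is an assumption, as the description only fixes its 2-byte width):

```python
import json
import struct

def serialize_proposed(channel, parts, buffers=()):
    """Pack channel plus raw message parts into the proposed frame.

    `parts` are the raw bytes of header, parent_header, metadata and
    content; they are copied verbatim, never parsed. header is implicitly
    at offset 0, so `offsets` starts with the offset of parent_header.
    """
    blocks = list(parts) + list(buffers)
    offsets, pos = [], 0
    for block in blocks:
        pos += len(block)
        offsets.append(pos)
    offsets = offsets[:-1]  # start offsets of every block after header
    layout = json.dumps({"channel": channel, "offsets": offsets}).encode("utf-8")
    return struct.pack("<H", len(layout)) + layout + b"".join(blocks)

def deserialize_proposed(frame):
    """Recover the channel, the four raw parts and the buffers."""
    (layout_length,) = struct.unpack_from("<H", frame, 0)
    body_start = 2 + layout_length
    layout = json.loads(frame[2:body_start].decode("utf-8"))
    bounds = [0] + layout["offsets"] + [len(frame) - body_start]
    blocks = [frame[body_start + a:body_start + b] for a, b in zip(bounds, bounds[1:])]
    return layout["channel"], blocks[:4], blocks[4:]

channel, parts, bufs = deserialize_proposed(
    serialize_proposed(
        "shell",
        [b'{"msg_type": "kernel_info_request"}', b"{}", b"{}", b"{}"],
        [b"\x00\x01"],
    )
)
```

Note how deserialization only slices the frame; the four parts can be handed to the Zero-MQ side as raw memory blocks.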

@echarles
Member

Thx for opening this @davidbrochart

Does this imply that the consumer of the kernel message over websocket (eg. jupyterlab) will have to be updated to take into account this change?

@davidbrochart
Contributor Author

davidbrochart commented Jan 12, 2022

This PR currently works with jupyterlab/jupyterlab#11841, by replacing the format of the kernel wire protocol over websocket with the proposal. However, we should probably provide the new protocol on a new websocket endpoint, since we should support both protocols. The web client could try the new endpoint and fall back to the "legacy" endpoint.
I'd like to discuss this PR tomorrow at the Jupyter server meeting. Any feedback is welcome here too.

@davidbrochart
Contributor Author

Does this imply that the consumer of the kernel message over websocket (eg. jupyterlab) will have to be updated to take into account this change?

Yes, see jupyterlab/jupyterlab#11841.

@echarles
Member

I am working on a server extension that relies on getting a JSON message from the kernel manager.

https://github.com/Quansight/jupyterlab-kernel-usage/blob/51a96679f1f47272434c28d69c2445ab417603a9/jupyterlab_kernel_usage/handlers.py#L16-L41

I am probably not the only one to rely on that.

Let's discuss tomorrow the potential impact for those kinds of scenarios.

@echarles
Member

Maybe this new protocol should be opt-in, rather than the default. I have not yet looked closely at the diff, but the opt-in option sounds safer to me.

@davidbrochart
Contributor Author

From what I can see in your server extension, you communicate with the kernel from the server:

kernel <-----> Jupyter server
          ^
          |
   Zero-MQ sockets

So you are not concerned with this change. Or am I missing something?

@echarles
Member

From what I can see in your server extension, you communicate with the kernel from the server:

Let me look more at those changes by tomorrow. The linked extension expects to get from the kernel manager a complete json response in the server handler.

@blink1073
Contributor

We could use a query parameter to select the protocol instead of having a new websocket endpoint

@blink1073
Contributor

blink1073 commented Jan 13, 2022

@vidartf mentioned the possibility today of using the websocket protocols field to select the protocol: https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/WebSocket https://html.spec.whatwg.org/multipage/web-sockets.html#the-websocket-interface. There was a question whether this could be made backwards-compatible, though. If that doesn't work for us, we could use a query parameter, with the absence of a query parameter implying the server default protocol. The protocol query parameter could be given as a string or a number.

I think in order to be backwards-compatible, the server should default to the current behavior if no protocol is selected. We could change the "default" on a major version bump, similar to how the default pickle protocol is handled in Python. A client that uses these protocols must be used with a version of server that supports them (i.e. JupyterLab would bump its server dependency).

Another option brought up is to use a versioned endpoint, like r"/api/v2/kernels/%s/channels".

@davidbrochart
Contributor Author

Another option brought up is to use a versioned endpoint, like r"/api/v2/kernels/%s/channels".

If that option was chosen, would it make sense to generalize it to all endpoints?

@blink1073
Contributor

If that option was chosen, would it make sense to generalize it to all endpoints?

Potentially, yes

@davidbrochart
Contributor Author

davidbrochart commented Jan 14, 2022

I went ahead and tried @vidartf's idea of using websocket protocols, and it works pretty well.
I thought we could use versions for the protocol names, so if Jupyter Lab requires protocol "0.0.1", then jupyter_server will use that.
By default, a web client doesn't require a protocol, and in this case jupyter_server falls back to the "legacy" protocol.
There is probably some code duplication that I can get rid of, but since it's working I wanted to push an update.
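The negotiation described here could be sketched as follows (hypothetical helper; in Tornado-based servers this kind of logic typically lives in the WebSocketHandler.select_subprotocol hook):

```python
def select_subprotocol(client_protocols, supported=("0.0.1",)):
    """Pick the first client-proposed subprotocol the server supports.

    Returning None means no subprotocol is selected, i.e. the server
    falls back to the legacy protocol.
    """
    for proto in client_protocols:
        if proto in supported:
            return proto
    return None

# A client requiring the new protocol gets it...
new_client = select_subprotocol(["0.0.1"])
# ...while a client that proposed nothing falls back to the legacy protocol.
legacy_client = select_subprotocol([])
```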

@blink1073
Contributor

Very nice!

@echarles
Member

I played a few months ago with that optional parameter: var aWebSocket = new WebSocket(url [, protocols]);

protocols Optional
Either a single protocol string or an array of protocol strings. These strings are used to indicate sub-protocols, so that a single server can implement multiple WebSocket sub-protocols (for example, you might want one server to be able to handle different types of interactions depending on the specified protocol).

(source https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/WebSocket)

My goal at that time was also to pass additional information to the server, although that information was not strictly a protocol.

If we expand the idea, the concept of capabilities instead of versions could be interesting to discuss.

In this case, the client could ask for a protocol which is just basic (as it is now), or basic,only_json_header (better name to be found...) - in that case, the client says it supports both the current protocol and the protocol-alignment one. The connection reply should contain the option chosen by the server.

The client could also just say only_json_header, and if the server supports that capability, the negotiation is done. If the client says nothing, the basic capability is chosen.

More capabilities could be added.

@davidbrochart
Contributor Author

Thanks Eric, that's interesting. My choice of a version number is just one idea.
I was thinking that when the new protocol is merged in jupyterlab/jupyterlab#11841, Jupyter Lab would require a jupyter_server that supports the new protocol. Note that Jupyter Lab would only support the new protocol, while jupyter_server would support the new protocol and the legacy protocol.
If we agree on this, then Jupyter Lab would not ask for any other protocol than the (only) one it supports, because it knows that the server also supports it.
Do you think the web client should also support several protocols?

@echarles
Member

If we agree on this, then Jupyter Lab would not ask for any other protocol than the (only) one it supports, because it knows that the server also supports it.
Do you think the web client should also support several protocols?

I understand that supporting the new protocol on the frontend is easier as you remove code and make it less work to maintain.

My conservative mind tells me that it would however be safer to add a new protocol, rather than dropping the existing one and replacing it with another. This is what I am used to seeing around.

I guess it would be good if JupyterLab can continue to support any server, the new ones but also any other one in the field that would not support the new capabilities.

Also I can imagine that deployments would pin jupyter-server to a version that would not support that new capability. This could be for extension compatibility reason, internal policy...

So having jupyterlab support all protocols would be an added value. The negotiation would be there to ensure the administrator of the server enforces what they want (the list of supported capabilities could be a trait).

@davidbrochart
Contributor Author

I pushed jupyterlab/jupyterlab@96b58b5.
The web client now supports the default protocol and the new one. I still think that protocol names should just be versions, so the new protocol is currently named '0.0.1'.
We first try to connect with our list of supported protocols, here ['0.0.1']. We must handle the case of a legacy server that doesn't handle subprotocols, which will return a protocol named ''. But we cannot have '' in our list of subprotocols, so the connection will fail. When we reconnect, we drop our list of subprotocols, so the connection will succeed, and we will use the default protocol.
If the server selects one of the protocols in our list at the first connection attempt, then the connection succeeds and we will use that protocol.

@echarles
Member

Thx for working on this @davidbrochart

I still think that protocol names should just be versions, so the new protocol is currently named '0.0.1'.

I was thinking that the protocol fields could be used to provide what the client supports, e.g. plain,binary_body,encryption,collaborative,add_yours, and the server would reply with the selection of what it supports (e.g. binary_body,add_yours). So the provided elements of the list would not be mutually exclusive, unlike in the version approach.

@davidbrochart
Contributor Author

I see, but it seems that this would require "stackable" serialization/deserialization. For instance, if you wanted to add encryption, it would add encryption on top of a non-encrypted serialized message? It would be like implementing layers in the sense of the OSI network model. While this sounds interesting, I'm not sure it's easy to implement.
For instance, currently the legacy protocol and the new protocol are not composable, they are quite incompatible and mutually exclusive.
But if it's just about renaming 0.0.1 into a more meaningful name, I have no strong opinion against that.

@echarles
Member

Yeah, the composable approach may be a bit too far-fetched, but it is still worth having a discussion about it.

If at the end we stick to the versioning approach, this is also great.

@codecov-commenter

codecov-commenter commented Jan 18, 2022

Codecov Report

Merging #657 (1da03d4) into main (d201529) will decrease coverage by 0.24%.
The diff coverage is 31.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #657      +/-   ##
==========================================
- Coverage   77.76%   77.51%   -0.25%     
==========================================
  Files         110      110              
  Lines       10415    10541     +126     
  Branches     1403     1426      +23     
==========================================
+ Hits         8099     8171      +72     
- Misses       1926     1970      +44     
- Partials      390      400      +10     
Impacted Files Coverage Δ
jupyter_server/base/zmqhandlers.py 52.97% <27.77%> (-6.12%) ⬇️
jupyter_server/services/kernels/handlers.py 57.98% <31.70%> (-2.07%) ⬇️
jupyter_server/serverapp.py 65.57% <100.00%> (+0.06%) ⬆️
jupyter_server/pytest_plugin.py 84.16% <0.00%> (-1.68%) ⬇️
jupyter_server/utils.py 65.90% <0.00%> (-0.57%) ⬇️
jupyter_server/services/contents/filemanager.py 70.22% <0.00%> (-0.34%) ⬇️
jupyter_server/tests/services/kernels/test_cull.py 100.00% <0.00%> (ø)
jupyter_server/tests/services/sessions/test_api.py 96.61% <0.00%> (+0.02%) ⬆️
jupyter_server/tests/services/kernels/test_api.py 96.52% <0.00%> (+1.60%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jasongrout
Contributor

I'm a bit reluctant to add that back to the new protocol, since it will require parsing the content, which is what we don't want to do.

Just to document the details here, I think what you are saying is:

  1. Going from zmq to websocket, there is no problem - you can just copy the message straight from zmq to the websocket without parsing it.
  2. However, going from the websocket to zmq, you would have to parse the message since the zmq protocol splits out various fields like the header, parent_header, etc.

Yeah, I think probably your other option sounds good:

I'm thinking that we could have a server configuration option to pass a protocol version to use. That way, we could always fall back to the legacy protocol if we want to have messages sent in JSON, for debugging purposes. How about that?

Nice! Perhaps that could also be automatically turned on with the existing --debug option.

@jasongrout
Contributor

FYI, list of websocket protocols Firefox understands and can show nicely: https://developer.mozilla.org/en-US/docs/Tools/Network_Monitor/Inspecting_web_sockets#supported_ws_protocols

@davidbrochart
Contributor Author

  1. Going from zmq to websocket, there is no problem - you can just copy the message straight from zmq to the websocket without parsing it.

There is no parsing but we need to build the JSON text like this (with the parts coming from ZMQ):

{
  "header": parts[0],
  "parent_header": parts[1],
  "metadata": parts[2],
  "content": parts[3]
}
  2. However, going from the websocket to zmq, you would have to parse the message since the zmq protocol splits out various fields like the header, parent_header, etc.

Yes, we would receive a (potentially big) JSON object that would need to be parsed to split the parts to send to ZMQ.
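A sketch of that zmq-to-websocket splice (hypothetical helper, assuming each ZMQ frame already holds valid JSON text):

```python
import json

def parts_to_json_bytes(parts):
    """Compose the websocket JSON body by splicing raw ZMQ frames.

    Each frame is already valid JSON, so we concatenate bytes directly
    instead of round-tripping through json.loads / json.dumps.
    """
    return (
        b'{"header": ' + parts[0]
        + b', "parent_header": ' + parts[1]
        + b', "metadata": ' + parts[2]
        + b', "content": ' + parts[3]
        + b"}"
    )

body = parts_to_json_bytes(
    [b'{"msg_type": "status"}', b"{}", b"{}", b'{"execution_state": "idle"}']
)
decoded = json.loads(body)  # only to check the splice; the server would not parse
```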

Nice! Perhaps that could also be automatically turned on with the existing --debug option.

But --debug would implicitly mean "use the legacy protocol"? I think it's better to have a more explicit option, here --kernel_ws_protocol="" to use the legacy protocol, knowing that this protocol has the ability to send messages as JSON on the websocket, and thus can be used for easier debugging.
Of course if the bug is in a new protocol (not the legacy one), then that would not be useful! But most of the time it will be about debugging something at a higher level, and we must ensure that all protocols work 100% at the message passing level.

@davidbrochart
Contributor Author

One remaining question is about rate limitation. Unfortunately, turning it on makes the new protocol almost useless, since the messages from ZMQ need to be parsed before deciding to let them through or not.
On the other hand, I'm not sure the filtering should be done in the server, since we want it to mostly be a pass-through. Another option would be to do it in the web client. IIRC, rate limitation was originally implemented to avoid flooding a notebook with a ton of output messages from other clients. JupyterLab could decide to drop messages, although it could not prevent them from transiting through the websocket.
Or we could even take the problem at its source, and do rate limitation in the kernel itself.

@maartenbreddels
Contributor

One remaining question is about rate limitation. Unfortunately, turning it on makes the new protocol almost useless, since the messages from ZMQ need to be parsed before deciding to let them through or not.

Good you brought this up. Two things

  1. I think also jupyter-widgets messages should use a similar approach to this PR, the widget protocol should not be embedded into the JSON of the jupyter protocol, we can add some light metadata, like what the messages is about (e.g. update). But the jupyter-widget messages data should go into a binary buffer, so that the server is not aware of what we send, and it does not need to decode it.
  2. rate limiting totally breaks widgets, causing the state to be out of sync. I don't think we should ever drop a message (for widgets). The TCP websocket will cause backpressure, which I think will eventually cause the kernel/zmq layer to drop messages (remember the high watermark issue we had a while ago with Voila: Many widgets fail to render voila-dashboards/voila#534). I think that is also a problem, but I don't think we need to solve it in the server. The front end would not even be aware of the dropped messages with the current implementation, so I'd vote for a default of off.

@davidbrochart
Contributor Author

  1. I think also jupyter-widgets messages should use a similar approach to this PR, the widget protocol should not be embedded into the JSON of the jupyter protocol, we can add some light metadata, like what the messages is about (e.g. update). But the jupyter-widget messages data should go into a binary buffer, so that the server is not aware of what we send, and it does not need to decode it.

But how to enforce the widget data to be sent in binary messages? It seems to be left to the widget developer's good will. Even the documentation seems to encourage the data to be sent in the Jupyter protocol layer: "each comm pair defines their own message specification implemented inside the data dict".

  2. rate limiting totally breaks widgets, causing the state to be out of sync. I don't think we should ever drop a message (for widgets). The TCP websocket will cause backpressure, which I think will eventually cause the kernel/zmq layer to drop messages (remember the high watermark issue we had a while ago with Voila: Many widgets fail to render voila-dashboards/voila#534). I think that is also a problem, but I don't think we need to solve it in the server. The front end would not even be aware of the dropped messages with the current implementation, so I'd vote for a default of off.

I hadn't realized rate limitation could also drop widget messages. Maybe we could let them through by including comm_msg in the list of unfiltered messages?
But having rate limitation turned off by default would be ideal for the new protocol anyway.

@jasongrout
Contributor

jasongrout commented Jan 30, 2022

But having rate limitation turned off by default would be ideal for the new protocol anyway.

I caution against turning rate limiting off by default without addressing the underlying issue. It was put in place to address a real pain point that was not uncommon (IIRC, overwhelming the communication channel).

On the other hand, perhaps the efficiency improvements this protocol brings help alleviate the problem?

@davidbrochart
Contributor Author

Actually, rate limitation would only impact jupyter_server, but not necessarily other Jupyter server implementations, like Jupyverse. It makes sense to have jupyter_server and Jupyverse follow different paths, so I'm fine having rate limitation turned on by default.
I can push a new commit that has limit_rate=True by default, but I think it's good to keep this option. Without it, one had to set both iopub_msg_rate_limit=0 and iopub_data_rate_limit=0 to disable rate limitation.

@SylvainCorlay
Contributor

Going from zmq to websocket, there is no problem - you can just copy the message straight from zmq to the websocket without parsing it.

There is no parsing but we need to build the JSON text like this (with the parts coming from ZMQ):

{
    "header": parts[0],
    "parent_header": parts[1],
    "metadata": parts[2],
    "content": parts[3]
}

Note: The current implementation does parse from zmq to websocket to compose the JSON in the form of dicts and lists. It could be smarter by manually creating the JSON string, but it does not do that.

@SylvainCorlay
Contributor

I think this is ready to go.

I would like to address a better way to do rate limitation at a later point. The current rate limitation is a bit clumsy in that it has to parse the messages it rate limits, because some messages (status idle messages) are treated in a special way to:

  • reset the counter, so that rate limitation does not impact long notebooks.
  • make sure that all idle messages reach the front-end, because current Jupyter front-ends use them to check that requests have been processed (which is probably wrong).

@jasongrout
Contributor

I think all of my concerns about the protocol have been resolved. Thanks again!

@jasongrout
Contributor

jasongrout commented Feb 2, 2022

Where are we planning to document this wire format?

@davidbrochart
Contributor Author

Probably in the jupyter_server documentation, rather than in the jupyter_client one?

@davidbrochart
Contributor Author

Should we merge this PR and open a new one for documentation?

@jasongrout
Contributor

jasongrout commented Feb 4, 2022

Should we merge this PR and open a new one for documentation?

That sounds great to me. Thanks again for all your work on this!

(Since it's such a big change, and I haven't been to Jupyter Server meetings recently, I don't plan to merge it myself, but I do give a +1 from my end to the current state of the protocol)
