
Many widgets fail to render #534

Closed
maartenbreddels opened this issue Jan 29, 2020 · 36 comments · Fixed by #535

Comments

@maartenbreddels
Member

cc @aschlaep
Eg.

import ipywidgets as widgets
widgets.Widget.close_all()
def recursive(n):
    if n == 0:
        return widgets.Button(description='Hi')
    else:
        cls = (widgets.VBox if n % 2 else widgets.HBox)
        return cls([recursive(n-1), recursive(n-1)])
display(recursive(9))
display(len(widgets.Widget.widgets))

This will often fail to render in Voila. I think it's likely a timing/ordering issue in the manager (a message received before an event handler is attached), or packets being dropped (IO rate limit?).

@maartenbreddels
Member Author

I think this is a good opportunity to refactor the Voila manager to allow progressive rendering of widgets (create widgets as they come in, and even try to render them as they come in). We could also think about connecting the websocket/kernel as early as possible, rather than waiting until execution is done.
I assume we can reuse quite a bit from the lab manager; I believe @vidartf and @jasongrout put quite some effort into that.

@jasongrout
Member

@maartenbreddels - do you want to do a video call to discuss this? It sounds like a tricky issue.

@maartenbreddels
Member Author

Notebook 6.0.2 seems to have the same issue: it worked once with n=9, but usually does not. JupyterLab 1.2.4 also does not render with n=8. So I don't think this is a Voila issue.

Voila sees the 2558 comms/widgets, requests all their info, but gets ~2300 replies.

> @maartenbreddels - do you want to do a video call to discuss this? It sounds like a tricky issue.

I'm on whereby!

@jasongrout
Member

jasongrout commented Jan 29, 2020

I wonder if the problem is overwhelming the websocket buffer. All of these requests are made simultaneously. I don't know what the websocket buffer size is, but perhaps it is overflowing and dropping messages or closing the connection?

From https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/send

> The WebSocket.send() method enqueues the specified data to be transmitted to the server over the WebSocket connection, increasing the value of bufferedAmount by the number of bytes needed to contain the data. If the data can't be sent (for example, because it needs to be buffered but the buffer is full), the socket is closed automatically.

Can you check what the websocket bufferedAmount is? See also https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/bufferedAmount

@jasongrout
Member

@maartenbreddels and I talked about this more, and it seems that the problem is here:

voila/js/src/manager.js

Lines 117 to 140 in 8557d9d

async _build_models() {
    const comm_ids = await this._get_comm_info();
    const models = {};
    const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {
        const comm = await this._create_comm(this.comm_target_name, comm_id);
        return this._update_comm(comm);
    }));
    await Promise.all(widgets_info.map(async (widget_info) => {
        const state = widget_info.msg.content.data.state;
        const modelPromise = this.new_model({
                model_name: state._model_name,
                model_module: state._model_module,
                model_module_version: state._model_module_version,
                comm: widget_info.comm,
            },
            state
        );
        const model = await modelPromise;
        models[model.model_id] = model;
        return modelPromise;
    }));
    return models;
}

In particular, for his reproduction example, this gets back 2558 comm ids:

const comm_ids = await this._get_comm_info();

But this Promise.all only succeeds for 2300 or so widget state updates, and then just hangs waiting for the rest of the widget state updates:

voila/js/src/manager.js

Lines 120 to 123 in 8557d9d

const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {
    const comm = await this._create_comm(this.comm_target_name, comm_id);
    return this._update_comm(comm);
}));

He can reproduce this sort of issue in the classic Notebook and JupyterLab by creating all these widgets and then refreshing the page (which triggers the same sort of logic to request all widget state from the kernel to populate the client-side widget manager).

@jasongrout
Member

jasongrout commented Jan 29, 2020

I think the key is to first narrow down if the problem is in getting the update requests to the kernel, in the kernel, or getting the updates from the kernel. @maartenbreddels decided to instrument the kernel to track these state requests, to see if the kernel is getting all the requests, and if it thinks it sent all the updates. That should narrow down the problem.

One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:

const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {

@jasongrout
Member

> Voila sees the 2558 comms/widgets, requests all their info, but gets ~2300 replies.

Since @maartenbreddels asked - here is where 2558 comes from...

  • 2**9 Buttons
  • 2**8 + 2**7 + 2**6 + ... + 2**0 = 2**9-1 HBoxes and VBoxes
  • 2**9 Style widgets, one for each button
  • 2**9+2**9-1 Layout widgets, one for each hbox, vbox, and button

(2**9 + 2**9 - 1)*2 + 2**9 = 2558 :)
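
The same arithmetic, spelled out in Python as a quick sanity check:

n = 9
buttons = 2 ** n              # leaf Buttons
boxes = 2 ** n - 1            # HBoxes + VBoxes (internal nodes of the binary tree)
styles = buttons              # one ButtonStyle per Button
layouts = buttons + boxes     # one Layout per button, hbox, and vbox
print(buttons + boxes + styles + layouts)  # 2558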

@martinRenou
Member

martinRenou commented Jan 30, 2020

The comms are limited to 1000 msgs/sec by default in the Notebook. This might be the reason for this issue.

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

@maartenbreddels
Member Author

> Since @maartenbreddels asked - here is where 2558 comes from...

Thanks, that helped me sleep :)

Ok, I've narrowed down the issue a bit, by patching jupyter_server like this:

$ git diff jupyter_server/services/kernels/handlers.py                                                                                                                               base
diff --git a/jupyter_server/services/kernels/handlers.py b/jupyter_server/services/kernels/handlers.py
index 23d763a84..97fb88f43 100644
--- a/jupyter_server/services/kernels/handlers.py
+++ b/jupyter_server/services/kernels/handlers.py
@@ -211,6 +211,9 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
         # Allows you to specify that the byte count should be lowered
         # by a delta amount at some point in the future.
         self._iopub_window_byte_queue = []
+        import collections
+        self.counters = collections.Counter()
+        self.request_order = {}

     @gen.coroutine
     def pre_get(self):
@@ -301,6 +304,11 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
         else:
             msg = json.loads(msg)
         channel = msg.pop('channel', None)
+        content = msg.get('content', {})
+        if channel == 'shell' and content.get('data', {}).get('method') == 'request_state':
+            self.request_order[content['comm_id']] = self.counters['request_state']
+            self.counters['request_state'] += 1
+            print(self.counters)
         if channel is None:
             self.log.warning("No channel specified, assuming shell: %s", msg)
             channel = 'shell'
@@ -329,6 +337,13 @@ class ZMQChannelsHandler(AuthenticatedZMQStreamHandler):
             self.write_message(json.dumps(msg, default=date_default))
         channel = getattr(stream, 'channel', None)
         msg_type = msg['header']['msg_type']
+        content = msg.get('content', {})
+        if channel == 'iopub' and content.get('data', {}).get('method') == 'update':
+            self.counters['update'] += 1
+            print(self.counters)
+            del self.request_order[content['comm_id']]
+            if self.counters['update'] > 1800:
+                print("missing", self.request_order)

         if channel == 'iopub' and msg_type == 'status' and msg['content'].get('execution_state') == 'idle':
             # reset rate limit counter on status=idle,

I can conclude the following:

  • we get 2558 request_state messages, so we get all websocket sends from the browser.
  • we get < 2558 update messages, so either the kernel does not receive them (not the case, see below), or the server does not receive them (either the kernel does not send a reply, or the reply does not get through).

I've analyzed the request_order dict, and in 1 run, the first 1165 request_state calls got answered, the next 130 got dropped, the next 500 (or 501) got answered, then 459 dropped, and the last 305 dropped.

from the kernel

I monkey patched ipykernel's Comm class to see what we do on the kernel side:

Counter({'update': 2558, 'request_state': 2558})

So we get all requests, and we send all the updates, but they don't arrive at the server.
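
For reference, a rough sketch of that kind of Comm monkey patch (hypothetical, not the exact code used; it assumes ipykernel's Comm.handle_msg(msg) and Comm.send(data=None, metadata=None, buffers=None) entry points):

import collections
from ipykernel.comm import Comm

counters = collections.Counter()
_orig_handle_msg = Comm.handle_msg
_orig_send = Comm.send

def handle_msg(self, msg):
    # Incoming comm_msg from the frontend, e.g. a widget state request.
    if msg['content']['data'].get('method') == 'request_state':
        counters['request_state'] += 1
        print(counters)
    return _orig_handle_msg(self, msg)

def send(self, data=None, metadata=None, buffers=None):
    # Outgoing comm_msg to the frontend, e.g. a widget state update.
    if isinstance(data, dict) and data.get('method') == 'update':
        counters['update'] += 1
        print(counters)
    return _orig_send(self, data=data, metadata=metadata, buffers=buffers)

Comm.handle_msg = handle_msg
Comm.send = send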

@maartenbreddels
Member Author

> The comms are limited to 1000 msgs/sec by default in the Notebook. This might be the reason for this issue.

Good idea, but I don't see any warning (it could always end up in /dev/null though). Also, the tracking of the reply happens before the rate limiting (i.e. the limiting is between the frontend and the server, and I already lose the message between the kernel and the server).

@jasongrout
Member

The iopub rate limiting is at the server on messages going from the kernel to the browser. That seems to be where you are dropping messages? Can you up the rate limits to see if it solves the issue, just in case?

@jasongrout
Member

To up the limit to 10x its default: jupyter lab --LabApp.iopub_msg_rate_limit=10000
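
The same setting can also go in a config file; a minimal sketch for the classic notebook server, assuming the standard jupyter_notebook_config.py:

# jupyter_notebook_config.py
c.NotebookApp.iopub_msg_rate_limit = 10000  # default: 1000 msgs/sec
c.NotebookApp.rate_limit_window = 3.0       # default: 3 seconds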

@maartenbreddels
Member Author

I see

  • 2558 request_state messages going from browser → server
  • 2558 request_state messages arriving at the kernel
  • 2558 update calls from the kernel
  • < 2558 update messages received at the server.

I intercept the messages at the server at:
https://github.com/jupyter/jupyter_server/blob/7a6cd8e5dece1a8084f6cf065188c6a004f0a38a/jupyter_server/services/kernels/handlers.py#L317

So I think this is before any rate-limiting is applied.

Which means either the zmq layer does not send the packets, or the server does not receive them.

@maartenbreddels
Member Author

kernel = get_ipython().kernel
kernel.iopub_socket.io_thread.socket.get_hwm()

Gives us 1000, meaning it will buffer 1000 messages before dropping them, as explained at http://api.zeromq.org/2-1:zmq-setsockopt (ZMQ_HWM: high water mark).

Patching ipykernel and removing that limit with self.iopub_socket.set_hwm(0) makes Voila get all the messages 🎉. But this call needs to happen before binding the socket, so I don't think a notebook user can do it.
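
To illustrate the constraint with plain pyzmq (a minimal sketch, not the actual ipykernel patch): the high water mark has to be set before bind() for it to take effect.

import zmq

ctx = zmq.Context.instance()
socket = ctx.socket(zmq.PUB)         # ipykernel's iopub socket is a PUB socket
socket.set_hwm(0)                    # 0 = no limit; must happen before bind()
socket.bind('tcp://127.0.0.1:5556')  # arbitrary example address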

@jasongrout
Member

jasongrout commented Jan 30, 2020

ah, nice!

By the way, here is a nice small program for spying on a specific kernel directly, in case you want to do something like count the iopub messages coming directly from the kernel zmq socket. Use it like python ./client.py <kernel-id>

from collections import Counter
from jupyter_client import BlockingKernelClient
from jupyter_core import paths
import os.path as path
import sys

connection_file = path.join(paths.jupyter_runtime_dir(), 'kernel-{}.json'.format(sys.argv[1]))
print(connection_file)

client = BlockingKernelClient()
client.load_connection_file(connection_file)
client.start_channels()
counter = Counter()

while True:
    msg = client.get_iopub_msg(timeout=100)
    counter[msg['msg_type']] += 1
    print(counter)

@jasongrout
Member

> Patching ipykernel and removing that limit with self.iopub_socket.set_hwm(0) makes Voila get all the messages 🎉. But this call needs to happen before binding the socket, so I don't think a notebook user can do it.

Where is the high water mark set? ZMQ apparently defaults to no limit, according to the docs you linked to.

@maartenbreddels
Member Author

Yeah, I'm puzzled by that as well. ipython/ipython#3304 says v3 sets it to 1000 by default; I am not sure what v4 does, digging into this now.

@martinRenou
Member

> Patching ipykernel and removing that limit with self.iopub_socket.set_hwm(0) makes Voila get all the messages 🎉. But this call needs to happen before binding the socket, so I don't think a notebook user can do it.

Do we really want to patch this? What was the original reason for limiting it to 1000 msgs/sec?

If we change this here, I guess we also need to change the iopub_msg_rate_limit to the same value in JupyterLab, Jupyter Notebook and all other front-ends.

@jasongrout
Member

> Do we really want to patch this? What was the original reason for limiting it to 1000 msgs/sec?

I think our first fix is just batching our update requests to try to be under this limit, which is a change to the widget manager.

@maartenbreddels
Member Author

I'm not saying we want to do that, but we want to understand the issue, and maybe have workarounds/configurations for people that need it.

@maartenbreddels
Member Author

> I think our first fix is just batching our update requests to try to be under this limit, which is a change to the widget manager.

I think this is an interesting idea; are you saying we should change the widget protocol?

@jasongrout
Member

jasongrout commented Jan 30, 2020

No, just that we should do what I mentioned above in #534 (comment):

> One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:
>
> const widgets_info = await Promise.all(Object.keys(comm_ids).map(async (comm_id) => {

@maartenbreddels
Member Author

ZMQ_SNDHWM at http://api.zeromq.org/4-3:zmq-setsockopt says 1000, so indeed. Should we discuss this at ipykernel and open an issue there?

@maartenbreddels
Member Author

> One other quick fix might be to just break up Object.keys in this Promise.all so that it actually just iterates over, say, 500 widgets at a time:

Ah, now I understand, that's a great idea! Or, have at most 500 unanswered requests, using a Semaphore-like mechanism.

@jasongrout
Member

According to the zmq page:

> ØMQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 90% lower depending on the flow of messages on the socket. The socket may even be able to accept more messages than the ZMQ_SNDHWM threshold; a notable example is for sockets using TCP transport; see zmq_tcp(7).

So actually the message queue may be far smaller than 1000. Wow, sounds like we either need to be quite conservative, or raise the limit. Yes, let's start an issue on ipykernel and CC @minrk.

@jasongrout
Member

> Ah, now I understand, that's a great idea! Or, have at most 500 unanswered requests, using a Semaphore-like mechanism.

I was thinking of just putting it in a loop and awaiting the resolutions in batches, perhaps batches of 100, given that the hwm limit may actually translate to something like 100 messages in certain cases.
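
The same batching idea, sketched in Python with asyncio for illustration (the actual change would live in the JS widget manager; fetch_state here is a hypothetical placeholder for "create the comm and request its state"):

import asyncio

async def fetch_state(comm_id):
    """Hypothetical placeholder: create the comm and await its state update."""
    ...

async def fetch_all(comm_ids, batch_size=100):
    results = []
    # Await the requests in fixed-size batches so that at most
    # batch_size replies are in flight at any time.
    for i in range(0, len(comm_ids), batch_size):
        batch = comm_ids[i:i + batch_size]
        results.extend(await asyncio.gather(*(fetch_state(c) for c in batch)))
    return results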

jasongrout added a commit to jasongrout/ipywidgets that referenced this issue Jan 30, 2020
…age limit in the kernel.

Current ZMQ by default limits the kernel’s iopub message send queue to at most 1000 messages (and the real limit can be much lower, see ZMQ_SNDHWM at http://api.zeromq.org/4-3:zmq-setsockopt). We now request comm state in batches to avoid this limit.


See voila-dashboards/voila#534 for more details, including where this was causing real problems.
@jasongrout
Member

I posted a PR to jupyter-widgets/ipywidgets#2765

@TristanKnox

Was this issue fixed? I am currently having an issue where a notebook that has multiple tabs, each with many widgets and some Plotly plots, will only sometimes render when served with Voila.
If I run voila with --debug, I get hung up here:

'''
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: comm_msg
[Voila] activity on 472fba6d-78c7-4670-a6c0-d271f82afbff: status
'''
Any suggestions?

@maartenbreddels
Member Author

We're currently fixing this in #546. It was fixed only in the 0.1.18.x branch.

@maartenbreddels
Member Author

This is now fixed in voila 0.1.21, let us know if you still have issues!

@TristanKnox

This seems to have made a difference. So far, it has not failed to render.

jtpio pushed a commit to jtpio/ipywidgets that referenced this issue Mar 26, 2021
…age limit in the kernel.

Current ZMQ by default limits the kernel’s iopub message send queue to at most 1000 messages (and the real limit can be much lower, see ZMQ_SNDHWM at http://api.zeromq.org/4-3:zmq-setsockopt). We now request comm state in batches to avoid this limit.


See voila-dashboards/voila#534 for more details, including where this was causing real problems.
@tilusnet

tilusnet commented Feb 23, 2022

Hi.

I am posting here to ask for advice.

I seem to be having widget refreshing issues similar to those described in this ticket; however, the problems are strongly correlated with having started Voila with preheated kernels.

How would you diagnose it?

Using voila 0.3.1.

@trungleduc
Member

Hello @tilusnet, can you open another ticket and provide us with a minimal example to reproduce the issue?

@tilusnet

Hi @trungleduc

Sadly it's not something consistently reproducible.
And as I said earlier, I don't know what to measure and how. I only notice symptoms of failed widget refresh.

What I can tell you about my setup:

  • the page is rather widget-heavy, though with respect to refresh rates, all widgets (like progress bars) are limited to one refresh per 0.5 seconds. I am using tqdm for progress bars.
  • I can confirm that preheating a page does not involve any progress bar refresh, since these are triggered on button clicks.

@tilusnet

tilusnet commented Mar 12, 2022

FYI @trungleduc -
The widget refresh issues I mentioned earlier seem to be strongly linked to using ipywidgets' .capture() decorator.

Someone could test as per the documentation example here:
https://ipywidgets.readthedocs.io/en/stable/examples/Output%20Widget.html#Debugging-errors-in-callbacks-with-the-output-widget

Another contributing factor: activation of preheated kernels.

@maartenbreddels
Member Author

#1101 could be it; it certainly is a bug that can cause the page to fail to render.
