[WIP] Add method for kernel manager to retrieve statistics #407
Conversation
Thanks for the updates. I think kernel_stats() should return actual result values, not functions that produce the results later. That approach could be problematic depending on where the consumer of kernel_stats() resides (which, I'm assuming, may not be in the notebook process at all if this were to be plumbed to a handler).

What is the reason for returning functions instead of the snapshotted results? Is the idea that the callers would then have the particular functions in their hands and could call those methods at various intervals? If so, I think it's better to have the callers call the public kernel_stats() method periodically and leave the various functions that produce the results hidden from the caller.
jupyter_client/manager.py (Outdated)
        different stats about the kernel.
        """
        return {
            "memory_usage": self._memory_usage
This should return the results of _memory_usage(); otherwise we'll get into issues when the consumer tries to get at the memory usage. kernel_stats() should be the method that can be overridden, not _memory_usage(). This way subclasses could even extend the json results with their provider-specific details.
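For illustration, a sketch of the override pattern being described here, assuming the kernel_stats() method this PR adds; the subclass name and the provider field are hypothetical:

```python
from jupyter_client.manager import KernelManager

class RemoteKernelManager(KernelManager):
    """Hypothetical manager for a kernel running on a remote host."""

    def kernel_stats(self):
        # Start from the base json of snapshotted values...
        stats = super().kernel_stats()  # e.g. {"memory_usage": <bytes>}
        # ...and extend it with provider-specific details.
        stats["provider"] = {"endpoint": self._remote_endpoint}  # assumed attribute
        return stats
```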
@minrk thanks for the review!
I get what you are saying, BUT, a few things:

You said:

> That approach could be problematic depending on where the consumer of kernel_stats() resides

Could you give me an example of such a scenario?

You also said:

> callers would then have the particular functions in their hands and they can then just call those methods at various intervals? If so, I think it's better to have the callers call the public kernel_stats() method periodically and leave the various functions that produce the results hidden from the caller.

Yes, that was the intent of returning only the function object.

If the json that kernel_stats returns grows, the time it takes to collect all of the data will grow with it, so running all of the data collection as soon as the method is called will take longer and longer. Also, consider that different callers of the function will need different types of data: not all uses of kernel_stats will need all of the json values, meaning some of them are generated for nothing.

So if, for example, we need to check the kernel's RAM every few seconds, but we have also added other types of checks to the json, and they all run as soon as kernel_stats is called, we will spend time gathering unneeded data just to get that one value from the json, which will result in a performance decrease. I think letting the caller choose the kind of data they get from the function is much preferable in terms of performance.
@Hyaxia - I'm not sure if you're confusing me with Min, but, if so, I'm flattered! 😃

I apologize if this is getting convoluted, which it seems to me that it is, but part of my issue is understanding how the results of this function are actually consumed. I figured, given your original PR to notebook, that notebook would be calling this method, BUT that the call in Notebook would be performed on behalf of a web handler, so that the consumer of the notebook would actually trigger the stats gathering. So this would eventually be plumbed as a REST call that web clients invoke to get the kernel usage statistics - one of which is memory consumption.

So if kernel_stats returns a json of functions (and not the results of those functions), those "handles" only pertain to the notebook process itself. As a result, the web handler would need to invoke those functions, since that's the last chance for the json to provide meaningful context - the function handles' context will be lost otherwise. That's okay, but it strikes me as odd, since those functions could just as well be called within the kernel_stats() method itself.

I view this as a method that is called on the order of 5-10 second intervals (or even triggered by a manual "button" to get current stats), so the frequency is not very high in the grand scheme of things. However, to your point, no "category" of stats should take more than a few dozen milliseconds to capture. Even 100 ms is a long time, so I would hope that any other interesting categories are also inexpensive - and that should be a criterion for a category's inclusion in the json.

If we preserve the goal that none of the stats are expensive, then it should be fine to call this method at 5-10 second intervals w/o detriment. In addition, other implementations that MUST override kernel_stats() because the kernel is remote, etc. absolutely cannot return functions, since those functions do not apply to the caller's context. As a result, the boundary of work must be the public function - kernel_stats().

Btw, when remote access is required to get these stats, the latency of having to go remote in the first place argues that as much information as possible should be gathered at that time, since the cost of the stats call is far greater than the cost of gathering the different categories in the first place.

I hope that clarifies things a bit.
@kevin-bates haha sorry for the confusion 😆

> That's okay, but it strikes me as odd since those functions can be called within the kernel_stats() method itself.

I agree, it might also make things complicated 😐

> so the frequency is not very often in the grand scheme of things

This really depends on the use case.

> If we preserve that goal that none of the stats are expensive, then it should be fine to be able to call this method on 5-10 second intervals w/o detriment

In my opinion that is pretty restrictive. A lot of different types of data gathering might be added in the future, which means there might come a time when you would need to decide which data gathering to leave in and which to take out, because all of the functions inside kernel_stats will execute and take too much time.

> In addition, other implementations that MUST override kernel_stats() because the kernel is remote, etc. absolutely cannot return functions since those functions do not apply to the caller's context. As a result, the boundary of work must be the public function - kernel_stats().

Yes, I agree. Let me be a bit clearer: I meant that the user should also override the _memory_usage function. You wouldn't want to leave the implementation of that function as is and create a new one, leaving the old one standing there.

> Btw, when remote access is required to get these stats, the latency of having to go remote in the first place argues that as much information should be gathered at that time, since the cost of the stats call is far greater than the cost to gather the different categories in the first place.

I agree.

I thought about it some more; let me propose a new implementation. It is sort of a combination of the two approaches. The function kernel_stats will receive a list of strings, each string representing the type of data that the caller wants to get. Iterating over the different types, we execute the appropriate functions (see the sketch after this list). This way:

- The execution and gathering of the data happens inside the kernel_stats function.
- We execute only the functions that are necessary. This gives the caller much more flexibility, because they can specify exactly the data they need; and since the caller specifies what is collected, we no longer have the time restriction on the gathering functions that kernel_stats exposes. With no time limit, we can implement gathering functions that take much longer, leaving the caller to work out which intervals suit their use case.
- It is never a problem to add another kind of data gathering, since the caller is responsible for specifying their needs, and therefore for the time it takes to run the data gathering.
- The caller of the function must know what kind of data they are looking for.
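A minimal sketch of what I mean, assuming the private gatherers already exist (the dispatch table and names are illustrative):

```python
def kernel_stats(self, *stats):
    """Gather only the requested categories of kernel statistics."""
    # Map category names to the private functions that produce them.
    gatherers = {
        "memory_usage": self._memory_usage,
    }
    # Execute only the gatherers the caller asked for.
    return {key: gatherers[key]() for key in stats if key in gatherers}
```

A caller that only needs memory would then call `km.kernel_stats("memory_usage")` and pay only for that one collection.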
I will make a new commit soon.
@Hyaxia - this seems like a good approach - thank you.

I'm still unclear who/what the actual client/caller of this method is and how they ultimately get the data. I'm assuming this will become exposed in the api/kernel/{kid} endpoint somehow, or an api/kernel/{kid}/stats endpoint is added. As of now, this method is usable as a form of diagnostics that is added ad-hoc within the NB server code.

I think the list of values to gather will want to be exposed via a config parameter (e.g., --KernelManager.kernel_stats='memory_stats, latency'), but that can be added when a formal call to this method is implemented. In addition, there should be a way to skip this call - perhaps if the config parameter is not defined?

Since the categories are now specified, their names become more important. Looking at the command line example, it looks like memory_stats should become memory, since stats is redundant. Or the suffix could be finer grained - memory_vm, memory_rss, etc. - relative to which statistic is being returned. Just a thought.
Others can correct me if I'm wrong, but my understanding is that every sub-class of Configurable supports the ability to specify runtime options via the command line and/or a configuration file. The command line syntax is --<className>.<option>=<value> and the config file syntax is c.<className>.<option>=<value>. So for this, the option could be defined on KernelManager or MultiKernelManager (probably depending on where the kernel_stats() method will be called from) and look something akin to this code in Notebook.
    kernel_statistics = List(Unicode(), config=True,
        help=_("""Comma-separated list of supported categories to gather relative to the active kernel. No statistics are gathered if the option is not defined. Categories are: memory""")
    )
You can then specify validation methods to ensure the values are valid.
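For example, a sketch of such a validation method using traitlets' @validate decorator; the class here is a stand-in and the category set is assumed:

```python
from traitlets import List, TraitError, Unicode, validate
from traitlets.config import Configurable

class StatsConfigurable(Configurable):  # stand-in for KernelManager
    kernel_statistics = List(Unicode(), config=True,
        help="Categories of kernel statistics to gather. Categories are: memory")

    @validate("kernel_statistics")
    def _validate_kernel_statistics(self, proposal):
        supported = {"memory"}  # assumed set of supported categories
        unknown = set(proposal["value"]) - supported
        if unknown:
            raise TraitError("Unsupported kernel statistics categories: %s" % sorted(unknown))
        return proposal["value"]
```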
Let me see if I got it right. By adding the lines you wrote above,

    kernel_statistics = List(Unicode(), config=True,
        help=_("""Comma-separated list of supported categories to gather relative to the active kernel. No statistics are gathered if the option is not defined. Categories are: memory""")
    )

we will then check whether there is a configuration and, if there is, use it like the code below?

    if kernel_statistics:
        stats = kernel_statistics

If so, how do these configuration tools work together with the *stats parameter that kernel_stats receives? Won't there be a conflict between the two? I'm not so familiar with the concept 😀
Correct, although kernel_statistics (or whatever you call it) would be referenced in the code via self and essentially becomes the stats parameter indicating what to capture.

    if self.kernel_statistics and len(self.kernel_statistics) > 0:
        return km.kernel_stats(self.kernel_statistics)

You could also define a @validate method that is called to validate the option's value when encountered during application launch.
I've added the traitlet.
A question: how would you handle the exceptions that occur inside the kernel_stats function? ATM I'm simply logging them and stopping the execution, while returning the data that was collected so far.
Btw, I don't think there is a need for validation here, because we just skip any keys that are not found, and the validation of the traitlet being a list of strings is a default.
I agree. Logging and continuing to gather the rest of the keys on exceptions is probably the right thing here, since it's a background "thread" anyway. The important thing is to document the expected behavior for errors and unrecognized keys. You could return a <key>_errmsg field containing the applicable exception message, but we should see how this pans out first before going down that road.
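One hypothetical way to structure that log-and-continue behavior, including the optional <key>_errmsg field (the lookup-by-naming-convention is an assumption; the PR's actual helper may differ):

```python
def kernel_stats(self, *stats):
    """Gather the requested stats, logging and skipping any that fail."""
    results = {}
    for key in stats:
        gather = getattr(self, "_" + key, None)  # e.g. "memory_usage" -> self._memory_usage
        if gather is None:
            continue  # unrecognized keys are skipped
        try:
            results[key] = gather()
        except Exception as e:
            # Log, record the message, and keep gathering the remaining keys.
            self.log.exception("Failed to gather kernel stat %r", key)
            results[key + "_errmsg"] = str(e)
    return results
```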
jupyter_client/manager.py (Outdated)
    if stats and len(stats) > 0:
        return get_stats(stats)
    # If no input was passed but there is a default configuration
    elif self.kernel_statistics and len(self.kernel_statistics) > 0:
I don't think this elif is necessary. The caller should be using self.kernel_statistics as the stats parameter to this method call, so if stats is None or empty, then it seems like this method should return the empty set.

I suppose the other approach, now that kernel_statistics is defined, is to not have a stats parameter at all and use kernel_statistics entirely in this method. But having the stats parameter would allow callers to capture a subset of the supported categories that are specified in the configuration option.
I thought about it some more: why would it be better to have this trait? Won't it be better to just let the consumer of the function decide what they want to get, without starting to handle some sort of configuration? I see the reason for using a trait for something that stays fairly static, for example the notebook_dir that you linked the validation method to. But this method is supposed to be flexible; we should not restrict the user in choosing the data they wish to collect. By initializing kernel_statistics we make it harder for the consumer of the function to quickly and easily change the data they get.

> I don't think this elif is necessary. The caller should be using self.kernel_statistics as the stats parameter to this method call, so if stats is None or empty, then it seems like this method should return the empty set.

If we tell the user to call the function with kernel_statistics, then it becomes much less flexible, because for every little change we would have to change the trait, instead of just changing the input to the function (this also applies to new uses of the function). What are the pros of having such a trait? It seems restricting to me in this case, whereas *stats is not. The only use case where I found it useful is, as I wrote in the comment # If no input was passed but there is a default configuration, as a default config.
I'm sorry for the delayed response. I guess I'm having difficulty understanding how the consumer of this method is implemented. I was under the impression these stats would be called periodically, probably from the notebook server code. Since that code is baked into the server, how does the end user control or influence what stats are gathered? Nothing has been defined thus far that makes what is gathered quick and easy to change.

I figured that by exposing the "stat categories" via a configuration option, that would be how a given installation determines what stats are important and should be gathered. In addition, this trait also serves as an enablement switch, so that this functionality can be disabled (by default). I don't follow why an installation would want to gather different stats dynamically. What is gathered seems like a static thing to me.

At this point, I feel like things have gone sideways a bit, and I apologize if you're feeling frustrated. I believe the important things are that this functionality is disabled by default, that what is gathered can be controlled by an admin, and that all stats relative to a managed kernel are obtained in a single request. I believe the current incantation addresses those points, but if you'd prefer to take this a different way, that's fine as well.

Perhaps @rgbkrk could lend some direction.
@kevin-bates It's all good, I am all for having a debate over this. Let me explain how I see this functionality being used; maybe then you will see why I think letting this function be dynamic is important.

1. Notebook server code that calls the function. In this use case, there is a classic example that started this whole idea: the Prometheus metric for memory usage of the kernel. Imagine that we have different types of metrics that we want to let the Prometheus agent collect, and those metrics are exposed through this function. If you look at how Prometheus collects different metrics, you will see that it does not all happen in one place: different metrics are collected at different intervals, in different places in the notebook server code. Thus, we have to give the code in the notebook server the flexibility, the freedom, to choose what it wishes to collect.
2. A frontend button that, on a user click, returns a visual representation of the kernel statistics.

This freedom of choice is given by the *stats parameter. This is why I think that removing the *stats parameter would go against the main reason for all of this. From what I have seen, such flexibility is not wise to implement with traitlets configuration, which is supposed to be static as soon as you start the server.

Are we on the same page here? Or is there something still unclear about my perspective?
> From what I have seen such flexibility is not wise to implement with the configuration of traitlets, which are supposed to be static as soon as you start the server.

As an operator, I definitely prefer config traitlets that stay the same after starting the server.

I haven't read back into this PR much, but I'd say that we definitely want jupyter_client to behave more and more like a library (accepting options as objects passed in). Only applications should use traitlets for config. cc @minrk
> I'd say that we definitely want jupyter_client to behave more and more like a library (accepting options as objects passed in). Only applications should use traitlets for config

Cool, I agree. For a second, let's say I remove the traitlet from the code. We are left with the implementation of the method kernel_stats that accepts *stats as a parameter and returns the requested types of gathered data where possible.

Now, my last concern is that it might be better not to complicate things and just leave the memory_usage method as is, without encapsulating it inside the kernel_stats method. But I do see the positive side of having one method that encapsulates all of the gathering methods: it creates a single, well-defined way of getting the statistics you want from the kernel - a json whose contents you know to expect. It will also save lines of code.

It feels robust, which is good, but it comes at the cost of simplicity, in my opinion. There is some hidden mechanism inside the kernel_stats function, whereas with plain calls directly to the memory_usage method there isn't.

@rgbkrk, @kevin-bates, thoughts?
Thanks for your response(s), I found them helpful. The idea that jupyter_client is a library makes sense to me as well; I hadn't looked at it that way previously. Also helpful is understanding that the request to ultimately trigger the gathering of a kernel's statistics comes directly from the front-end. Your initial PR, as I recall, was calling the memory_stats method every half second or so.

I still feel that a generic method for collecting the kernel-specific stats is important. As I've mentioned previously, we don't want to be calling individual kernel-specific stats methods, since the call to each of those methods is expensive - particularly when we're talking about remote kernels and/or frequent intervals. A single method also makes overrides easier - in addition to the encapsulation it provides (both of which I view as positive).

What I'm still not following is how the Notebook user conveys the value of the *stats parameter. I think it might be better to let the method return what it can. In fact, some stats may not be applicable or even possible, depending on where the kernel resides, how it was launched, etc.

I viewed this feature more as an admin thing, where an admin sits at a dashboard viewing all the kernels (note that enterprise and kernel gateway configurations manage kernels on behalf of multiple users). In this case, we'd want more of a polling collection mechanism, so that when the admin requests stats, that REST call simply hits the MultiKernelManager (or subclass) and it walks the stats fields of each of the active KernelManager instances. Of course, even this could vary depending on how prometheus works. Either way, gathering of stats across multiple kernels should probably be an asynchronous mechanism. What about this approach...

By implementing this by hitting the Multi/MappingKernelManager, I believe we keep the door open for future considerations - like an overall dashboard across all kernels. If a polling mechanism is used to periodically gather stats across all kernels, we will need to be careful how long that takes, and it's probably a good idea to implement that endpoint as an async method if possible. This is also why a longer interval is recommended. At any rate, that's my (latest) two cents 😃 and Happy New Year! 🎉
@kevin-bates Thanks for the suggestion, and happy new year!
Yeah I agree; I remember now that we have also talked about this before, but there was so much we talked about that I had already forgotten some of it 😅
As I said, the frontend use case is only one of many others that might come, thus the function has to be as flexible as possible.
I disagree with this approach. With all this in mind, let me make a new commit.

The way it's implemented atm: the function …

@ivanov can you take a look here?
I'm really very sorry that I didn't see this earlier (and even more sorry for not making the time to push forward the earlier prototype I had for user-defined asynchronous metrics). This looks good to me - you've added synchronous polling of metrics to the kernel manager at the API level, without introducing any changes to the protocol. 👍
@ivanov Great! I am glad to hear that it looks good. Will commit the changes in a few days from now (hopefully no more than 3 days).
Awesome, thank you both.
It looks like something is making the tests fail - nothing that has anything to do with the changes I have made. @rgbkrk @ivanov it would be awesome if you could look again at the changes I have made. For any question regarding the design decisions, ask right away; I will answer as soon as I can.
I'm assuming you've rebased already. Does the error occur on …
@rgbkrk well from what I have seen, it happens in every PR that tries to do anything. In #415, it fails with …
Any addition of a new metric will not affect the `KernelManager` class at all (see how it's designed).
@rgbkrk btw, now I have done the rebase and it's ready to be merged, unless there is anything you want to address about the design.
The pytest issue is fixed in master if you want to rebase (sorry for asking again, but there are no conflicts, at least; we can rebase for you if you prefer). The added test fails after rebasing because psutil is not a dependency of jupyter-client yet, so this would either need to be added as a strict dependency, or this functionality made optional.

The use of ABC makes me think that these metrics are meant to be discoverable/extensible - is that true? Perhaps instead of using ABC subclasses, the listing should be discovered via entrypoints? That's how extension points are found elsewhere in Jupyter, and it's working rather well.
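To illustrate the entrypoints idea - the group name and the use of the entrypoints package are assumptions for this sketch:

```python
# A provider package would register its metric classes in setup.py, e.g.:
#
#     entry_points={
#         "jupyter_client.kernel_metrics": [  # hypothetical group name
#             "memory = mypackage.metrics:MemoryMetric",
#         ],
#     }
import entrypoints

def discover_metrics():
    """Return a mapping of metric name -> metric class for installed providers."""
    metrics = {}
    for name, ep in entrypoints.get_group_named("jupyter_client.kernel_metrics").items():
        try:
            metrics[name] = ep.load()
        except ImportError:
            continue  # a provider with missing optional deps (e.g. psutil) is skipped
    return metrics
```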
@minrk Thanks for the response. Now for the second part ->
That's okay, we'll just have to make sure that the metric is unavailable without an error if psutil is missing. That could be gating the declaration of the class on importing psutil, or perhaps better, allowing metrics to return or raise a special value, e.g. NotImplemented, or a special jupyter_client.metrics.NotAvailable when whatever's required is missing. If it's not a hard requirement, it should be added as a test requirement so that the tests can run.
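A sketch of that gating idea; NotAvailable and the metric class shape are hypothetical:

```python
class NotAvailable:
    """Hypothetical sentinel (e.g. jupyter_client.metrics.NotAvailable)."""

class MemoryMetric:
    name = "memory"

    def poll(self, pid):
        try:
            import psutil  # optional dependency: gate on the import
        except ImportError:
            return NotAvailable
        # Resident set size, in bytes, of the kernel process.
        return psutil.Process(pid).memory_info().rss
```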
This isn't true in practice, though, because they aren't recognized as subclasses until the classes have been defined, which means the modules must be imported first before discovery can find them. Entrypoints are discoverable as soon as they are installed, even if they aren't imported yet. That, combined with the fact that we are already using entrypoints for this kind of thing elsewhere (JupyterHub Spawners, nbconvert Exporters, etc.), makes me think that entrypoints might be a better mechanism. What do you think?

The last issue, I think, is that KernelManagers need to define their metrics classes somehow, or capabilities, at least. A metric that works with a pid isn't going to work with a remote KernelManager like Enterprise Gateway. Automatic discovery of kernel metrics can't be done for KernelManagers in general, since the metrics implementations are adding requirements on the up-to-now private implementation details of the KernelManager.
@minrk I am not sure how it works in practice; could you please point me to the implementations of entry points that you mentioned? And a clarification: … Thanks for the feedback!!!
Now the class KernelMetricStore acts as a store of all the metric classes that inherit from the `KernelMetric` class, so no matter where a new metric class is defined, the moment it is located it will be added to the `TYPES` dictionary of the KernelMetricStore class, and you will be able to poll it through the `kernel_metrics` function of the `KernelManager` class.
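A hedged sketch of how such automatic registration could work, e.g. with `__init_subclass__`; the exact mechanism in the commit may differ:

```python
class KernelMetricStore:
    TYPES = {}  # metric name -> metric class

class KernelMetric:
    name = None  # concrete subclasses set a unique metric name

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every subclass the moment its class body executes.
        if cls.name is not None:
            KernelMetricStore.TYPES[cls.name] = cls

    def poll(self, kernel_manager):
        raise NotImplementedError

# Defining a subclass anywhere is enough to make it pollable.
class MemoryMetric(KernelMetric):
    name = "memory"

    def poll(self, kernel_manager):
        import psutil
        # Assumes a local kernel whose process handle is kernel_manager.kernel.
        return psutil.Process(kernel_manager.kernel.pid).memory_info().rss
```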
Notifications from jupyter/notebook#3682 brought me back to this PR.
A new method called 'memory_usage' that checks the kernel RAM usage locally. Added a test for the method (definitely not sure if it's a good test; there might be a better check to perform) and inserted the psutil package as a new dependency for the project.

Background:
The reason I want to add this method to the kernel manager is that in the notebook project we wanted to create functionality that lets us query the RAM usage of the kernels that are active. In the PR jupyter/notebook#4075 it was brought to my attention that implementing it in notebook would break applications that use the kernels remotely. For that reason I have implemented the method for checking the RAM the way I did, importing the package 'psutil' inside the method, so if the implementation is swapped out and no longer uses the package, it won't bother anyone.
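For reference, a minimal sketch of the deferred-import approach described above (the exact method body in the PR may differ):

```python
def memory_usage(self):
    """Return the resident set size, in bytes, of the locally managed kernel process."""
    # psutil is imported here rather than at module level, so swapping this
    # implementation out later removes the dependency entirely.
    import psutil
    return psutil.Process(self.kernel.pid).memory_info().rss
```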