-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WDDM metrics #817
WDDM metrics #817
Conversation
this adds a new collector named nvidia which uses the C nvml library to query and expose per process memory usage
to disable useless polling of wddm devices that will not report valid memory metrics, detect driver type and skip devices that are currently using wddm driver
* introduce total_process_gpu_memory_used (WDM only( / total_gpu_memory_used to expose per gpu and per process metrics
What is the best practice - should I squash the whole feature branch, merge it in my master, rebase it and then open a PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @rubu!
There's some code-organizational parts I've mentioned that we need to talk about. Also, I suppose our CI currently lacks a lot of libraries to build this. Any idea what we need to install?
#include "nvml.h" | ||
#include <windows.h> | ||
|
||
// definition of required import symbols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a pattern of putting Cgo-bits in the headers/
directory, and expose a "nice" Go interface from there, so the collector logic gets easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, i can rework that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, at this point the repo does not have a headers/
directory. Thus I have no idea what exactly should go in there. Can you give me a link to any other project/repo that has such a structure for an example? Sorry, I'm primarily a C++ dev and really do not know the guidelines for Go/C interop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, perhaps you need to rebase? Looks like this branch started off in February and it might not have existed then, but it does now. There's some background in this PR
const subsystem = "nvidia" | ||
|
||
if *processWhitelist == ".*" && *processBlacklist == "" { | ||
log.Warn("No filters specified for nvidia collector. This will generate a very large number of metrics!") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a copy/paste, or will there actually be a lot of metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not a copy/paste, since my initial thought was that basically any process that is using the GPU (thus for minimum any process that has a GUI) will be exposed here - ofc there won't be as many per process metrics as in the process collector, but the actual entries will be as many as there are processes using the GPU. Again, something we can discuss and work out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then no worries! It's just been a rather frequent occurrence historically that contributors start off by copy/pasting one of the collects that generate a lot of metrics and gets this log line by chance, so to speak. From your description it sounds warranted then, I wasn't sure how many matches it would result in.
|
||
return &nvidiaCollector{ | ||
TotalProcessGpuMemoryUsed: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, "total_process_gpu_memory_used"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is good to indicate the unit of measure in the name, so eg total_process_gpu_memory_used_bytes
(I guess)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok noted, will fix.
nil, | ||
), | ||
TotalGpuMemoryUsed: prometheus.NewDesc( | ||
prometheus.BuildFQName(Namespace, subsystem, "total_gpu_memory_used"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit here too
continue | ||
} | ||
if currentDriverModel == C.NVML_DRIVER_WDDM { | ||
log.Warnf("Gpu %s is using WWDM driver that does not allow collecting per process memory usage\n", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we just continue
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also the TotalGpuMemoryUsed, which does not care about WDDM / TCM mode. That is why the continue is not there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright!
) | ||
|
||
func init() { | ||
registerCollector("nvidia", newNvidiaCollector, "NVIDIA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't seem to unmarshal the NVIDIA
counters you request here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this is literally just an error, since I don't think I initially understood what that parameter does, will remove it.
@@ -2,11 +2,78 @@ | |||
|
|||
package collector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the amount of code and requirement for extra libraries, I'd be inclined to suggest we put the per-process GPU metrics in a separate collector. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, well, as for the extra libs, this only uses DXGI.lib, which is present in Windows/System32. The data comes from perflib, the extra code is just to match the GPU luids with the corresponding processes, so since I perceive the process collector as something that takes data from perflib, makes sense for me that this stuff lives here, but again we can talk about it.
@@ -2,11 +2,78 @@ | |||
|
|||
package collector | |||
|
|||
/* | |||
#cgo LDFLAGS: -lDXGI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned previously, this'd be nice to put in headers/ with a wrapper
@carlpett thanks for getting back, so let's try and resolve the points and I'll fix the stuff that needs fixing :) Since I already made some bugfixes it actually became annoying that I have to build this from source every time and redistribute the binary, if the proper way would be to use the official releases. Btw one thing I stumbled upon last week - the promu build command actually silently excluded all the cgo stuff since the yaml file was missing:
which I added. So up to this point windows_exporter did not anyhow depend on any C bindings? |
Hm. It does, eg for the |
Just a short question: is this pull request still active or not? |
@okopanja I didn't manage to refactor/split all the stuff, and it seems like there is not much need for it, so I can just keep it as is in my local fork. Are you interested in it or just tidying up? In case of the former, I can try and finally make everything nice and repush, otherwise I guess this can be tidied up. |
I'd love to see this PR pushed over the line |
@ChandonPierre Hey, sorry for abandoning work on this - simply lack of time and it was easier to live with a local fork since the WDDM/GPU metrics were all that I needed, but can try to find time and give the refactor another go, but I can't make any promises on the timeline. Anyone else maybe interested in helping out? |
"unsafe" | ||
|
||
"github.com/prometheus/client_golang/prometheus" | ||
"github.com/prometheus/common/log" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should be "github.com/prometheus-community/windows_exporter/log"
Hi @rubu - do you have some time to finish this up? I am interested in having some gpu metrics on windows endpoints. |
I'm opening a PR for feature that collects GPU metrics via WDDM in the perflib collector (initially the idea was to collect NVIDIA GPU metrics via NVML but that does not work with cards in WDDM mode, but that functionality is in this PR as well).