WDDM metrics #817

Closed · wants to merge 10 commits

Conversation

@rubu commented Jul 14, 2021

I'm opening a PR for a feature that collects GPU metrics via WDDM in the perflib collector. (Initially the idea was to collect NVIDIA GPU metrics via NVML, but that does not work with cards in WDDM mode; that functionality is included in this PR as well.)

* this adds a new collector named nvidia which uses the C NVML library to query and expose per-process memory usage
* to disable useless polling of WDDM devices that will not report valid memory metrics, detect the driver type and skip devices that are currently using the WDDM driver
* skip device if driver model cannot be determined
* initialize pid memory map
* introduce total_process_gpu_memory_used (WDM only) / total_gpu_memory_used to expose per-GPU and per-process metrics
@rubu requested a review from a team as a code owner on July 14, 2021 14:50
@rubu (Author) commented Jul 14, 2021

What is the best practice? Should I squash the whole feature branch, merge it into my master, rebase it, and then open a PR?

@carlpett (Collaborator) left a comment

Thanks for the PR @rubu!
There are some code-organization points I've mentioned that we need to talk about. Also, I suppose our CI currently lacks a lot of the libraries needed to build this. Any idea what we need to install?

collector/nvidia.go
#include "nvml.h"
#include <windows.h>

// definition of required import symbols
Collaborator

We have a pattern of putting cgo bits in the headers/ directory and exposing a "nice" Go interface from there, so the collector logic gets easier to read.
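
Purely for illustration, a minimal sketch of that layout, assuming a hypothetical headers/nvml package; the file path, function name, and link flag below are placeholders, not taken from this PR:

```go
// headers/nvml/nvml.go (hypothetical path, shown only to illustrate the pattern)
package nvml

/*
#cgo LDFLAGS: -lnvml
#include "nvml.h"
*/
import "C"

import "fmt"

// DeviceCount hides the raw cgo call behind a plain Go function, so the
// collector never has to import "C" itself. It assumes nvmlInit_v2 has
// already been called elsewhere in the wrapper.
func DeviceCount() (uint, error) {
	var count C.uint
	if ret := C.nvmlDeviceGetCount_v2(&count); ret != C.NVML_SUCCESS {
		return 0, fmt.Errorf("nvmlDeviceGetCount_v2 returned %d", int(ret))
	}
	return uint(count), nil
}
```

The collector would then just call nvml.DeviceCount() and only ever deal with Go types and errors.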

Author

OK, I can rework that.

Author

So, at this point the repo does not have a headers/ directory, thus I have no idea what exactly should go in there. Can you give me a link to any other project/repo that has such a structure, as an example? Sorry, I'm primarily a C++ dev and really do not know the guidelines for Go/C interop.

Collaborator

Hm, perhaps you need to rebase? It looks like this branch started off in February, and the directory might not have existed then, but it does now. There's some background in this PR.

const subsystem = "nvidia"

if *processWhitelist == ".*" && *processBlacklist == "" {
	log.Warn("No filters specified for nvidia collector. This will generate a very large number of metrics!")
Collaborator

Is this a copy/paste, or will there actually be a lot of metrics?

Author

That is not a copy/paste. My initial thought was that basically any process using the GPU (so, at a minimum, any process that has a GUI) will be exposed here. Of course there won't be as many per-process metrics as in the process collector, but there will be as many entries as there are processes using the GPU. Again, something we can discuss and work out.

Collaborator

Okay, then no worries! Historically it's been a rather frequent occurrence that contributors start off by copy/pasting one of the collectors that generate a lot of metrics and get this log line by chance, so to speak. From your description it sounds warranted; I just wasn't sure how many matches it would result in.


return &nvidiaCollector{
	TotalProcessGpuMemoryUsed: prometheus.NewDesc(
		prometheus.BuildFQName(Namespace, subsystem, "total_process_gpu_memory_used"),
Collaborator

It is good to indicate the unit of measure in the name, e.g. total_process_gpu_memory_used_bytes (I guess).
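
For example, the renamed descriptor might look roughly like this (the help text and label names here are placeholders, not taken from the PR):

```go
TotalProcessGpuMemoryUsed: prometheus.NewDesc(
	prometheus.BuildFQName(Namespace, subsystem, "total_process_gpu_memory_used_bytes"),
	"GPU memory used per process, in bytes.", // placeholder help string
	[]string{"gpu", "process"},               // placeholder label names
	nil,
),
```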

Author

Ok noted, will fix.

		nil,
	),
	TotalGpuMemoryUsed: prometheus.NewDesc(
		prometheus.BuildFQName(Namespace, subsystem, "total_gpu_memory_used"),
Collaborator

Unit here too

	continue
}
if currentDriverModel == C.NVML_DRIVER_WDDM {
	log.Warnf("Gpu %s is using WDDM driver that does not allow collecting per process memory usage\n", name)
Collaborator

Shouldn't we just continue then?

Author

There is also TotalGpuMemoryUsed, which does not care about WDDM / TCC mode. That is why the continue is not there.
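
In other words, the intended control flow is roughly the following sketch (illustrative only, not the PR's exact code):

```go
if currentDriverModel == C.NVML_DRIVER_WDDM {
	// Per-process memory usage cannot be collected under WDDM,
	// so only log a warning here instead of skipping the device...
	log.Warnf("Gpu %s is using WDDM driver that does not allow collecting per process memory usage", name)
}
// ...because the total GPU memory usage is still collected below,
// regardless of whether the device runs in WDDM or WDM/TCC mode.
```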

Collaborator

Alright!

)

func init() {
	registerCollector("nvidia", newNvidiaCollector, "NVIDIA")
Collaborator

You don't seem to unmarshal the NVIDIA counters you request here?

Author

Sorry, this is literally just an error; I don't think I initially understood what that parameter does. I will remove it.
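
So the registration would presumably shrink to something like this (assuming the perflib-counter argument of registerCollector is variadic and can simply be dropped):

```go
func init() {
	registerCollector("nvidia", newNvidiaCollector)
}
```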

@@ -2,11 +2,78 @@

package collector
Collaborator

Given the amount of code and requirement for extra libraries, I'd be inclined to suggest we put the per-process GPU metrics in a separate collector. What do you think?

Author

Hmm, well, as for the extra libs, this only uses DXGI.lib, which is present in Windows/System32. The data comes from perflib; the extra code is just there to match the GPU LUIDs with the corresponding processes. Since I perceive the process collector as something that takes data from perflib, it makes sense to me that this stuff lives here, but again, we can talk about it.
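
To sketch that matching step: the perflib GPU counter instances encode both the process id and the adapter LUID in their instance names, so a small helper could pull them apart and the LUID could then be compared against the adapter LUIDs obtained via DXGI. The instance-name format and the helper below are assumptions for illustration, not code from this PR:

```go
package collector

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGpuInstance extracts the pid and adapter LUID from a perflib
// "GPU Process Memory" instance name, assumed to look like
// "pid_1234_luid_0x00000000_0x0000ABCD_phys_0".
func parseGpuInstance(instance string) (pid int, luid string, err error) {
	parts := strings.Split(instance, "_")
	if len(parts) < 5 || parts[0] != "pid" || parts[2] != "luid" {
		return 0, "", fmt.Errorf("unexpected GPU counter instance name %q", instance)
	}
	pid, err = strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", fmt.Errorf("bad pid in %q: %w", instance, err)
	}
	// Rejoin the high and low LUID halves so the result can be matched
	// against the LUID reported for each adapter.
	luid = parts[3] + "_" + parts[4]
	return pid, luid, nil
}
```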

@@ -2,11 +2,78 @@

package collector

/*
#cgo LDFLAGS: -lDXGI
Collaborator

As mentioned previously, this'd be nice to put in headers/ with a wrapper

@rubu (Author) commented Sep 19, 2021

@carlpett thanks for getting back. Let's try to resolve the points and I'll fix the stuff that needs fixing :) Since I already made some bugfixes, it actually became annoying to build this from source every time and redistribute the binary, when the proper way would be to use the official releases. Btw, one thing I stumbled upon last week: the promu build command actually silently excluded all the cgo stuff, since this was missing from the yaml file:

go:
    cgo: true

which I added. So up to this point windows_exporter did not depend on any C bindings at all?

@carlpett (Collaborator)

> So up to this point windows_exporter did not depend on any C bindings at all?

Hm. It does, e.g. for the container collector (although the cgo stuff itself is in a separate library). I wonder why this would work? Could we be using different versions of promu? I believe we pinned the version quite a while ago due to some bug in promu; there might have been upstream changes since then 😬

@okopanja

Just a short question: is this pull request still active or not?

@rubu (Author) commented Jan 21, 2022

@okopanja I didn't manage to refactor/split all the stuff, and it seems like there is not much need for it, so I can just keep it as is in my local fork. Are you interested in it or just tidying up? In case of the former, I can try and finally make everything nice and repush, otherwise I guess this can be tidied up.

@ChandonPierre

> @okopanja I didn't manage to refactor/split all the stuff, and it seems like there is not much need for it, so I can just keep it as is in my local fork. Are you interested in it or just tidying up? In case of the former, I can try and finally make everything nice and repush, otherwise I guess this can be tidied up.

I'd love to see this PR pushed over the line

@rubu (Author) commented Jul 12, 2022

@ChandonPierre Hey, sorry for abandoning work on this: simply lack of time, and it was easier to live with a local fork since the WDDM/GPU metrics were all that I needed. I can try to find time and give the refactor another go, but I can't make any promises on the timeline. Is anyone else maybe interested in helping out?

"unsafe"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/common/log"

I believe this should be "github.com/prometheus-community/windows_exporter/log"
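
i.e., the import block would presumably end up something like this (only the log import changes; the others are as in the PR):

```go
import (
	"unsafe"

	"github.com/prometheus-community/windows_exporter/log"
	"github.com/prometheus/client_golang/prometheus"
)
```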

@kczauz commented Apr 18, 2023

Hi @rubu, do you have some time to finish this up? I am interested in having some GPU metrics on Windows endpoints.

github-actions bot added the Stale label on Nov 25, 2023
github-actions bot closed this on Dec 26, 2023