WDDM metrics #817

Closed · wants to merge 10 commits

Conversation

@rubu commented Jul 14, 2021

I'm opening a PR for a feature that collects GPU metrics via WDDM in the perflib collector. (Initially the idea was to collect NVIDIA GPU metrics via NVML, but that does not work with cards in WDDM mode; that functionality is included in this PR as well.)

* this adds a new collector named nvidia which uses the C NVML library to query and expose per-process memory usage
* to disable useless polling of WDDM devices that will not report valid memory metrics, detect the driver type and skip devices that are currently using the WDDM driver
* skip device if driver model cannot be determined
* initialize pid memory map
* introduce total_process_gpu_memory_used (WDM only) / total_gpu_memory_used to expose per-GPU and per-process metrics
@rubu requested a review from a team as a code owner on July 14, 2021 14:50
@rubu (Author) commented Jul 14, 2021

What is the best practice? Should I squash the whole feature branch, merge it into my master, rebase it, and then open a PR?

@carlpett (Collaborator) left a comment

Thanks for the PR @rubu!
There are some code-organization points I've mentioned that we need to talk about. Also, I suppose our CI currently lacks a lot of the libraries needed to build this. Any idea what we need to install?

collector/nvidia.go
#include "nvml.h"
#include <windows.h>

// definition of required import symbols
Collaborator

We have a pattern of putting cgo bits in the headers/ directory and exposing a "nice" Go interface from there, so the collector logic gets easier to read.
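
Purely for illustration, a minimal sketch of that layout, assuming a hypothetical headers/nvml package; the file path, function name, and link flag below are placeholders, not taken from this PR:

```go
// headers/nvml/nvml.go (hypothetical path, shown only to illustrate the pattern)
package nvml

/*
#cgo LDFLAGS: -lnvml
#include "nvml.h"
*/
import "C"

import "fmt"

// DeviceCount hides the raw cgo call behind a plain Go function, so the
// collector never has to import "C" itself. It assumes nvmlInit_v2 has
// already been called elsewhere in the wrapper.
func DeviceCount() (uint, error) {
	var count C.uint
	if ret := C.nvmlDeviceGetCount_v2(&count); ret != C.NVML_SUCCESS {
		return 0, fmt.Errorf("nvmlDeviceGetCount_v2 returned %d", int(ret))
	}
	return uint(count), nil
}
```

The collector would then just call nvml.DeviceCount() and only ever deal with Go types and errors.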

Author

OK, I can rework that.

Author

So, at this point the repo does not have a headers/ directory, thus I have no idea what exactly should go in there. Can you give me a link to any other project/repo that has such a structure, as an example? Sorry, I'm primarily a C++ dev and really do not know the guidelines for Go/C interop.

Collaborator

Hm, perhaps you need to rebase? It looks like this branch started off in February, and the directory might not have existed then, but it does now. There's some background in this PR.

const subsystem = "nvidia"

if *processWhitelist == ".*" && *processBlacklist == "" {
	log.Warn("No filters specified for nvidia collector. This will generate a very large number of metrics!")
Collaborator

Is this a copy/paste, or will there actually be a lot of metrics?

Author

That is not a copy/paste. My initial thought was that basically any process using the GPU (so, at a minimum, any process that has a GUI) will be exposed here. Of course there won't be as many per-process metrics as in the process collector, but there will be as many entries as there are processes using the GPU. Again, something we can discuss and work out.

Collaborator

Okay, then no worries! Historically it's been a rather frequent occurrence that contributors start off by copy/pasting one of the collectors that generate a lot of metrics and get this log line by chance, so to speak. From your description it sounds warranted; I just wasn't sure how many matches it would result in.


return &nvidiaCollector{
	TotalProcessGpuMemoryUsed: prometheus.NewDesc(
		prometheus.BuildFQName(Namespace, subsystem, "total_process_gpu_memory_used"),
Collaborator

It is good to indicate the unit of measure in the name, e.g. total_process_gpu_memory_used_bytes (I guess).
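
For example, the renamed descriptor might look roughly like this (the help text and label names here are placeholders, not taken from the PR):

```go
TotalProcessGpuMemoryUsed: prometheus.NewDesc(
	prometheus.BuildFQName(Namespace, subsystem, "total_process_gpu_memory_used_bytes"),
	"GPU memory used per process, in bytes.", // placeholder help string
	[]string{"gpu", "process"},               // placeholder label names
	nil,
),
```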

Author

Ok noted, will fix.

		nil,
	),
	TotalGpuMemoryUsed: prometheus.NewDesc(
		prometheus.BuildFQName(Namespace, subsystem, "total_gpu_memory_used"),
Collaborator

Unit here too

	continue
}
if currentDriverModel == C.NVML_DRIVER_WDDM {
	log.Warnf("Gpu %s is using WDDM driver that does not allow collecting per process memory usage\n", name)
Collaborator

Shouldn't we just continue then?

Author

There is also TotalGpuMemoryUsed, which does not care about WDDM / TCC mode. That is why the continue is not there.
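
In other words, the intended control flow is roughly the following sketch (illustrative only, not the PR's exact code):

```go
if currentDriverModel == C.NVML_DRIVER_WDDM {
	// Per-process memory usage cannot be collected under WDDM,
	// so only log a warning here instead of skipping the device...
	log.Warnf("Gpu %s is using WDDM driver that does not allow collecting per process memory usage", name)
}
// ...because the total GPU memory usage is still collected below,
// regardless of whether the device runs in WDDM or WDM/TCC mode.
```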

Collaborator

Alright!

)

func init() {
	registerCollector("nvidia", newNvidiaCollector, "NVIDIA")
Collaborator

You don't seem to unmarshal the NVIDIA counters you request here?

Author

Sorry, this is literally just an error; I don't think I initially understood what that parameter does. I will remove it.
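
So the registration would presumably shrink to something like this (assuming the perflib-counter argument of registerCollector is variadic and can simply be dropped):

```go
func init() {
	registerCollector("nvidia", newNvidiaCollector)
}
```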

@@ -2,11 +2,78 @@

package collector
Collaborator

Given the amount of code and requirement for extra libraries, I'd be inclined to suggest we put the per-process GPU metrics in a separate collector. What do you think?

Author

Hmm, well, as for the extra libs, this only uses DXGI.lib, which is present in Windows/System32. The data comes from perflib; the extra code is just there to match the GPU LUIDs with the corresponding processes. Since I perceive the process collector as something that takes data from perflib, it makes sense to me that this stuff lives here, but again, we can talk about it.
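
To sketch that matching step: the perflib GPU counter instances encode both the process id and the adapter LUID in their instance names, so a small helper could pull them apart and the LUID could then be compared against the adapter LUIDs obtained via DXGI. The instance-name format and the helper below are assumptions for illustration, not code from this PR:

```go
package collector

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGpuInstance extracts the pid and adapter LUID from a perflib
// "GPU Process Memory" instance name, assumed to look like
// "pid_1234_luid_0x00000000_0x0000ABCD_phys_0".
func parseGpuInstance(instance string) (pid int, luid string, err error) {
	parts := strings.Split(instance, "_")
	if len(parts) < 5 || parts[0] != "pid" || parts[2] != "luid" {
		return 0, "", fmt.Errorf("unexpected GPU counter instance name %q", instance)
	}
	pid, err = strconv.Atoi(parts[1])
	if err != nil {
		return 0, "", fmt.Errorf("bad pid in %q: %w", instance, err)
	}
	// Rejoin the high and low LUID halves so the result can be matched
	// against the LUID reported for each adapter.
	luid = parts[3] + "_" + parts[4]
	return pid, luid, nil
}
```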

@@ -2,11 +2,78 @@

package collector

/*
#cgo LDFLAGS: -lDXGI
Collaborator

As mentioned previously, this'd be nice to put in headers/ with a wrapper

@rubu (Author) commented Sep 19, 2021

@carlpett thanks for getting back. Let's try to resolve the points and I'll fix the stuff that needs fixing :) Since I already made some bugfixes, it actually became annoying to build this from source every time and redistribute the binary, when the proper way would be to use the official releases. Btw, one thing I stumbled upon last week: the promu build command actually silently excluded all the cgo stuff, since this was missing from the yaml file:

go:
    cgo: true

which I added. So up to this point windows_exporter did not depend on any C bindings at all?

@carlpett (Collaborator)

> So up to this point windows_exporter did not depend on any C bindings at all?

Hm. It does, e.g. for the container collector (although the cgo stuff itself is in a separate library). I wonder why this would work? Could we be using different versions of promu? I believe we pinned the version quite a while ago due to some bug in promu; there might have been upstream changes since then 😬

@okopanja

Just a short question: is this pull request still active or not?

@rubu (Author) commented Jan 21, 2022

@okopanja I didn't manage to refactor/split all the stuff, and it seems like there is not much need for it, so I can just keep it as is in my local fork. Are you interested in it or just tidying up? In case of the former, I can try and finally make everything nice and repush, otherwise I guess this can be tidied up.

@ChandonPierre

> @okopanja I didn't manage to refactor/split all the stuff, and it seems like there is not much need for it, so I can just keep it as is in my local fork. Are you interested in it or just tidying up? In case of the former, I can try and finally make everything nice and repush, otherwise I guess this can be tidied up.

I'd love to see this PR pushed over the line

@rubu (Author) commented Jul 12, 2022

@ChandonPierre Hey, sorry for abandoning work on this: simply lack of time, and it was easier to live with a local fork since the WDDM/GPU metrics were all that I needed. I can try to find time and give the refactor another go, but I can't make any promises on the timeline. Is anyone else maybe interested in helping out?

"unsafe"

"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/common/log"

I believe this should be "github.com/prometheus-community/windows_exporter/log"
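
i.e., the import block would presumably end up something like this (only the log import changes; the others are as in the PR):

```go
import (
	"unsafe"

	"github.com/prometheus-community/windows_exporter/log"
	"github.com/prometheus/client_golang/prometheus"
)
```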

@kczauz commented Apr 18, 2023

Hi @rubu, do you have some time to finish this up? I am interested in having some GPU metrics on Windows endpoints.

github-actions bot added the Stale label on Nov 25, 2023
github-actions bot closed this on Dec 26, 2023