
Possible aberration with the new CPU metrics on rtc counter reset (usage % above 100%) #1299

Open
JDA88 opened this issue Sep 28, 2023 · 37 comments


@JDA88
Contributor

JDA88 commented Sep 28, 2023

windows_exporter version: 0.23.0

I am trying to migrate our CPU usage percent metric to the new ones introduced in v0.21.0.

Query 1 (Old) per core usage (limited to 1 core for the example):
((1-(avg without (mode) (rate(windows_cpu_time_total{mode="idle", core="0,0"}[1m])))) * 100)

Query 2 (New) per core usage (limited to 1 core for the example):
rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m])

The issue is that sometimes I can get values way above 100% (120%+) with the new query. I fixed it with a clamp, but that's far from ideal. (I actually just reverted the change in production until I have a proper way to fix the issue.)
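The clamp workaround was along these lines (a sketch using PromQL's clamp_max, not the exact expression we run):

clamp_max(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m]), 100)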

Example with Query 1 in yellow and Query 2 in green:
[image: Query 1 vs Query 2 over time]

Looking at the detail of each new metric, I might have found the root cause:

  • windows_cpu_processor_utility_total is fine (it continues to increment until the computer reboots)
  • windows_cpu_processor_rtc_total, on the other hand, is reset every time it reaches ~4300000000, and the calculation artefacts appear exactly at that time.

Example below with windows_cpu_processor_rtc_total graphed for 2 different servers:
[image: windows_cpu_processor_rtc_total for two different servers]

@higels is this reset expected in between reboots? Am I doing something wrong?
From my understanding, given the lack of an xrate function, we should avoid frequent counter resets.

PS: Our Windows servers are mostly VMware VMs

@JDA88 JDA88 changed the title Possible aberration with the new CPU metrics (usage % above 100%) Possible aberration with the new CPU metrics on rtc counter reset (usage % above 100%) Sep 28, 2023
@higels
Contributor

higels commented Sep 29, 2023

Yes, it is possible for the utilization number to go over 100%. I swear I had documentation for the % Processor Utility counter under Processor Information stating this, but I can't seem to find it right now. For bare metal, I think the theoretical maximum is something like the ratio of max boost to TSC frequency (so 165% on a 7702P - 3.3:2.0), but you can see a similar explanation here: https://learn.microsoft.com/en-us/troubleshoot/windows-client/performance/cpu-usage-exceeds-100.

Depending on how VMware exposes various MSRs and what happens when the VM can't be scheduled, your guess is as good as mine. All I can say is that when a system is under moderate all-core load, the old-style counters mask how far the system is from being completely utilized.

@JDA88
Contributor Author

JDA88 commented Sep 29, 2023

I would agree with you if the usage (with the classic counters) around the artefact time also showed a spike (capped at 100%), but:

  • All the artefacts I found occur exactly when the counter resets; I don’t believe in coincidence in IT anymore :)
  • Prometheus is known to produce wrong calculations around counter resets with rate (the community has been asking for an xrate function to partially solve this for quite a long time)
  • Those are VMs running on hosts configured with turbo boost and hyper-threading disabled
  • The difference in graphed values between the classic and new counters is too great to be true, imo

Even if I cap the value at 100%, comparing it to the Task Manager on the computer, the value is too different during those events, and our engineers would not understand why there is a spike every few hours that does not register elsewhere.

Do you think there is a way to prevent the windows_cpu_processor_rtc_total metric from resetting every few hours? (Could the exporter keep adding to the counter and never reset it?) The issue is probably on the Prometheus rate side and not in the raw data itself.

@higels
Contributor

higels commented Sep 29, 2023

Ah, I see what you are saying - I was referring to a separate issue. I do also see this issue with our own fleet, but I'm rarely zooming in at this kind of resolution to notice.

So the actual problem is that rate(windows_cpu_processor_rtc_total[1m]) should be steady on a given CPU (or at least it is on any of the Windows servers I have access to), but the resets make it very choppy.

I don't want to speak for the maintainers of the project, but I think the philosophy has been to expose any data warts-and-all, avoid storing state between scrapes and to handle any unpleasantness in promql and grafana.

Assuming I'm correct, I see two likely ways out of this:

  1. We find another source for this counter that is stored in something larger than a 32-bit int and use that instead.
  2. You smooth out the unpleasantness with max_over_time(rate(windows_cpu_processor_rtc_total{}[2m])[10m:2m]) or similar, as sketched below. This makes the possibly bold assumption that this value should be completely stable, but I can't find any indication that it's otherwise.
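Plugged into your per-core utilization query, Option 2 would look something like this (a sketch; the interval choices are illustrative):

rate(windows_cpu_processor_utility_total{core="0,0"}[2m]) / max_over_time(rate(windows_cpu_processor_rtc_total{core="0,0"}[2m])[10m:2m])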

@JDA88
Contributor Author

JDA88 commented Sep 29, 2023

I agree on the "expose the data as raw as possible" philosophy, and Option 1 would really be ideal.
Unfortunately for us, Option 2 is a no-go, as the teams want alerts ASAP on CPU spikes; smoothing does indeed fix the issue, but it also delays the alerts...

Reading Option 2 again: is the value supposed to be constant even with turbo spikes, etc.?

@higels
Contributor

higels commented Sep 29, 2023

Yes, the value should be constant, so I don't think it would affect alert latency.

@JDA88
Contributor Author

JDA88 commented Sep 29, 2023

Ok, I'll try to POC that. Thanks for the clarification

@JDA88
Contributor Author

JDA88 commented Nov 13, 2023

Ok, I tested the smoothing with max_over_time(rate(windows_cpu_processor_rtc_total{}[2m])[10m:2m]) and I'm not really satisfied.
Both rate(windows_cpu_processor_rtc_total[1m]) and max_over_time(rate(windows_cpu_processor_rtc_total{}[2m])[10m:2m]) show variation (the max_over_time one is much smaller, but it still fluctuates).

Example here (on a period without resets):

max_over_time(rate(windows_cpu_processor_rtc_total{instance="host", core="0,0"}[1m])[10m:1m]) - 
rate(windows_cpu_processor_rtc_total{instance="host",core="0,0"}[1m])

[image: difference between the smoothed and raw rate]

or
max_over_time(rate(windows_cpu_processor_rtc_total{instance="host", core="0,0"}[5m])[10m:5m])
[image: max_over_time over a 5m rate]

If rate(windows_cpu_processor_rtc_total[1m]) is supposed to return a constant, it should be exposed by the exporter as a constant and not as a _total that you have to rate. Rating it introduces variations that, even if minimal, impact the CPU % usage computation.

@gonchik

gonchik commented Jul 8, 2024

@JDA88 In the end, how did you finalize those graphs?

@JDA88
Contributor Author

JDA88 commented Jul 8, 2024

@JDA88 In the end, how did you finalize those graphs?

I never managed to use those new metrics. As stated above, I suppose windows_cpu_processor_rtc_total should be a constant, and the fact that it is not makes the calculation wrong from time to time.

@gonchik

gonchik commented Jul 8, 2024

@breed808 Thanks for your response. You saved me a lot of time :)

I will review with the existing counters.

@jkroepke
Member

Hi all,

I'm currently asking myself how windows_exporter could solve this issue.

Reading https://www.cpuid.com/softwares/hwmonitor.html, it's expected that Windows can report more than 100% CPU.
There is a dedicated article for that case: https://learn.microsoft.com/en-us/troubleshoot/windows-client/performance/cpu-usage-exceeds-100

@JDA88
Contributor Author

JDA88 commented Sep 12, 2024

it's expected that Windows can report more than 100% CPU

True, but the Task Manager caps the usage at 100%, and the issue here is that windows_cpu_processor_rtc_total is a counter that resets every time it reaches ~4300000000, and the calculation artefacts appear exactly at that time.

The value should be exposed as a separate constant metric, not a counter. That way you could replace:
rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m])
with
rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / (windows_cpu_processor_rtc_value{})

The only issue I can see is if rate(windows_cpu_processor_rtc_total{}[1m]) is not expected to be constant. Then we are dependent on Prometheus fixing the calculation artefacts of rate on counters.

@jkroepke
Member

jkroepke commented Sep 12, 2024

Task Manager caps the usage at 100%

Then let's do the same in the Prometheus queries. Use the clamp_max function to cap it, like:

clamp_max(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m]), 100)

@JDA88
Contributor Author

JDA88 commented Sep 12, 2024

There is a misunderstanding about the issue.
The problem is not that with those counters you sometimes get spikes above 100%; the problem is that you get artefacts only because of the need to use rate to evaluate a value that is probably a constant.

The graph below shows it perfectly: in yellow, the reality on the server (constant CPU usage around 80%); in green, the new metrics with the rate issue, showing a spike to 100%+ which is NOT real, as the test process is limited to 80% usage on the server.

[image: constant ~80% load (yellow) vs artefact spike above 100% (green)]

@jkroepke
Member

jkroepke commented Sep 12, 2024

I feel that this is exactly the behavior that Microsoft and other tools describe as a known issue.

In the FAQ of HWMonitor, they describe the issue appearing when using the utility counters instead of the time counters.

Using the utility-based counters, it's expected to get a result of more than 100%. Time-based counters don't have that issue.

That's the reason why windows_cpu_time_total does not have that issue while windows_cpu_processor_utility_total has it.

the problem is that you get artefacts only because of the need to use rate to evaluate a value that is probably a constant.

I have the feeling that the windows_cpu_processor_utility_total counter reports some gaps in its increase, which leads to this situation.

If windows_exporter used the calculated values offered by Windows, then 100% would be exceeded as well.

@JDA88
Contributor Author

JDA88 commented Sep 12, 2024

For us, having regular spikes (even capped at 100%) that do not exist in reality makes the counter useless, at least for computing % usage.

@jkroepke
Member

@JDA88 If I understand correctly, the counter reset is the issue?

For the 0.30 release, I plan to offer an option to use the Performance Counter API (pdh.dll) to collect the values. (#1350)

Currently, windows_exporter manually scrapes the data directly from the Windows registry. I would like to come back to this issue once the new option is available, just to rule out an overflow issue somewhere in the old code. The old code reads binary data directly.


What would really help to dig deeper into this issue would be a perf counter log.

As far as I know, the built-in Performance Monitor tool can monitor and track the counter values. Then we can identify whether Performance Monitor shows similar peaks, too. We could also compare whether the values reported by windows_exporter are equal to those of the Windows tools.

My knowledge is limited here, but I found this: https://serverfault.com/questions/1012544/how-can-i-monitor-cpu-usage-and-processes-on-windows-server-2012-as-a-service-ov

that do not exist in reality makes the counter useless, at least for computing % usage

Just one of the few things in Windows that are useless.

@higels
Contributor

higels commented Sep 12, 2024

To be clear: we don't even know what those arbitrary bytes in the registry are really supposed to represent; that it is the RTC was just an educated guess. Either way, the width of the integer being used is very narrow.

If we can cleanly export the value as a constant (which it should be on any hardware I've tested), that would be ideal; I just felt the correct approach was to take what Windows gave us, garbage or otherwise!

If I use the max_over_time trick with a 3h backward look, it is acceptable for my usage at least (the RTC counter rolled over at 12:08, 10:14 and 08:19):

[image: max_over_time with a 3h lookback across three RTC rollovers]

@JDA88
Contributor Author

JDA88 commented Sep 13, 2024

@JDA88 If I understand correctly, the counter reset is the issue?

Yes.

What would really help to dig deeper into this issue would be a perf counter log.

I'm in for anything that can help find a use for those metrics.

If we can cleanly export the value as a constant (which it should be on any hardware I've tested), that would be ideal

Agree :)

@jkroepke
Member

If I understand correctly, this report is different from the error below, where I constantly got more than 100%:

[image: CPU usage constantly above 100%]

@higels
Contributor

higels commented Sep 13, 2024

If I understand correctly, this report is different from the error below, where I constantly got more than 100%:

Yes.

My naive guess is that Windows intends 100% CPU to mean "would be 100% utilised if the CPU were fully throttled to its TSC frequency", so if you are running at boost and fully utilising the CPU, it will show > 100%.

I have seen it be as high as 160% on a single hyperthread (where TSC is 2.0GHz and max boost is 3.2GHz).

I could be wrong. Hyperthreads complicate this as well.

We've been using these numbers for capacity planning and tuning since I wrote the collector 2ish years ago. @JDA88 has more precise alerting requirements which is why the resets were noticed.

@jkroepke
Member

jkroepke commented Sep 13, 2024

[image: local test comparing the PromQL query with formatted performance counter values]

Here is another screenshot from a local test.

  • green is the current rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total{}[1m])
  • blue is the value calculated by the Windows performance counter API (using formatted values)

At least they look similar, which I take to mean that the Prometheus query is good-ish in general.

I will take a look at the counter reset, but it's hard for me to reproduce the issue.

I guess the only way to solve this kind of issue would be to switch from raw performance counters to formatted ones.

@jkroepke
Member

Some additional insight I would like to share.

I figured out that the RTC metric is based on the "second value" of the Processor Information\% Processor Utility counter.

Running (Get-Counter "\Prozessorinformationen(0,0)\Prozessorzeit (%)").CounterSamples | Format-List -Property * in PowerShell returns a very large number which is different from what windows_exporter reports, and I have no clue why:

(Get-Counter "\Prozessorinformationen(0,0)\Prozessorzeit (%)").CounterSamples | Format-List -Property *


Path             : \\jok-pc\prozessorinformationen(0,0)\prozessorzeit (%)
InstanceName     : 0,0
CookedValue      : 5,87935908793359
RawValue         : 8779062500
SecondValue      : 133709939437918651
MultipleCount    : 1
CounterType      : Timer100NsInverse
Timestamp        : 16.09.2024 22:59:03
Timestamp100NSec : 133710011437918651
Status           : 0
DefaultScale     : 0
TimeBase         : 10000000

In that case, SecondValue is the number of 100-nanosecond intervals that have elapsed since 12:00 A.M. January 1, 1601 Coordinated Universal Time (UTC), which is a non-resetting value.
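For reference (an assumption on my side, based on the documented formula for the PERF_100NSEC_TIMER_INV counter type that Timer100NsInverse corresponds to), the cooked value is computed from two raw samples as:

CookedValue = (1 - (RawValue1 - RawValue0) / (SecondValue1 - SecondValue0)) * 100

which is why a non-resetting time base keeps the formatted calculation stable.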

A deeper dive is needed to identify the root cause here.

@higels
Contributor

higels commented Sep 16, 2024

My understanding is that you are concerned that "Processor Utility" and "Processor Time" show different values. This is expected and reasonably well understood. IIRC the former was added in Server 2016 / Windows 10; the latter dates back to the days of fixed clock speeds and no hyperthreading.

See #787

@jkroepke
Member

jkroepke commented Sep 16, 2024

Thanks for the clarification. My system language is German, and performance counters are localized in Windows. I grabbed the wrong one. 😅

@higels
Contributor

higels commented Sep 16, 2024

No problem!

@jkroepke
Member

If a constant would be a workaround, what about:

rate(windows_cpu_processor_utility_total[1m]) / 625000

I ran the following query

round(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / 625000, 0.1) == round(rate(windows_cpu_processor_utility_total{core="0,0"}[1m]) / rate(windows_cpu_processor_rtc_total[2m]), 0.1)

to check the difference, rounding to one decimal place.

The gaps (where the two queries are not equal) are minimal; I would say okay-ish.

@JDA88
Contributor Author

JDA88 commented Sep 20, 2024

Is it OK to have a metric that needs a rate and a division by a magic number to be used correctly? And are we 100% sure the value is the same across all computers?

Unfortunately, we don't have a lot of hardware variety here, so I can't really test that.

@jkroepke
Member

And are we 100% sure the value is the same across all computers?

We are 100% sure that the value is not the same across all computers. There is a variation of roughly +/- 100 around 625000.

Overall, I have the feeling that % Processor Utility may only make sense at a 1-second poll rate. Otherwise, the error here appears.

On base value reset, the PowerShell command Get-Counter fails as well, because the difference between two collections is negative.

After endless efforts here, this performance counter seems unusable.

@JDA88
Contributor Author

JDA88 commented Sep 20, 2024

On base value reset, the PowerShell command Get-Counter fails as well, because the difference between two collections is negative.

Damn. Sorry, but in their current state I don't see any use for those metrics in our use cases.
I was excited about the possibility of a more precise metric than the original one, but that's not the case. There might be other scenarios where they are better, though.

@jkroepke
Member

💯 agree.

I'm trying to solve this, but I can't fix design issues here.

@higels
Contributor

higels commented Sep 20, 2024

Here are some random samples from across our fleet. The rate seems to be invariant on a given system, excluding sampling errors:

ceil(avg by (name) (avg by (host) (rate(windows_cpu_processor_rtc_total[2m])) * on (host) group_right windows_cpu_info))

{name="AMD EPYC 9534 64-Core Processor"} 643381
{name="AMD EPYC 7702P 64-Core Processor"} 624986
{name="AMD Ryzen 9 3900X 12-Core Processor"} 624999
{name="AMD EPYC 7402P 24-Core Processor"} 624997
{name="AMD Ryzen 9 5900X 12-Core Processor"} 624984
{name="Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz"} 610371
{name="AMD EPYC 7232P 8-Core Processor"} 624984
{name="AMD EPYC 7713P 64-Core Processor"} 624967
{name="Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz"} 624973
{name="Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz"} 624980
{name="AMD EPYC 7343 16-Core Processor"} 625035
{name="Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz"} 894871
{name="Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz"} 625009
{name="Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz"} 625003
{name="Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz"} 535830

So it's definitely not the same across all systems. Note that the old E5-2667 v2 (hello 2014) is nearly 900K.

If we really want to solve this by exporting it as a gauge, we should read at initialization until we get a positive number, and then export it as a constant thereafter.
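A rough sketch of what that could look like (hypothetical Go, not the actual collector code; readRTC stands in for the registry read):

package main

import (
	"fmt"
	"time"
)

// initRTCConstant samples the RTC counter until it observes a positive
// delta (i.e. no rollover inside the window), then returns the
// per-second rate, which the exporter could publish as a constant gauge.
func initRTCConstant(readRTC func() uint32, interval time.Duration) float64 {
	prev := readRTC()
	for {
		time.Sleep(interval)
		cur := readRTC()
		if cur > prev { // positive delta: safe to use
			return float64(cur-prev) / interval.Seconds()
		}
		prev = cur // rollover hit this window; try the next one
	}
}

func main() {
	// Fake reader that gains 625000 ticks per sample, for demonstration.
	var ticks uint32
	readRTC := func() uint32 { ticks += 625000; return ticks }
	fmt.Println(initRTCConstant(readRTC, time.Second)) // 625000
}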

@higels
Contributor

higels commented Sep 20, 2024

Thinking about it some more: if we're going to go that far, we may as well divide the utility metrics by that constant * 100, since I think that's the only purpose it serves. My "cpu idle" query would then go from:

1 - avg(rate(windows_cpu_processor_utility_total{}[$__rate_interval])
/
(1e2 * max_over_time(rate(windows_cpu_processor_rtc_total{}[$__rate_interval])[3h:$__rate_interval])))

to just

1 - avg(rate(windows_cpu_processor_utility_total{}[$__rate_interval]))

but we'd presumably have to rename the metrics.

@JDA88
Contributor Author

JDA88 commented Sep 20, 2024

The metric query would look much more "intuitive" that way, that's for sure!

@jkroepke
Member

If we really want to solve this by exporting it as a gauge, we should read at initialization until we get a positive number, and then export it as a constant thereafter.

On my system, rate(windows_cpu_processor_rtc_total[1m]) doesn't produce a stable rate.

[image: unstable rate(windows_cpu_processor_rtc_total[1m])]

I mean, we can do such shenanigans, but I have the feeling that in the end, doing it that way results in a less precise value than just using the windows_cpu_time_total metric.

@jkroepke
Member

jkroepke commented Sep 20, 2024

I got some additional info on how the counter works and why it doesn't work in the Prometheus ecosystem:

The rollover is an integer overflow (32-bit unsigned int); the counter wraps at 4294967295 (2^32 - 1).

In languages without integer overflow protection (like C), the integer subtraction is stable: it just works, including across the rollover. (Go ref: https://go.dev/play/p/57GXzfT-m5G)

For example:

  • C0 = (2^32) - 10
  • C1 = 10

C1 - C0 = 20

The problem here is that all Prometheus values are float64. By converting the values, the bit-based mechanic is lost: the subtraction goes negative, and rate() treats it as a counter reset. A sketch of this is below.
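A minimal Go sketch of that behavior (assuming uint32 counters; this mirrors the playground link above rather than any windows_exporter code):

package main

import "fmt"

func main() {
	// Samples taken just before and just after the 32-bit rollover.
	var c0 uint32 = (1 << 32) - 10 // 4294967286
	var c1 uint32 = 10

	// Unsigned subtraction wraps modulo 2^32, so the true increase
	// of 20 ticks survives the rollover.
	fmt.Println(c1 - c0) // 20

	// As float64 (how Prometheus stores samples), the same subtraction
	// goes negative, which rate() interprets as a counter reset.
	fmt.Println(float64(c1) - float64(c0)) // -4.294967276e+09
}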

@higels
Contributor

higels commented Sep 20, 2024

The metric query would look much more "intuitive" that way, that's for sure!

I'll see if I can get some time to mock this up next week.
