namedprocess_namegroup_context_switches_total counter is decreasing #193

ngosang opened this issue May 27, 2021 · 1 comment
@ngosang

ngosang commented May 27, 2021

v0.7.5
The metric namedprocess_namegroup_context_switches_total is declared as a counter, as it should be. Most of the time the value increases, but not always. This has been causing me a lot of issues.

In this image you can see how the value increases and decreases. I think this only happens in some processes with many context switches. In this case I'm able to reproduce it with 2 Mono apps on Linux (Sonarr and Radarr).
image

When I apply the rate function, the graph is a mess due to the decreasing values in the vector.
image

For now I have worked around it by using the deriv function instead of rate. This graph is mostly accurate.
image

How are you getting the context switches? How is it possible that the value decreases? How can I help?

@lawsontyler

I'm also seeing this issue. I assume it's because the exporter is doing a straight sum() over all the matching processes without keeping any history.

For example, let's assume we have a process that accepts network connections. The main process spawns 2 subprocesses. Each subprocess handles 1000 requests and then terminates itself, causing the main process to spawn a new subprocess to replace it.

In the beginning you might have 3 PIDs: 10, 20, 30. At time T0, they all start at 0 context switches.

@ T1
PID 10 - 100 switches
PID 20 -  10 switches
PID 30 -  10 switches
SUM    = 120 switches

@ T2
PID 10 -  150 switches
PID 20 - 1000 switches
PID 30 - 2000 switches
SUM    = 3150 switches
...etc.

Now, what happens when one of the processes dies and is replaced?

@ TN
PID 10 -  160 switches
PID 20 - 1200 switches
PID 40 - 0 switches
SUM    = 1360 switches

Oops...the number of context switches went down!
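To make the arithmetic concrete, here is a minimal Python sketch of the "straight sum with no history" behaviour I'm assuming. The snapshot numbers are the made-up ones from the example above, and naive_group_total is a hypothetical name, not anything taken from the exporter's source.

```python
# Hypothetical per-PID context-switch readings from the example above.
snapshots = {
    "T1": {10: 100, 20: 10, 30: 10},
    "T2": {10: 150, 20: 1000, 30: 2000},
    "TN": {10: 160, 20: 1200, 40: 0},  # PID 30 exited, PID 40 replaced it
}


def naive_group_total(per_pid):
    """Sum the counters of the currently live processes only (no history)."""
    return sum(per_pid.values())


totals = {label: naive_group_total(per_pid) for label, per_pid in snapshots.items()}

for label, total in totals.items():
    print(label, total)

# Everything PID 30 had accumulated vanishes from the sum once it exits,
# so the group total at TN is lower than it was at T2.
print("decreased:", totals["TN"] < totals["T2"])
```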

This has produced an interesting result for us: context switching looks like it is constantly accelerating for our long-running processes, because PID 10's count keeps increasing while the rate() function in Prometheus thinks the counter is resetting all the time.
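A rough sketch of why the dips inflate the result: Prometheus counter functions treat any decrease as a counter reset and assume the counter restarted from zero. The function below only illustrates that reset handling. It is not Prometheus code, the name is hypothetical, and it ignores the extrapolation that rate() also performs.

```python
def increase_with_reset_handling(samples):
    """Total increase over a series, treating every drop as a counter reset."""
    total = 0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Drop detected: assume the counter restarted at 0, so the whole
            # current value is counted as new increase.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# Group sums from the example above: T1, T2, then TN after PID 30 was replaced.
samples = [100 + 10 + 10, 150 + 1000 + 2000, 160 + 1200 + 0]

reported = increase_with_reset_handling(samples)

# Switches actually observed on the surviving PIDs: T1 -> T2 for all three,
# then T2 -> TN for PID 10 and PID 20 (PID 40 is still at 0).
observed = (50 + 990 + 1990) + (10 + 200)

print("reported increase:", reported)
print("observed increase:", observed)
```

The surviving processes' existing counts get counted again every time a sibling exits, which is the overcounting described above.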

I'm not sure how this should be solved, however. Adding the PID as a label would create high cardinality.
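One possible direction, sketched in Python purely as an assumption on my part (I have not checked how the exporter is structured): keep the last value seen per PID inside the exporter and add only the deltas to a running group total. The PID never becomes a label, and the exported value cannot go down. GroupCounter and its fields are hypothetical names.

```python
class GroupCounter:
    """Accumulates per-PID deltas into one monotonic group total."""

    def __init__(self):
        self.total = 0
        self.last_seen = {}  # pid -> counter value at the previous scrape

    def update(self, per_pid):
        for pid, value in per_pid.items():
            prev = self.last_seen.get(pid, 0)
            if value >= prev:
                self.total += value - prev
            else:
                # Value went backwards for this PID: assume the PID was
                # reused by a new process and count from zero.
                self.total += value
        # Forget PIDs that are gone; what they already contributed stays
        # in self.total.
        self.last_seen = dict(per_pid)
        return self.total


counter = GroupCounter()
results = [
    counter.update({10: 100, 20: 10, 30: 10}),      # T1
    counter.update({10: 150, 20: 1000, 30: 2000}),  # T2
    counter.update({10: 160, 20: 1200, 40: 0}),     # TN, PID 30 replaced
]
print(results)
```

The trade-offs: switches a process performs between the last scrape and its exit are lost, and a process that was already running when the exporter started contributes its whole history on first sight. The total undercounts slightly, but it stays a valid counter.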
