Adjust dropped metrics from cAdvisor #1396
Conversation
lgtm
This change drops pod-centric metrics without a non-empty 'container' label. Previously we dropped pod-centric metrics without a (pod, namespace) label set; however, these can be critical for debugging.
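A rule of this shape can be sketched as a raw Prometheus `metric_relabel_configs` entry. This is only an illustration of the described behaviour, not the exact rule in the PR, and the metric-name regex is an assumption based on the families discussed later in this thread:

```yaml
metric_relabel_configs:
  # Drop the pod-slice series (empty 'container' label) for these
  # container-centric metric families. The name list is illustrative.
  - source_labels: [__name__, container]
    # The default label separator is ';'. An empty 'container' label
    # leaves nothing after the separator, so this regex matches only
    # series where container="".
    regex: (container_fs_.*|container_blkio_.*|container_spec_.*);
    action: drop
```

The trailing `;` in the regex is what restricts the drop to series with an empty `container` label; series carrying a container name fail the match and are kept.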
Force-pushed from 6eb9554 to 56f96e6.
lgtm, thanks (in general io behaves like cpu / mem, whereas the others are truly container specific)
@philipgough, aren't those discarded metrics useful for debugging StatefulSets? Or pods using hostPath volumes? Are these metrics available elsewhere?
@mindw my understanding was that the dropped
@philipgough here is an example:
Hi @mindw, thank you for raising your concerns. To give you more context about what we are trying to achieve here: we had previously discarded some of these metrics as part of an effort to reduce the number of series exposed by the kubelet, and had identified some container-centric metrics for which pod-level information wasn't useful.
@dgrisonnet thank you for the detailed explanation! My apologies, but I'm not sure how the container-level metrics make less sense than the pod-level ones, especially with multiple sidecars. Also, how does one get those metrics? See:
@mindw - When I deployed the change to an OpenShift cluster, I queried a pod in a StatefulSet and still got the info. So I'm trying to figure out what's different about your setup. Can you provide info on the following:
@philipgough thank you for looking into this :)
Container-level metrics don't completely represent the metrics of a pod: a pod slice is the sum of all containers plus any work handled directly in the pod slice, so you can't get the total resources accounted to a pod by summing its containers. That said, the primary cAdvisor metrics are cpu, memory, disk io, and network io (although network is unreliable in most network implementations). Did we drop container-level CPU and memory metrics? I had assumed that was part of the change, but looking at it now it does not look like it. If we did not, removing container-level io read/write operations/bytes might indeed impact regular users.

Regarding the use of container_name vs container, that looks like a legacy Docker/CRI inconsistency. I'm kind of surprised if that is so, but it may be a known bug (would need to check with cAdvisor in forum-node) that they did not change because it would break users. In general, cAdvisor should return the same results across container runtimes, because all implementations of the container runtime should be using cgroups in very similar ways.
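The pod-slice point above can be checked with a pair of queries like the following. The metric and pod names are only examples, not values from this thread:

```promql
# Pod-slice total: the series whose 'container' label is empty.
container_cpu_usage_seconds_total{pod="example-0", container=""}

# Sum over the pod's containers. This can be lower than the
# pod-slice series above, because work charged directly to the
# pod cgroup is not attributed to any container.
sum by (pod) (container_cpu_usage_seconds_total{pod="example-0", container!=""})
```

If the pod-slice series is dropped at scrape time, only the second query remains answerable, which is exactly the under-counting concern raised here.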
Regarding the pod-level metrics being zero, I believe that's a problem in either Docker or the distro's configuration of Docker. Pod cgroups are supposed to sum container cgroups, and it's possible this is a Docker-specific bug. You'd need to file a separate bug, probably with the distro if this is an official configuration, or with Kubernetes if not. Usually bugs like this are due to cgroup inconsistencies between different Linux versions, or a missing configuration delivered to cgroups. In the short run, including container_fs_* for container-level metrics is one option for working around this, but what you are describing is incorrect behavior of your Linux distro + container runtime + kubelet.
Based on @smarterclayton's evaluation, I did some investigation locally (minikube, macOS), reverting the drop rule to compare the output. I'm not sure if the fact that this is tested on macOS has any bearing. Here is what we get on OpenShift, and it gives us what we expected to be available after this change was made. Checking locally with the docker runtime on minikube, my behaviour matches that reported by @mindw. Changing the driver to cri-o locally produces different output: here we notice that we get no 0 value at all; the series without the container label is absent. Finally, with
@simonpasquier, @dgrisonnet: based on the explanation by @smarterclayton, I believe this is probably a bug "somewhere", but given that the behaviour is nondeterministic at this point, I think we should revert the drop of these series.
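If the series to restore are the container_fs_* family discussed above, the partial revert could be sketched as narrowing the drop regex so that family no longer matches. This is a hypothetical sketch, not the merged change, and the remaining metric names are illustrative:

```yaml
metric_relabel_configs:
  # Still drop the pod-slice (container="") series for the remaining
  # container-centric families, but no longer match container_fs_*,
  # so pod-level filesystem I/O stays queryable.
  - source_labels: [__name__, container]
    regex: (container_blkio_.*|container_spec_.*);
    action: drop
```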
This change drops "pod-centric" metrics without a non-empty 'container' label.
Previously we dropped pod-centric metrics without a (pod, namespace) label set;
however, these can be critical for debugging.
Description
Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
If it fixes a bug or resolves a feature request, be sure to link to that issue.
Type of change
What type of changes does your code introduce to kube-prometheus? Put an `x` in the boxes that apply.

- [ ] CHANGE (fix or feature that would cause existing functionality to not work as expected)
- [ ] FEATURE (non-breaking change which adds functionality)
- [ ] BUGFIX (non-breaking change which fixes an issue)
- [ ] ENHANCEMENT (non-breaking change which improves existing functionality)
- [ ] NONE (if none of the other choices apply. Example: tooling, build system, CI, docs, etc.)

Changelog entry
Please put a one-line changelog entry below. Later this will be copied to the changelog file.