
add metrics to capture k8s throttling impact on tekton execution #6631

Closed
gabemontero opened this issue May 5, 2023 · 0 comments · Fixed by #6744
Labels
area/metrics (Issues related to metrics), kind/feature (Categorizes issue or PR as related to a new feature)

Comments

@gabemontero
Contributor

gabemontero commented May 5, 2023

Feature request

Add metrics around when the scheduling of PipelineRuns/TaskRuns by k8s is delayed because either
a) k8s ResourceQuotas / LimitRanges defined in the namespace of said PipelineRun/TaskRun prevent the work from getting scheduled, or
b) Node resources (CPU or memory, for example) would be overextended if the PipelineRun/TaskRun were scheduled

Keep in mind, these delays can occur both
a) during the initial scheduling as part of creating the underlying Pod, and
b) when multiple Containers (i.e. Tekton Steps) are part of the Pod, during the scheduling of subsequent Containers after the initial Container has been scheduled; k8s can impose delays at that point as well

This builds upon the support already present in Tekton where Pod conditions around these situations are mapped to Tekton object conditions.
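
For illustration, here is a rough Go sketch (not the actual Tekton implementation) of how a controller could classify the two throttling causes from what k8s reports back. ResourceQuota/LimitRange violations typically surface as an error when creating the Pod, while exhausted Node resources surface as an unschedulable Pod condition; the strings matched below are assumptions based on common scheduler/quota-admission wording and may vary across k8s versions.

```go
// Sketch only: classify why a *Run's underlying Pod is being throttled by k8s.
// The matched strings are assumptions, not an API contract.
package throttle

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

type Reason string

const (
	NotThrottled     Reason = ""
	ThrottledByQuota Reason = "quota" // ResourceQuota / LimitRange blocked the Pod
	ThrottledByNode  Reason = "node"  // no Node has enough free CPU/memory
)

// FromPodCreateError classifies a failure to create the Pod at all, which is
// how ResourceQuota violations typically surface (admission rejects the create).
func FromPodCreateError(err error) Reason {
	if err != nil && strings.Contains(strings.ToLower(err.Error()), "exceeded quota") {
		return ThrottledByQuota
	}
	return NotThrottled
}

// FromPodConditions classifies a Pod that was created but cannot be scheduled,
// which is how exhausted Node resources typically surface.
func FromPodConditions(pod *corev1.Pod) Reason {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse &&
			c.Reason == corev1.PodReasonUnschedulable &&
			strings.Contains(strings.ToLower(c.Message), "insufficient") {
			return ThrottledByNode
		}
	}
	return NotThrottled
}
```

A reason classified this way could then be surfaced both as a *Run condition (as Tekton already does) and as a metric dimension.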

Why users would want this:
a) understanding the existence of such delays helps deal with performance concerns
b) these conditions are often transient, persisting for an indeterminate amount of time, and are difficult for users to capture/recognize live, while they are happening. K8s will also create Event objects, but those are often subject to pruning. Metrics are a way to create a more permanent record
c) metrics allow for alert generation, which facilitates remediation of issues
d) metrics are more "agnostic", which is of interest to those who like to shield, or be shielded from, the underlying k8s specifics whenever possible

NOTE 1: it is possible to get finer-grained metrics around these exceeded quota/node conditions (e.g. whether limit.memory was the constraint), but it involves parsing the message. There is no API expressing that detail. Granted, these messages have remained the same for a few years with k8s, but that does not mean they cannot change. So this feature request does not initially ask for that. But if the discussion around this feature wants it, then by all means, let's add that.

NOTE 2: Adding a label whose value is updated on each event, or a label per throttling event (with a counter included in the key), on the *Run object is also a possibility for providing a historical record of such throttling situations. It could include saving the human-understandable messages about which computing resource was the constraint, allowing for searching and investigation after the fact. It could also allow third-party add-ons that are willing to build metrics based on the content of the human-readable message to do so. Again, that could be added to this feature request, or lead to a follow-up request, based on the opinions/wishes of those in the community who provide input on this.
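
For illustration, here is a minimal Go sketch of the label-per-event idea; the label key prefix and counter scheme are hypothetical, not an existing Tekton convention.

```go
// Sketch only: persist a per-event record of throttling on the *Run object
// itself via labels with a counter in the key, e.g.
// "example.tekton.dev/throttled-2: ExceededResourceQuota".
package throttle

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const labelPrefix = "example.tekton.dev/throttled" // hypothetical key prefix

// RecordThrottleEvent appends a new label for this throttling event, leaving
// earlier events in place for after-the-fact searching and investigation.
func RecordThrottleEvent(obj metav1.Object, reason string) {
	labels := obj.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	// Count existing throttle labels to pick the next counter suffix.
	n := 0
	for k := range labels {
		if strings.HasPrefix(k, labelPrefix) {
			n++
		}
	}
	labels[fmt.Sprintf("%s-%d", labelPrefix, n)] = reason
	obj.SetLabels(labels)
}
```

One caveat on the design: k8s label values are limited to 63 characters and a restricted character set, so the full human-readable messages mentioned above would likely fit better in annotations than in label values.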

NOTE 3: I have an implementation that creates metrics based on these conditions at the ready, if the community is clearly good with pursuing this and thus willing to review the PR.

Use case

As a developer using a Tekton installation, having metrics available that allow me to see when k8s throttling is impacting my *Run performance would help me tune my Pipelines/Tasks/Namespaces.

As an operator responsible for maintaining a Tekton installation, having metrics available and being able to define alerts will allow me to proactively deal with user complaints about *Run performance, and to tune system resources, from changing policies to adding more Nodes, to help with said performance.

@vdemeester @khrm FYI per our in-house conversation earlier this week.

@gabemontero gabemontero added the kind/feature label May 5, 2023
@lbernick lbernick added the area/metrics label May 9, 2023
gabemontero added a commit to gabemontero/pipeline that referenced this issue Apr 11, 2024
Back when implementing tektoncd#6744 for tektoncd#6631, we failed to realize that, with k8s quota policies being namespace scoped, knowing which namespace the throttled items were in could have some diagnostic value.

Now that we have been using the added metric for a bit, this realization has become very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

tekton-robot pushed a commit that referenced this issue May 10, 2024
Back when implementing #6744 for #6631, we failed to realize that, with k8s quota policies being namespace scoped, knowing which namespace the throttled items were in could have some diagnostic value.

Now that we have been using the added metric for a bit, this realization has become very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

Lastly, the default behavior is preserved, and use of the new label only occurs when it is explicitly enabled in the observability config map.
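
For illustration, here is a rough Go sketch, with assumed metric/tag names rather than Tekton's actual ones, of how a throttling gauge could carry an optional namespace tag that is only attached when explicitly enabled via configuration, in the spirit of the change described above (the sketch uses OpenCensus, which Tekton's controllers build their metrics on):

```go
// Sketch only: a gauge of currently throttled TaskRuns with an optional
// "namespace" tag. Metric and tag names are assumptions for illustration.
package metrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	throttledByQuota = stats.Int64(
		"running_taskruns_throttled_by_quota", // assumed name
		"Number of TaskRuns currently blocked by ResourceQuota/LimitRange policies",
		stats.UnitDimensionless)

	namespaceKey = tag.MustNewKey("namespace")
)

// RegisterViews wires up the view; the namespace tag is only included when
// the operator opted in (e.g. via the observability ConfigMap), preserving
// the default behavior otherwise.
func RegisterViews(namespaceTagEnabled bool) error {
	v := &view.View{
		Description: throttledByQuota.Description(),
		Measure:     throttledByQuota,
		Aggregation: view.LastValue(),
	}
	if namespaceTagEnabled {
		v.TagKeys = []tag.Key{namespaceKey}
	}
	return view.Register(v)
}

// ReportThrottledByQuota records the current count for a namespace; if the
// namespace tag was not registered on the view, the tag is simply dropped.
func ReportThrottledByQuota(ctx context.Context, ns string, count int64) error {
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(namespaceKey, ns)},
		throttledByQuota.M(count))
}
```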