
add metrics to capture k8s throttling impact on tekton execution #6631

Closed
gabemontero opened this issue May 5, 2023 · 0 comments · Fixed by #6744
Labels
area/metrics (Issues related to metrics), kind/feature (Categorizes issue or PR as related to a new feature)

Comments

@gabemontero
Contributor

gabemontero commented May 5, 2023

Feature request

Add metrics around when the scheduling of PipelineRuns/TaskRuns by k8s is delayed because either
a) k8s ResourceQuotas / LimitRanges defined in the namespace of said PipelineRun/TaskRun prevent the work from getting scheduled, or
b) Node resources (CPU or memory, for example) would be overextended if the PipelineRun/TaskRun were scheduled

Keep in mind, these delays can occur both
a) during the initial scheduling as part of creating the underlying Pod, and
b) when multiple Containers (i.e. Tekton Steps) are part of the Pod, during the scheduling of subsequent Containers after the initial Container has been scheduled; k8s can impose delays at that point as well

This builds upon the support already present in Tekton where Pod conditions around these situations are mapped to Tekton object conditions.
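
For illustration, here is a rough Go sketch (not the actual Tekton implementation) of how a controller could classify the two throttling causes from what k8s reports back. ResourceQuota/LimitRange violations typically surface as an error when creating the Pod, while exhausted Node resources surface as an unschedulable Pod condition; the strings matched below are assumptions based on common scheduler/quota-admission wording and may vary across k8s versions.

```go
// Sketch only: classify why a *Run's underlying Pod is being throttled by k8s.
// The matched strings are assumptions, not an API contract.
package throttle

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

type Reason string

const (
	NotThrottled     Reason = ""
	ThrottledByQuota Reason = "quota" // ResourceQuota / LimitRange blocked the Pod
	ThrottledByNode  Reason = "node"  // no Node has enough free CPU/memory
)

// FromPodCreateError classifies a failure to create the Pod at all, which is
// how ResourceQuota violations typically surface (admission rejects the create).
func FromPodCreateError(err error) Reason {
	if err != nil && strings.Contains(strings.ToLower(err.Error()), "exceeded quota") {
		return ThrottledByQuota
	}
	return NotThrottled
}

// FromPodConditions classifies a Pod that was created but cannot be scheduled,
// which is how exhausted Node resources typically surface.
func FromPodConditions(pod *corev1.Pod) Reason {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse &&
			c.Reason == corev1.PodReasonUnschedulable &&
			strings.Contains(strings.ToLower(c.Message), "insufficient") {
			return ThrottledByNode
		}
	}
	return NotThrottled
}
```

A reason classified this way could then be surfaced both as a *Run condition (as Tekton already does) and as a metric dimension.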

Why users would want this:
a) understanding the existence of such delays helps deal with performance concerns
b) these conditions are often transient, persisting for an indeterminate amount of time, and are difficult for users to capture/recognize live, while they are happening. K8s will also create Event objects, but those are often subject to pruning. Metrics are a way to create a more permanent record
c) metrics allow for alert generation, which facilitates remediation of issues
d) metrics are more "agnostic", which is of interest to those who like to shield, or be shielded from, the underlying k8s specifics whenever possible

NOTE 1: it is possible to get finer-grained metrics around these exceeded quota/node conditions (e.g. whether limit.memory was the constraint), but it involves parsing the message. There is no API expressing that detail. Granted, these messages have remained the same for a few years with k8s, but that does not mean they cannot change. So this feature request does not initially ask for that. But if the discussion around this feature wants it, then by all means, let's add that.

NOTE 2: Adding a label whose value is updated on each event, or a label per throttling event (with a counter included in the key), on the *Run object is also a possibility for providing a historical record of such throttling situations. It could include saving the human-understandable messages about which computing resource was the constraint, allowing for searching and investigation after the fact. It could also allow third-party add-ons that are willing to build metrics based on the content of the human-readable message to do so. Again, that could be added to this feature request, or lead to a follow-up request, based on the opinions/wishes of those in the community who provide input on this.
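
For illustration, here is a minimal Go sketch of the label-per-event idea; the label key prefix and counter scheme are hypothetical, not an existing Tekton convention.

```go
// Sketch only: persist a per-event record of throttling on the *Run object
// itself via labels with a counter in the key, e.g.
// "example.tekton.dev/throttled-2: ExceededResourceQuota".
package throttle

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const labelPrefix = "example.tekton.dev/throttled" // hypothetical key prefix

// RecordThrottleEvent appends a new label for this throttling event, leaving
// earlier events in place for after-the-fact searching and investigation.
func RecordThrottleEvent(obj metav1.Object, reason string) {
	labels := obj.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	// Count existing throttle labels to pick the next counter suffix.
	n := 0
	for k := range labels {
		if strings.HasPrefix(k, labelPrefix) {
			n++
		}
	}
	labels[fmt.Sprintf("%s-%d", labelPrefix, n)] = reason
	obj.SetLabels(labels)
}
```

One caveat on the design: k8s label values are limited to 63 characters and a restricted character set, so the full human-readable messages mentioned above would likely fit better in annotations than in label values.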

NOTE 3: I have an implementation that creates metrics based on these conditions at the ready, if the community is clearly good with pursuing this and thus willing to review the PR.

Use case

As a developer using a Tekton installation, having metrics available that allow me to see when k8s throttling is impacting my *Run performance would help me tune my Pipelines/Tasks/Namespaces.

As an operator responsible for maintaining a Tekton installation, having metrics available and being able to define alerts will allow me to proactively deal with user complaints about *Run performance, and to tune system resources, from changing policies to adding more Nodes, to help with said performance.

@vdemeester @khrm FYI per our in-house conversation earlier this week.

@gabemontero gabemontero added the kind/feature label May 5, 2023
@lbernick lbernick added the area/metrics label May 9, 2023
gabemontero added a commit to gabemontero/pipeline that referenced this issue Apr 11, 2024
Back when implementing tektoncd#6744 for tektoncd#6631, we failed to realize that, with k8s quota policies being namespace scoped, knowing which namespace the throttled items were in could have some diagnostic value.

Now that we have been using the added metric for a bit, this realization has become very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

tekton-robot pushed a commit that referenced this issue May 10, 2024
Back when implementing #6744 for #6631, we failed to realize that, with k8s quota policies being namespace scoped, knowing which namespace the throttled items were in could have some diagnostic value.

Now that we have been using the added metric for a bit, this realization has become very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

Lastly, the default behavior is preserved, and use of the new label only occurs when it is explicitly enabled in the observability config map.
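
For illustration, here is a rough Go sketch, with assumed metric/tag names rather than Tekton's actual ones, of how a throttling gauge could carry an optional namespace tag that is only attached when explicitly enabled via configuration, in the spirit of the change described above (the sketch uses OpenCensus, which Tekton's controllers build their metrics on):

```go
// Sketch only: a gauge of currently throttled TaskRuns with an optional
// "namespace" tag. Metric and tag names are assumptions for illustration.
package metrics

import (
	"context"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	throttledByQuota = stats.Int64(
		"running_taskruns_throttled_by_quota", // assumed name
		"Number of TaskRuns currently blocked by ResourceQuota/LimitRange policies",
		stats.UnitDimensionless)

	namespaceKey = tag.MustNewKey("namespace")
)

// RegisterViews wires up the view; the namespace tag is only included when
// the operator opted in (e.g. via the observability ConfigMap), preserving
// the default behavior otherwise.
func RegisterViews(namespaceTagEnabled bool) error {
	v := &view.View{
		Description: throttledByQuota.Description(),
		Measure:     throttledByQuota,
		Aggregation: view.LastValue(),
	}
	if namespaceTagEnabled {
		v.TagKeys = []tag.Key{namespaceKey}
	}
	return view.Register(v)
}

// ReportThrottledByQuota records the current count for a namespace; if the
// namespace tag was not registered on the view, the tag is simply dropped.
func ReportThrottledByQuota(ctx context.Context, ns string, count int64) error {
	return stats.RecordWithTags(ctx,
		[]tag.Mutator{tag.Upsert(namespaceKey, ns)},
		throttledByQuota.M(count))
}
```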