
add taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints #6744

Merged: 1 commit merged into tektoncd:main on Jun 5, 2023

Conversation

gabemontero (Contributor) commented May 30, 2023

Changes

Fixes #6631

/kind feature

This commit adds new experimental gauge metrics that count the number of TaskRuns whose
underlying Pods are currently not being scheduled by Kubernetes:

  • one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
  • a second metric counts when Node-level CPU or memory utilization is such that the underlying Pod
    cannot be scheduled
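
(For illustration only: a minimal, self-contained Go sketch of the counting idea described above. The taskRun type and reason strings below are simplified stand-ins, not the actual implementation, which lives in pkg/taskrunmetrics/metrics.go and works from the TaskRun lister and the underlying Pod status.)

package main

import "fmt"

const (
    reasonExceededResourceQuota = "ExceededResourceQuota"
    reasonExceededNodeResources = "ExceededNodeResources"
)

// taskRun is a simplified stand-in for a Tekton TaskRun plus the reason
// reported on its pending Pod, if any.
type taskRun struct {
    done   bool
    reason string
}

// countThrottled returns how many not-yet-done TaskRuns are blocked by a
// namespace ResourceQuota and how many by node-level resource pressure.
func countThrottled(trs []taskRun) (byQuota, byNode int) {
    for _, tr := range trs {
        if tr.done {
            continue // completed TaskRuns are never counted as throttled
        }
        switch tr.reason {
        case reasonExceededResourceQuota:
            byQuota++
        case reasonExceededNodeResources:
            byNode++
        }
    }
    return byQuota, byNode
}

func main() {
    trs := []taskRun{
        {reason: reasonExceededResourceQuota},
        {reason: reasonExceededNodeResources},
        {done: true, reason: reasonExceededResourceQuota}, // ignored: already done
    }
    quota, node := countThrottled(trs)
    fmt.Printf("throttled_by_quota=%d throttled_by_node=%d\n", quota, node)
}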

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [x] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [ ] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • [n/a] Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

A new gauge metric for both PipelineRun and TaskRun will indicate whether underlying Pods are being throttled by Kubernetes because of either ResourceQuota policies defined in the namespace, or because the underlying node is experiencing resource constraints.

@vdemeester @khrm ptal

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 30, 2023
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.3% | 3.3 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.3% | 3.3 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

case pod.ReasonExceededResourceQuota:
prsThrottledByQuota++
// only count once per pipelinerun
exitForLoop = true
khrm (Contributor) commented May 30, 2023

Can we use break instead of exitForLoop?

gabemontero (Contributor, Author)

My recollection was that break only exits one for loop, not both.

The intent was to avoid bumping the counter for a pipelinerun multiple times because multiple taskruns are throttled.

Hence I want to break out of the for _, pr := range prs loop as well.

Given that context, are you OK with leaving this as is @khrm? Or do you think I'm missing something?

gabemontero (Contributor, Author)

Wait, if what I'm saying is the case, I'm not breaking out of the last loop.

Yeah, let me revisit @khrm ... either I break out of the last loop as well, or I just break here.

I'll update after I have fully revisited - thanks

gabemontero (Contributor, Author)

OK @khrm - while I did not add a break exactly here as you suggested, your flagging this made me realize that what I had previously pushed was not quite what I intended, which is:

  1. if a pipelinerun is throttled on both resource quota and node constraints by one of its underlying taskruns, bump each count once (once for resource quota, once for node constraints)
  2. if a pipelinerun has multiple taskruns throttled by either resource quota or node constraints, don't bump the counts for each taskrun, just once for the pipelinerun

Aside from the code changes (essentially two booleans instead of one for tracking this), I augmented the unit tests to create more than one taskrun per pipelinerun.

PTAL
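
(For illustration, a minimal Go sketch of the "count at most once per PipelineRun" behavior described in the comment above; the map-of-reasons input and reason strings are stand-ins, not the actual code in pkg/pipelinerunmetrics/metrics.go.)

package main

import "fmt"

// countThrottledPipelineRuns counts each PipelineRun at most once per
// condition, no matter how many of its TaskRuns are throttled. The input maps
// a PipelineRun name to the Pod reasons of its still-pending TaskRuns.
func countThrottledPipelineRuns(prReasons map[string][]string) (byQuota, byNode int) {
    for _, reasons := range prReasons {
        // Two booleans, one per condition, instead of a single exit flag.
        quotaSeen, nodeSeen := false, false
        for _, r := range reasons {
            switch r {
            case "ExceededResourceQuota":
                quotaSeen = true
            case "ExceededNodeResources":
                nodeSeen = true
            }
        }
        if quotaSeen {
            byQuota++
        }
        if nodeSeen {
            byNode++
        }
    }
    return byQuota, byNode
}

func main() {
    prs := map[string][]string{
        "pr-1": {"ExceededResourceQuota", "ExceededResourceQuota"}, // counted once for quota
        "pr-2": {"ExceededResourceQuota", "ExceededNodeResources"}, // counted once for each
    }
    q, n := countThrottledPipelineRuns(prs)
    fmt.Printf("byQuota=%d byNode=%d\n", q, n) // byQuota=2 byNode=1
}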

khrm (Contributor)

Yes, the current logic looks good to me.

case pod.ReasonExceededNodeResources:
prsThrottledByNode++
// only count once per pipelinerun
exitForLoop = true
khrm (Contributor)

Same here: Can we use break instead of exitForLoop?


tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.7% | 3.7 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

@lbernick lbernick removed their request for review May 30, 2023 18:56
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.7% | 3.7 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

khrm (Contributor) left a comment

/approve

khrm (Contributor) commented May 31, 2023

/assign @vdemeester

vdemeester (Member) left a comment

tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2023
khrm (Contributor) commented May 31, 2023

Need to rebase this with the latest? @gabemontero

gabemontero (Contributor, Author)

Need to rebase this with the latest? @gabemontero

yeah I saw tide citing that earlier today @khrm but now it just says it needs the lgtm label

docs/metrics.md Outdated
| `tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `status`=<status> <br> `*task`=<task_name> <br> `*taskrun`=<taskrun_name> <br> `namespace`=<pipelineruns-taskruns-namespace> | experimental |
| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=<status> | experimental |
| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | | experimental |
Member

should this be quota_count?

gabemontero (Contributor, Author)

yep will update in next push - thanks

if err != nil {
return fmt.Errorf("failed to list pipelineruns while generating metrics : %w", err)
}

var runningPRs int
var prsThrottledByQuota int
var prsThrottledByNode int
for _, pr := range prs {
if !pr.IsDone() {
Member

can this if statement be removed?

gabemontero (Contributor, Author)

if, in conjunction, I move the runningPRs++ to after my

if pr.IsDone() {
  continue
}

block, then yes. In hindsight, I was leaning toward preserving the existing code path where possible.

I'll update in the next push, thanks.
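
(A short sketch of that simplification, with a stand-in pipelineRun type; the real loop is in pkg/pipelinerunmetrics/metrics.go.)

package main

import "fmt"

type pipelineRun struct{ done bool }

func (pr pipelineRun) IsDone() bool { return pr.done }

// countRunning skips completed runs up front, so no separate !IsDone() guard
// is needed around the counting and throttling checks.
func countRunning(prs []pipelineRun) int {
    runningPRs := 0
    for _, pr := range prs {
        if pr.IsDone() {
            continue
        }
        runningPRs++
        // ... throttling checks for the still-running PipelineRun would go here ...
    }
    return runningPRs
}

func main() {
    fmt.Println(countRunning([]pipelineRun{{done: true}, {done: false}})) // prints 1
}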

gabemontero (Contributor, Author)

assuming we keep the pipelinerun version of the metric re: our discussion starting at https://github.com/tektoncd/pipeline/pull/6744/files#r1213456583

ctx, _ := ttesting.SetupFakeContext(t)
informer := faketaskruninformer.Get(ctx)
// Add N randomly-named TaskRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.TaskRun{
Member

would it be possible to update this test to use table-driven testing instead?
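
(For context, a hypothetical illustration of the table-driven shape being suggested; the countThrottled helper and the cases are stand-ins, not the actual test in pkg/taskrunmetrics.)

package metrics_test

import "testing"

// countThrottled is a stand-in for the logic under test: it tallies Pod
// reasons into the two throttling buckets.
func countThrottled(reasons []string) (byQuota, byNode int) {
    for _, r := range reasons {
        switch r {
        case "ExceededResourceQuota":
            byQuota++
        case "ExceededNodeResources":
            byNode++
        }
    }
    return byQuota, byNode
}

func TestCountThrottled(t *testing.T) {
    for _, tc := range []struct {
        name                string
        reasons             []string
        wantQuota, wantNode int
    }{
        {name: "none throttled", reasons: []string{""}},
        {name: "throttled by quota", reasons: []string{"ExceededResourceQuota"}, wantQuota: 1},
        {name: "throttled by node", reasons: []string{"ExceededNodeResources"}, wantNode: 1},
        {name: "mixed", reasons: []string{"ExceededResourceQuota", "ExceededNodeResources"}, wantQuota: 1, wantNode: 1},
    } {
        t.Run(tc.name, func(t *testing.T) {
            gotQuota, gotNode := countThrottled(tc.reasons)
            if gotQuota != tc.wantQuota || gotNode != tc.wantNode {
                t.Errorf("countThrottled(%v) = (%d, %d), want (%d, %d)",
                    tc.reasons, gotQuota, gotNode, tc.wantQuota, tc.wantNode)
            }
        })
    }
}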

@@ -344,17 +369,33 @@ func (r *Recorder) RunningTaskRuns(ctx context.Context, lister listers.TaskRunLi
}

var runningTrs int
var trsThrottledByQuota int
var trsThrottledByNode int
for _, pr := range trs {
if !pr.IsDone() {
Member

can this if statement be removed?

gabemontero (Contributor, Author)

yep similar to the other pipelinerun discussion thread

for _, tr := range trs {
// NOTE: while GetIndexer().Add(obj) works well in these unit tests for List(), for
// trLister.TaskRuns(trNamespace).Get(trName) to work correctly we need to use GetStore() instead.
// see AddToInformer in test/controller.go for an analogous solution
Member

why not use SeedTestData or AddToInformer from test/controller.go instead? (here and in the taskrun tests)

gabemontero (Contributor, Author)

Yeah, I was trying to stay as close as possible to the patterns previously established in this test, like what I saw in TestRecordRunningPipelineRunsCount.

But sure, I can tinker with, say, using SeedTestData in my new test here and see where that lands.

reasons := []string{"", pod.ReasonExceededResourceQuota, pod.ReasonExceededNodeResources}

// Add N randomly-named PipelineRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.PipelineRun{
Member

could this use table-driven testing instead?

gabemontero (Contributor, Author)

similar rationale to https://github.com/tektoncd/pipeline/pull/6744/files#r1214733673 but sure I'll see about factoring that in as well.

@@ -59,6 +61,16 @@ var (
"Number of pipelineruns executing currently",
stats.UnitDimensionless)
runningPRsCountView *view.View

runningPRsThrottledByQuotaCount = stats.Float64("running_pipelineruns_throttled_by_quota_count",
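
(For readers unfamiliar with the OpenCensus pattern in the hunk above, a minimal standalone sketch of declaring, registering, and recording such a gauge; the name and recorded value here are illustrative, not the exact ones in Tekton's metrics.go.)

package main

import (
    "context"
    "log"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
)

var (
    // Measure for the current number of throttled runs.
    throttledByQuota = stats.Float64(
        "running_pipelineruns_throttled_by_quota_count",
        "Number of PipelineRuns currently throttled by a namespace ResourceQuota",
        stats.UnitDimensionless)

    // A LastValue view is what makes the measure behave as a gauge when exported.
    throttledByQuotaView = &view.View{
        Description: throttledByQuota.Description(),
        Measure:     throttledByQuota,
        Aggregation: view.LastValue(),
    }
)

func main() {
    if err := view.Register(throttledByQuotaView); err != nil {
        log.Fatalf("failed to register view: %v", err)
    }
    // Each reporting pass records the latest computed count.
    stats.Record(context.Background(), throttledByQuota.M(3))
}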
lbernick (Member)

I'm just curious, how useful is it to have separate pipelinerun metrics for this when they are derived from the taskruns?

gabemontero (Contributor, Author)

Fair question. Yeah, in theory, any of the existing total count / running count / duration metrics could be derived from their taskrun children as well, even if their current implementations do not need to venture into pr.Status.ChildReferences.

Argument for: it is an aggregation convenience, especially with dynamically generated pipelinerun names.
Argument against: perhaps you can sort out aggregation regardless of dynamically generated pipelinerun names? I'm not sure I am 100% confident of the answer there in the context of how knative exports metrics to things like Prometheus.

If you feel strongly that it is redundant @lbernick, I'm fine with pruning this PR to just the taskrun based metric, at least for the initial drop, and then we can get some real world feedback.

WDYT?

lbernick (Member)

My intuition is to stay consistent with the existing pod-related metrics (we provide tekton_pipelines_controller_taskruns_pod_latency but not tekton_pipelines_controller_pipelineruns_pod_latency) and in favor of a smaller, more focused PR; however, if you feel it's helpful/convenient to have this at the PipelineRun level, that's fine too! Especially considering it's an experimental metric, we could always decide to remove it later if it's not helpful.

gabemontero (Contributor, Author)

no that comparison with pod latency resonates @lbernick

I'll remove the pipelinerun related metric

thanks

lbernick (Member) commented Jun 1, 2023

Thanks for your PR @gabemontero! would you mind including a commit message/PR description with a quick summary of the changes and why you're making them?

@gabemontero gabemontero changed the title add pipelinerun/taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints add taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints Jun 2, 2023
…rce quotas or k8s live node constraints

This commit adds new experimental gauge metrics that count the number of TaskRuns whose
underlying Pods are currently not being scheduled by Kubernetes:
- one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
- a second metric counts when Node-level CPU or memory utilization is such that the underlying Pod
cannot be scheduled
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

gabemontero (Contributor, Author)

Thanks for your PR @gabemontero! would you mind including a commit message/PR description with a quick summary of the changes and why you're making them?

OK @lbernick, I believe I've pushed changes that address all the in-line code discussion, and I've attempted to address your commit message / PR description point ^^. I'm not 100% sure I covered everything you might have expected in the description / summary, though, so please advise if I missed a nuance there.

PTAL and thanks :-)

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

lbernick (Member) left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2023
@tekton-robot tekton-robot merged commit 4924b51 into tektoncd:main Jun 5, 2023
gabemontero (Contributor, Author)

hey @vdemeester - WDYT about cherry picking back to 0.47.x ?

vdemeester (Member)

I think it does make sense, yes. @lbernick @afrittoli @jerop SGTY?

lbernick (Member) commented Jun 5, 2023

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

/cherry-pick release-v0.47.x

tekton-robot (Collaborator)

@lbernick: new pull request created: #6774

In response to this:

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

/cherry-pick release-v0.47.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gabemontero (Contributor, Author)

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

yep agreed @lbernick - the relatively small size of the change in the end was the only rationale that led me to even ask

thanks for agreeing

/cherry-pick release-v0.47.x

@gabemontero gabemontero deleted the prototype-throttle-metrics branch June 5, 2023 19:27
khrm (Contributor) commented Jun 6, 2023

I think the metric on pipelineruns also made sense, because there's no 1-1 relation between blocked taskruns and pipelineruns. Multiple taskruns of a single pipelinerun could be blocked, couldn't they? So the pipelinerun metric reflected blocked pipelineruns, each of which could have multiple taskruns blocked.

gabemontero added a commit to gabemontero/pipeline that referenced this pull request Apr 11, 2024
Back when implementing tektoncd#6744 for tektoncd#6631
we failed to realize that with k8s quota policies being namespace scoped, knowing which namespace the throttled items were
in could have some diagnostic value.

Now that we have been using the metric added for a bit, this realization is now very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and
a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
tekton-robot pushed a commit that referenced this pull request May 10, 2024
Back when implementing #6744 for #6631
we failed to realize that with k8s quota policies being namespace scoped, knowing which namespace the throttled items were
in could have some diagnostic value.

Now that we have been using the metric added for a bit, this realization is now very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and
a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

Lastly, the default behavior is preserved, and use of the new label only occurs when explicitly enabled in observability config map.
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

add metrics to capture k8s throttling impact on tekton execution
5 participants