
add taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints #6744

Merged: 1 commit merged into tektoncd:main on Jun 5, 2023

Conversation

gabemontero (Contributor) commented May 30, 2023

Changes

Fixes #6631

/kind feature

This commit adds new experimental gauge metrics that count the number of TaskRuns whose
underlying Pods are currently not being scheduled by Kubernetes:

  • one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
  • a second metric counts when Node-level CPU or memory utilization is such that the underlying Pod
    cannot be scheduled
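
(For illustration only: a minimal, self-contained Go sketch of the counting idea described above. The taskRun type and reason strings below are simplified stand-ins, not the actual implementation, which lives in pkg/taskrunmetrics/metrics.go and works from the TaskRun lister and the underlying Pod status.)

package main

import "fmt"

const (
    reasonExceededResourceQuota = "ExceededResourceQuota"
    reasonExceededNodeResources = "ExceededNodeResources"
)

// taskRun is a simplified stand-in for a Tekton TaskRun plus the reason
// reported on its pending Pod, if any.
type taskRun struct {
    done   bool
    reason string
}

// countThrottled returns how many not-yet-done TaskRuns are blocked by a
// namespace ResourceQuota and how many by node-level resource pressure.
func countThrottled(trs []taskRun) (byQuota, byNode int) {
    for _, tr := range trs {
        if tr.done {
            continue // completed TaskRuns are never counted as throttled
        }
        switch tr.reason {
        case reasonExceededResourceQuota:
            byQuota++
        case reasonExceededNodeResources:
            byNode++
        }
    }
    return byQuota, byNode
}

func main() {
    trs := []taskRun{
        {reason: reasonExceededResourceQuota},
        {reason: reasonExceededNodeResources},
        {done: true, reason: reasonExceededResourceQuota}, // ignored: already done
    }
    quota, node := countThrottled(trs)
    fmt.Printf("throttled_by_quota=%d throttled_by_node=%d\n", quota, node)
}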

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [x] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [ ] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • [n/a] Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

A new gauge metric for both PipelineRun and TaskRun will indicate whether underlying Pods are being throttled by Kubernetes because of either ResourceQuota policies defined in the namespace, or because the underlying node is experiencing resource constraints.

@vdemeester @khrm ptal

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 30, 2023
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.3% | 3.3 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.3% | 3.3 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

case pod.ReasonExceededResourceQuota:
prsThrottledByQuota++
// only count once per pipelinerun
exitForLoop = true
khrm (Contributor) commented May 30, 2023

Can we use break instead of exitForLoop?

gabemontero (Contributor, Author)

My recollection was that break only exits one for loop, not both.

The intent was to avoid bumping the counter for a pipelinerun multiple times because multiple taskruns are throttled.

Hence I want to break out of the for _, pr := range prs loop as well.

Given that context, are you OK with leaving this as is @khrm? Or do you think I'm missing something?

gabemontero (Contributor, Author)

Wait, if what I'm saying is the case, I'm not breaking out of the last loop.

Yeah, let me revisit @khrm ... either I break out of the last loop as well, or I just break here.

I'll update after I have fully revisited - thanks

gabemontero (Contributor, Author)

OK @khrm - while I did not add a break exactly here as you suggested, your flagging this made me realize that what I had previously pushed was not quite what I intended, which is:

  1. if a pipelinerun is throttled on both resource quota and node constraints by one of its underlying taskruns, bump each count once (once for resource quota, once for node constraints)
  2. if a pipelinerun has multiple taskruns throttled by either resource quota or node constraints, don't bump the counts for each taskrun, just once for the pipelinerun

Aside from the code changes (essentially two booleans instead of one for tracking this), I augmented the unit tests to create more than one taskrun per pipelinerun.

PTAL
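
(For illustration, a minimal Go sketch of the "count at most once per PipelineRun" behavior described in the comment above; the map-of-reasons input and reason strings are stand-ins, not the actual code in pkg/pipelinerunmetrics/metrics.go.)

package main

import "fmt"

// countThrottledPipelineRuns counts each PipelineRun at most once per
// condition, no matter how many of its TaskRuns are throttled. The input maps
// a PipelineRun name to the Pod reasons of its still-pending TaskRuns.
func countThrottledPipelineRuns(prReasons map[string][]string) (byQuota, byNode int) {
    for _, reasons := range prReasons {
        // Two booleans, one per condition, instead of a single exit flag.
        quotaSeen, nodeSeen := false, false
        for _, r := range reasons {
            switch r {
            case "ExceededResourceQuota":
                quotaSeen = true
            case "ExceededNodeResources":
                nodeSeen = true
            }
        }
        if quotaSeen {
            byQuota++
        }
        if nodeSeen {
            byNode++
        }
    }
    return byQuota, byNode
}

func main() {
    prs := map[string][]string{
        "pr-1": {"ExceededResourceQuota", "ExceededResourceQuota"}, // counted once for quota
        "pr-2": {"ExceededResourceQuota", "ExceededNodeResources"}, // counted once for each
    }
    q, n := countThrottledPipelineRuns(prs)
    fmt.Printf("byQuota=%d byNode=%d\n", q, n) // byQuota=2 byNode=1
}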

khrm (Contributor)

Yes, the current logic looks good to me.

case pod.ReasonExceededNodeResources:
prsThrottledByNode++
// only count once per pipelinerun
exitForLoop = true
khrm (Contributor)

Same here: Can we use break instead of exitForLoop?


tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.7% | 3.7 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

@lbernick lbernick removed their request for review May 30, 2023 18:56
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/pipelinerunmetrics/metrics.go | 80.0% | 83.7% | 3.7 |
| pkg/taskrunmetrics/metrics.go | 83.8% | 85.0% | 1.2 |

khrm (Contributor) left a comment

/approve

khrm (Contributor) commented May 31, 2023

/assign @vdemeester

vdemeester (Member) left a comment

tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 31, 2023
khrm (Contributor) commented May 31, 2023

Need to rebase this with the latest? @gabemontero

gabemontero (Contributor, Author)

Need to rebase this with the latest? @gabemontero

yeah I saw tide citing that earlier today @khrm but now it just says it needs the lgtm label

docs/metrics.md Outdated
| `tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `status`=<status> <br> `*task`=<task_name> <br> `*taskrun`=<taskrun_name> <br> `namespace`=<pipelineruns-taskruns-namespace> | experimental |
| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=<status> | experimental |
| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | | experimental |
Member

should this be quota_count?

gabemontero (Contributor, Author)

yep will update in next push - thanks

if err != nil {
return fmt.Errorf("failed to list pipelineruns while generating metrics : %w", err)
}

var runningPRs int
var prsThrottledByQuota int
var prsThrottledByNode int
for _, pr := range prs {
if !pr.IsDone() {
Member

can this if statement be removed?

gabemontero (Contributor, Author)

if, in conjunction, I move the runningPRs++ to after my

if pr.IsDone() {
  continue
}

block, then yes. In hindsight, I was leaning toward preserving the existing code path where possible.

I'll update in the next push, thanks.
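
(A short sketch of that simplification, with a stand-in pipelineRun type; the real loop is in pkg/pipelinerunmetrics/metrics.go.)

package main

import "fmt"

type pipelineRun struct{ done bool }

func (pr pipelineRun) IsDone() bool { return pr.done }

// countRunning skips completed runs up front, so no separate !IsDone() guard
// is needed around the counting and throttling checks.
func countRunning(prs []pipelineRun) int {
    runningPRs := 0
    for _, pr := range prs {
        if pr.IsDone() {
            continue
        }
        runningPRs++
        // ... throttling checks for the still-running PipelineRun would go here ...
    }
    return runningPRs
}

func main() {
    fmt.Println(countRunning([]pipelineRun{{done: true}, {done: false}})) // prints 1
}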

gabemontero (Contributor, Author)

assuming we keep the pipelinerun version of the metric re: our discussion starting at https://github.com/tektoncd/pipeline/pull/6744/files#r1213456583

ctx, _ := ttesting.SetupFakeContext(t)
informer := faketaskruninformer.Get(ctx)
// Add N randomly-named TaskRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.TaskRun{
Member

would it be possible to update this test to use table-driven testing instead?
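
(For context, a hypothetical illustration of the table-driven shape being suggested; the countThrottled helper and the cases are stand-ins, not the actual test in pkg/taskrunmetrics.)

package metrics_test

import "testing"

// countThrottled is a stand-in for the logic under test: it tallies Pod
// reasons into the two throttling buckets.
func countThrottled(reasons []string) (byQuota, byNode int) {
    for _, r := range reasons {
        switch r {
        case "ExceededResourceQuota":
            byQuota++
        case "ExceededNodeResources":
            byNode++
        }
    }
    return byQuota, byNode
}

func TestCountThrottled(t *testing.T) {
    for _, tc := range []struct {
        name                string
        reasons             []string
        wantQuota, wantNode int
    }{
        {name: "none throttled", reasons: []string{""}},
        {name: "throttled by quota", reasons: []string{"ExceededResourceQuota"}, wantQuota: 1},
        {name: "throttled by node", reasons: []string{"ExceededNodeResources"}, wantNode: 1},
        {name: "mixed", reasons: []string{"ExceededResourceQuota", "ExceededNodeResources"}, wantQuota: 1, wantNode: 1},
    } {
        t.Run(tc.name, func(t *testing.T) {
            gotQuota, gotNode := countThrottled(tc.reasons)
            if gotQuota != tc.wantQuota || gotNode != tc.wantNode {
                t.Errorf("countThrottled(%v) = (%d, %d), want (%d, %d)",
                    tc.reasons, gotQuota, gotNode, tc.wantQuota, tc.wantNode)
            }
        })
    }
}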

@@ -344,17 +369,33 @@ func (r *Recorder) RunningTaskRuns(ctx context.Context, lister listers.TaskRunLi
}

var runningTrs int
var trsThrottledByQuota int
var trsThrottledByNode int
for _, pr := range trs {
if !pr.IsDone() {
Member

can this if statement be removed?

gabemontero (Contributor, Author)

yep similar to the other pipelinerun discussion thread

for _, tr := range trs {
// NOTE: while GetIndexer().Add(obj) works well in these unit tests for List(), for
// trLister.TaskRuns(trNamespace).Get(trName) to work correctly we need to use GetStore() instead.
// see AddToInformer in test/controller.go for an analogous solution
Member

why not use SeedTestData or AddToInformer from test/controller.go instead? (here and in the taskrun tests)

gabemontero (Contributor, Author)

Yeah, I was trying to stay as close as possible to the patterns previously established in this test, like what I saw in TestRecordRunningPipelineRunsCount.

But sure, I can tinker with, say, using SeedTestData in my new test here and see where that lands.

reasons := []string{"", pod.ReasonExceededResourceQuota, pod.ReasonExceededNodeResources}

// Add N randomly-named PipelineRuns with differently-succeeded statuses.
for _, tr := range []*v1beta1.PipelineRun{
Member

could this use table-driven testing instead?

gabemontero (Contributor, Author)

similar rationale to https://github.com/tektoncd/pipeline/pull/6744/files#r1214733673 but sure I'll see about factoring that in as well.

@@ -59,6 +61,16 @@ var (
"Number of pipelineruns executing currently",
stats.UnitDimensionless)
runningPRsCountView *view.View

runningPRsThrottledByQuotaCount = stats.Float64("running_pipelineruns_throttled_by_quota_count",
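
(For readers unfamiliar with the OpenCensus pattern in the hunk above, a minimal standalone sketch of declaring, registering, and recording such a gauge; the name and recorded value here are illustrative, not the exact ones in Tekton's metrics.go.)

package main

import (
    "context"
    "log"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
)

var (
    // Measure for the current number of throttled runs.
    throttledByQuota = stats.Float64(
        "running_pipelineruns_throttled_by_quota_count",
        "Number of PipelineRuns currently throttled by a namespace ResourceQuota",
        stats.UnitDimensionless)

    // A LastValue view is what makes the measure behave as a gauge when exported.
    throttledByQuotaView = &view.View{
        Description: throttledByQuota.Description(),
        Measure:     throttledByQuota,
        Aggregation: view.LastValue(),
    }
)

func main() {
    if err := view.Register(throttledByQuotaView); err != nil {
        log.Fatalf("failed to register view: %v", err)
    }
    // Each reporting pass records the latest computed count.
    stats.Record(context.Background(), throttledByQuota.M(3))
}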
lbernick (Member)

I'm just curious, how useful is it to have separate pipelinerun metrics for this when they are derived from the taskruns?

gabemontero (Contributor, Author)

Fair question. Yeah, in theory, any of the existing total count / running count / duration metrics could be derived from their taskrun children as well, even if their current implementations do not need to venture into pr.Status.ChildReferences.

Argument for: it is an aggregation convenience, especially with dynamically generated pipelinerun names.
Argument against: perhaps you can sort out aggregation regardless of dynamically generated pipelinerun names? I'm not sure I am 100% confident of the answer there in the context of how knative exports metrics to things like Prometheus.

If you feel strongly that it is redundant @lbernick, I'm fine with pruning this PR to just the taskrun based metric, at least for the initial drop, and then we can get some real world feedback.

WDYT?

lbernick (Member)

My intuition is to stay consistent with the existing pod-related metrics (we provide tekton_pipelines_controller_taskruns_pod_latency but not tekton_pipelines_controller_pipelineruns_pod_latency) and in favor of a smaller, more focused PR; however, if you feel it's helpful/convenient to have this at the PipelineRun level, that's fine too! Especially considering it's an experimental metric, we could always decide to remove it later if it's not helpful.

gabemontero (Contributor, Author)

no that comparison with pod latency resonates @lbernick

I'll remove the pipelinerun related metric

thanks

lbernick (Member) commented Jun 1, 2023

Thanks for your PR @gabemontero! would you mind including a commit message/PR description with a quick summary of the changes and why you're making them?

@gabemontero gabemontero changed the title add pipelinerun/taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints add taskrun gauge metrics for k8s throttling because of defined resource quotas or k8s node constraints Jun 2, 2023
…rce quotas or k8s live node constraints

This commit adds new experimental gauge metrics that count the number of TaskRuns whose
underlying Pods are currently not being scheduled by Kubernetes:
- one metric counts when Kubernetes ResourceQuota policies within a Namespace prevent scheduling
- a second metric counts when Node-level CPU or memory utilization is such that the underlying Pod
cannot be scheduled
tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

gabemontero (Contributor, Author)

Thanks for your PR @gabemontero! would you mind including a commit message/PR description with a quick summary of the changes and why you're making them?

OK @lbernick, I believe I've pushed changes that address all the in-line code discussion, and I've attempted to address your commit message / PR description point ^^. I'm not 100% sure I covered everything you might have expected in the description / summary, though, so please advise if I missed a nuance there.

PTAL and thanks :-)

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/taskrunmetrics/metrics.go | 83.8% | 84.9% | 1.1 |

lbernick (Member) left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 5, 2023
@tekton-robot tekton-robot merged commit 4924b51 into tektoncd:main Jun 5, 2023
gabemontero (Contributor, Author)

hey @vdemeester - WDYT about cherry picking back to 0.47.x ?

vdemeester (Member)

I think it does make sense, yes. @lbernick @afrittoli @jerop SGTY?

lbernick (Member) commented Jun 5, 2023

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

/cherry-pick release-v0.47.x

tekton-robot (Collaborator)

@lbernick: new pull request created: #6774

In response to this:

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

/cherry-pick release-v0.47.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

gabemontero (Contributor, Author)

I'm totally fine with backporting this commit, just want to make sure we don't have the expectation that new features in general will be backported, just bug fixes.

yep agreed @lbernick - the relatively small size of the change in the end was the only rationale that led me to even ask

thanks for agreeing

/cherry-pick release-v0.47.x

@gabemontero gabemontero deleted the prototype-throttle-metrics branch June 5, 2023 19:27
khrm (Contributor) commented Jun 6, 2023

I think the metric on pipelineruns also made sense, because there's no 1-1 relation between blocked taskruns and pipelineruns. Multiple taskruns of a single pipelinerun could be blocked, couldn't they? So the pipelinerun metric reflected blocked pipelineruns, each of which could have multiple taskruns blocked.

gabemontero added a commit to gabemontero/pipeline that referenced this pull request Apr 11, 2024
Back when implementing tektoncd#6744 for tektoncd#6631
we failed to realize that with k8s quota policies being namespace scoped, knowing which namespace the throttled items were
in could have some diagnostic value.

Now that we have been using the metric added for a bit, this realization is now very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and
a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED
tekton-robot pushed a commit that referenced this pull request May 10, 2024
Back when implementing #6744 for #6631
we failed to realize that with k8s quota policies being namespace scoped, knowing which namespace the throttled items were
in could have some diagnostic value.

Now that we have been using the metric added for a bit, this realization is now very apparent.

This change introduces the namespace tag. Also, since last touching this space, the original metric was deprecated and
a new one with a shorter name was added. This change only updates the non-deprecated metric with the new label.

Lastly, the default behavior is preserved, and use of the new label only occurs when explicitly enabled in observability config map.
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lgtm: Indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

add metrics to capture k8s throttling impact on tekton execution
5 participants