
do not allow negative requeue times #7589

Merged 1 commit into tektoncd:main on Feb 6, 2024

Conversation

@gabemontero (Contributor)

Fixes #7588

/kind bug

Changes

Use of the value 0 for the taskrun/pipeline timeout, which per https://github.com/tektoncd/pipeline/blob/main/docs/pipelineruns.md#configuring-a-failure-timeout means the timeout is disabled, results in the waitTime passed to the Requeue event being negative. This had the observed behavior of requeueing immediately, and intense cycles of many reconciliations per second were observed if the TaskRun's/PipelineRun's state did not in fact change. This artificially constrained the performance of the pipeline controller.

This change makes sure the wait time passed to the Requeue is not negative.
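For illustration, here is a minimal sketch of the failure mode described above (the variable names and the requeueAfter helper are assumptions for this example, not the controller's exact code): with timeouts disabled, a naive `timeout - elapsed` computation produces a negative wait, and a negative wait passed to a requeue event fires immediately.

```go
package main

import (
	"fmt"
	"time"
)

// requeueAfter sketches the buggy computation: time remaining until the
// run's timeout. With timeout == 0 ("disabled"), the result is negative
// for any positive elapsed time.
func requeueAfter(timeout, elapsed time.Duration) time.Duration {
	return timeout - elapsed
}

func main() {
	wait := requeueAfter(0, 5*time.Minute)
	fmt.Println(wait) // -5m0s

	// A negative wait behaves like "requeue now", so an unchanged
	// TaskRun/PipelineRun gets reconciled in a tight loop. The fix is
	// to ensure the wait passed to Requeue is never negative.
	if wait < 0 {
		fmt.Println("would requeue immediately: hot reconcile loop")
	}
}
```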

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • [n/a] Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
  • [x] Has Tests included if any functionality added or changed
  • [x] Follows the commit message standard
  • [x] Meets the Tekton contributor standards (including functionality, content, code)
  • [x] Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [x] Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
  • [n/a] Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

PipelineRuns and TaskRuns that disable timeouts will no longer experience rapid requeue reconciliations

@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Jan 19, 2024
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 19, 2024
@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/pipelinerun/pipelinerun.go | 92.2% | 92.3% | 0.0
pkg/reconciler/taskrun/taskrun.go | 85.2% | 84.7% | -0.5

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/pipelinerun/pipelinerun.go | 92.2% | 92.3% | 0.0
pkg/reconciler/taskrun/taskrun.go | 85.2% | 84.7% | -0.5

@gabemontero (Contributor, Author)

/assign @vdemeester
/assign @abayer
/assign @afrittoli

@khrm FYI

	timeout = 6 * time.Hour
case elapsed < 24*time.Hour:
	timeout = 24 * time.Hour
case elapsed < 48*time.Hour:
@khrm (Contributor)

We need to document this behavior.

Member

I tend to agree with @khrm, we need to document these times somehow.

@gabemontero (Contributor, Author)

Any suggestions as to where to document this, @khrm @vdemeester? The timeout feature is certainly discussed in the pipelinerun and taskrun API markdown docs, but I'm not sure we should clutter API-level discussion with a lower-level controller implementation detail.

And there is no current "controller implementation details" markdown, and I'm not sure this warrants the creation of such a thing.

And do we drop some hint of the pros and cons of disabling timeouts?

Pending responses from you all, in the interim I'll work on a doc update to the pipelinerun/taskrun API docs, where timeout is discussed. For the moment at least, I'll refrain from any disabling-timeouts warnings. Perhaps such warnings necessitate broader community agreement.

Thanks.

@gabemontero (Contributor, Author), Jan 23, 2024

Another option here @khrm @vdemeester is to just use the configured global default as the wait time, or some fraction of it, regardless of the elapsed time, when the user chooses to disable the timeout. That is probably sufficient here, and would allow existing controls as a means for the user to tweak things.

I'm leaning toward switching to that, but I'll wait to hear from you all before making the change here.

This lends itself to a cleaner explanation in the existing timeout sections of pipelineruns.md and taskruns.md, as I realized when I started tinkering with a doc change.

Also, if we go this path, I now realize we would probably want to update the release note to advertise this, right?

Thanks.

@gabemontero (Contributor, Author)

I believe I need to expand the 0-timeout checks to the other finally-related timeouts, to make sure there are no negative wait times for those as well

Member

@gabemontero If I understand correctly, as this is an internal implementation detail, very similar to something like the controller's "global resync period" (which is not documented anywhere), it might not make sense to document it, at least in API/user docs. I agree with that, so maybe just some code comments would make sense initially 🙃. I wish we had more "deep technical docs", but as we don't have them today, I don't want to block the PR on this.

@gabemontero (Contributor, Author)

I think you have captured it correctly @vdemeester

That said, any thoughts from anyone on my simplification of what wait time to use, which I noted in #7589 (comment)?

To some degree, these staggered wait times, which I essentially just grabbed from the air, don't feel ideal.

Also, with these requeues, if the item requeued with a wait gets updated, it gets reconciled on the update anyway, correct? So just using a constant value is perhaps simpler / better, no?

@gabemontero (Contributor, Author)

if my explanation is not clear, I can submit the alternative as a separate commit so folks can compare

@vdemeester (Member) left a comment

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 23, 2024
@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2024
@vdemeester vdemeester added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Jan 24, 2024
@vdemeester (Member)

cc @JeromeJu @chitrangpatel @afrittoli
I think we may want to backport this to a few LTS as well.
/cherry-pick release-v0.56.x
/cherry-pick release-v0.53.x
/cherry-pick release-v0.50.x
/cherry-pick release-v0.47.x

@tekton-robot (Collaborator)

@vdemeester: once the present PR merges, I will cherry-pick it on top of release-v0.56.x in a new PR and assign it to you.

In response to this:

cc @JeromeJu @chitrangpatel @afrittoli
I think we may want to backport this to a few LTS as well.
/cherry-pick release-v0.56.x
/cherry-pick release-v0.53.x
/cherry-pick release-v0.50.x
/cherry-pick release-v0.47.x

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester (Member)

Oh, it picks up just one cherry-pick per comment?
/cherry-pick release-v0.53.x

@vdemeester (Member)

/cherry-pick release-v0.50.x

@vdemeester (Member)

/cherry-pick release-v0.47.x

@gabemontero (Contributor, Author)

@vdemeester @khrm I have pushed a separate commit that replaces the staggered (but arbitrary) set of wait times based on elapsed time, and instead just uses the configured default timeout as the wait time.

If this approach seems more amenable to everyone, I can clean up, add a blurb in the docs if desired, and squash appropriately.

If the original form is preferred, I'll just drop the commit.

Or if there are any other suggestions for the wait time?

Or perhaps, in the case where timeout has been disabled, we do not requeue with a wait time at all, just return nil, and let updates to pods/taskruns/pipelineruns result in new reconciliations?
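A minimal sketch of the approach in that new commit (the requeueWait name and its parameters are hypothetical, and the clamping of an expired timeout is an assumption for completeness, not a claim about the merged code):

```go
package waittime

import "time"

// requeueWait sketches the idea: if the run's timeout is disabled (0),
// requeue after the operator-configured default timeout instead of a
// staggered, elapsed-time-based wait; otherwise requeue after the time
// remaining, clamped so it is never negative.
func requeueWait(timeout, elapsed, configuredDefault time.Duration) time.Duration {
	if timeout == 0 {
		// Timeout disabled: re-check on the default cadence, which the
		// user can already tune via existing configuration.
		return configuredDefault
	}
	if remaining := timeout - elapsed; remaining > 0 {
		return remaining
	}
	return 0 // deadline already passed; do not go negative
}
```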

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/pipelinerun/pipelinerun.go | 92.2% | 92.2% | -0.1
pkg/reconciler/taskrun/taskrun.go | 85.2% | 85.4% | 0.1

@tekton-robot (Collaborator)

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File | Old Coverage | New Coverage | Delta
pkg/reconciler/pipelinerun/pipelinerun.go | 92.2% | 92.2% | -0.1
pkg/reconciler/taskrun/taskrun.go | 85.2% | 85.4% | 0.1

@vdemeester (Member)

(quoting @gabemontero's comment above about using the configured default timeout as the wait time)

Makes sense to me on the new commit; it makes things simpler and "in the hands of the user". We can always refine this later if need be.

cc @chitrangpatel @JeromeJu

@JeromeJu JeromeJu self-assigned this Jan 29, 2024
@gabemontero (Contributor, Author)

(quoting the exchange above about using the configured default timeout as the wait time)

cool thanks @vdemeester .... I've gone ahead and squashed the commits and cleaned up the commented-out code

also, with this simpler approach, the doc update that you and @khrm mentioned earlier seemed easier, less clunky, so I've gone ahead and included that as well.

PTAL and see if it is appropriate for that section of the API doc. Thanks.

@chitrangpatel @JeromeJu FYI

@khrm (Contributor) left a comment

This implementation is better and simpler. I think we can proceed with that. Docs seem to be in the correct place.

@gabemontero (Contributor, Author)

the beta-integration test failure seems like an unrelated flake to me, as best as I can tell

@gabemontero (Contributor, Author)

/retest

@gabemontero (Contributor, Author)

CI is green; ready for further reviews / possible acceptance of the change @JeromeJu @chitrangpatel @afrittoli @vdemeester @khrm

thanks

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeromeJu, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JeromeJu,vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vdemeester (Member)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2024
@vdemeester (Member)

/retest

@tekton-robot tekton-robot merged commit 856fb4f into tektoncd:main Feb 6, 2024
12 checks passed
@tekton-robot (Collaborator)

@vdemeester: new pull request created: #7638


@tekton-robot (Collaborator)

@vdemeester: #7589 failed to apply on top of branch "release-v0.53.x":

Applying: do not allow negative requeue times
Using index info to reconstruct a base tree...
M	docs/pipelineruns.md
M	docs/taskruns.md
M	pkg/reconciler/pipelinerun/pipelinerun.go
M	pkg/reconciler/pipelinerun/pipelinerun_test.go
M	pkg/reconciler/taskrun/taskrun.go
M	pkg/reconciler/taskrun/taskrun_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/reconciler/taskrun/taskrun_test.go
Auto-merging pkg/reconciler/taskrun/taskrun.go
Auto-merging pkg/reconciler/pipelinerun/pipelinerun_test.go
Auto-merging pkg/reconciler/pipelinerun/pipelinerun.go
CONFLICT (content): Merge conflict in pkg/reconciler/pipelinerun/pipelinerun.go
Auto-merging docs/taskruns.md
Auto-merging docs/pipelineruns.md
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 do not allow negative requeue times
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@tekton-robot (Collaborator)

@vdemeester: #7589 failed to apply on top of branch "release-v0.50.x", with the same merge conflict in pkg/reconciler/pipelinerun/pipelinerun.go as above.

@tekton-robot (Collaborator)

@vdemeester: #7589 failed to apply on top of branch "release-v0.47.x", with merge conflicts in pkg/reconciler/pipelinerun/pipelinerun.go and pkg/reconciler/pipelinerun/pipelinerun_test.go.

@gabemontero gabemontero deleted the fix-neg-requeue-wait branch February 6, 2024 17:08
Labels: approved, kind/bug, lgtm, release-note, size/L

Successfully merging this pull request may close the issue: disabling pipelinerun/taskrun timeouts leads to overrun of reconcile requeues (#7588)