[release-v0.53.x] wait for a given duration in case of imagePullBackOff #7677

pritidesai · 2024-02-15T18:58:37Z

Changes

We implemented imagePullBackOff handling to fail fast. The problem with this approach is that the error can be transient, depending on the infrastructure. The node where the pod is scheduled often hits a registry rate limit. An image pull failure caused by a rate limit returns the same warning (reason: Failed and message: ImagePullBackOff) as other failures such as an authentication failure or a missing image. In the rate-limit case, the pod can potentially recover after waiting long enough for the limit to expire; Kubernetes can then pull the image and bring the pod up. The fail-fast approach, however, fails the taskRun, and therefore the pipelineRun fails as well.

This PR introduces a default configuration option that specifies a cluster-level timeout, allowing imagePullBackOff to be retried for a given duration. Once that duration has passed, the controller returns a permanent failure.
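The retry-then-fail behaviour can be illustrated with a minimal sketch. This is not the code in this PR: the `decide` function, the `action` type, and the way elapsed time is measured are all hypothetical, and the sketch assumes a zero timeout keeps the existing fail-fast behaviour.

```go
package main

import "time"

// imagePullBackOff is the status string named in the description above.
const imagePullBackOff = "ImagePullBackOff"

// action is what a hypothetical reconciler does next for the pod.
type action int

const (
	proceed action = iota // the pod is not in image pull back-off
	requeue               // still within the timeout: wait and check again
	failRun               // timeout exceeded: report a permanent failure
)

// decide picks the next action. elapsed is how long the pod has been
// waiting, measured from whatever start point the controller uses (an
// assumption here). timeout is the cluster-level limit; with a zero
// timeout every back-off fails immediately, i.e. fail fast.
func decide(reason string, elapsed, timeout time.Duration) action {
	if reason != imagePullBackOff {
		return proceed
	}
	if elapsed < timeout {
		return requeue
	}
	return failRun
}
```

For example, `decide("ImagePullBackOff", time.Minute, 5*time.Minute)` yields `requeue`, while the same call with a zero timeout yields `failRun`.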

#5987
#7184

This is a manual cherry-pick of #7666

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
Has Tests included if any functionality added or changed
pre-commit Passed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Configure default-imagepullbackoff-timeout to allow imagePullBackOff to retry and wait for the specified duration before failing the pipeline.
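As a rough illustration of how such a setting might be read, the sketch below parses the `default-imagepullbackoff-timeout` key from a plain string map. It rests on assumptions: that the value is a Go duration string such as `5m`, and that an unset key means fail fast. The `parseTimeout` helper is hypothetical and is not the function used in `pkg/apis/config/default.go`.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutKey is the configuration key named in the release note.
const timeoutKey = "default-imagepullbackoff-timeout"

// parseTimeout reads the timeout from configuration data. A missing key
// yields zero, which this sketch treats as "fail fast" (an assumption).
func parseTimeout(data map[string]string) (time.Duration, error) {
	raw, ok := data[timeoutKey]
	if !ok {
		return 0, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("parsing %s value %q: %w", timeoutKey, raw, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("%s must not be negative, got %q", timeoutKey, raw)
	}
	return d, nil
}
```

Rejecting malformed or negative values at parse time keeps a bad setting from silently turning into fail-fast behaviour.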

tekton-robot · 2024-02-15T19:07:15Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.9%	91.2%	-0.7
pkg/reconciler/taskrun/taskrun.go	87.2%	85.4%	-1.8

tekton-robot · 2024-02-15T19:08:27Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.9%	91.2%	-0.7
pkg/reconciler/taskrun/taskrun.go	87.2%	85.4%	-1.8

We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often experiences a registry rate limit. An image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the limit to expire. Kubernetes can then successfully pull the image and bring the pod up.

Introducing a default configuration to specify a cluster-level timeout that allows imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

This is a manual cherry-pick of tektoncd#7666

Signed-off-by: Priti Desai <[email protected]>

tekton-robot · 2024-02-15T20:29:05Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.9%	91.2%	-0.7
pkg/reconciler/taskrun/taskrun.go	87.2%	85.4%	-1.8

tekton-robot · 2024-02-15T20:32:27Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage-df to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/config/default.go	91.9%	91.2%	-0.7
pkg/reconciler/taskrun/taskrun.go	87.2%	85.4%	-1.8

pritidesai · 2024-02-15T22:45:55Z

/retest

afrittoli

Thanks @pritidesai.
As discussed on Slack, we usually don't do feature backports, but given that:

- this is a "quality-of-life" change
- the default behaviour does not change
- the change is small enough
- v0.53 is a rather recent LTS

I think it is acceptable.
/lgtm

vdemeester · 2024-02-26T16:10:22Z

/approve

tekton-robot · 2024-02-26T16:10:26Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

vdemeester · 2024-02-26T16:10:32Z

/meow

tekton-robot · 2024-02-26T16:10:37Z

@vdemeester:

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Feb 15, 2024

tekton-robot requested review from bobcatfish and dibyom February 15, 2024 18:58

tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 15, 2024

pritidesai force-pushed the imagepullbackoff-0.53 branch from e2231cf to 0773b47 Compare February 15, 2024 20:23

afrittoli reviewed Feb 26, 2024

tekton-robot assigned afrittoli Feb 26, 2024

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2024

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 26, 2024

tekton-robot merged commit fca993b into tektoncd:release-v0.53.x Feb 26, 2024
10 of 11 checks passed
