Tekton shouldn't fail a PipelineRun/TaskRun for Kubernetes container start warnings. #7184

Closed
sauravdey opened this issue Oct 8, 2023 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@sauravdey

Expected Behavior

Intermittent failures during Kubernetes container start are failing the PipelineRun/TaskRun.
The failures are listed below:

  1. Failed to create subPath directory for volumeMount.
  2. ImagePullBackOff

In both cases the pod/container will eventually start and reach a running state, but Tekton fails the task immediately. The pod also keeps running, taking up resources, and most of the time it completes successfully.

Actual Behavior

Right now, on ImagePullBackOff the pipeline fails immediately, but the pod eventually starts.
Likewise, "Failed to create subPath directory for volumeMount" eventually succeeds, but the pipeline is marked as failed.

Steps to Reproduce the Problem

The issue is hit when a lot of pods/containers start at the same time on any Kubernetes node.
Set registryPullQPS and registryBurst to a lower number and start multiple PipelineRuns/TaskRuns where the image should be present locally on the node. The pull will eventually succeed, but the PipelineRun/TaskRun will fail. (See the kubelet configuration sketch below.)

Also, use an NFS-backed PVC and create multiple subPaths by running a lot of PipelineRuns (most of the time there is no issue). But sometimes subPath creation fails for one of the containers and the TaskRun fails, even though Kubernetes handles it and recreates the subPath.
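
For the rate-limit part of the reproduction, the kubelet image-pull throttling can be lowered in the kubelet configuration; a minimal sketch, with example values only (tune them for your cluster):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Throttle image pulls aggressively so that parallel TaskRun pods hit the
# limit and go into ImagePullBackOff (example values, not a recommendation).
registryPullQPS: 1
registryBurst: 1

With values like these, starting several PipelineRuns at once on the same node should produce the ImagePullBackOff warning even though the pull would eventually succeed.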

Additional Info

If these use cases cannot be handled, we should have some config to ignore these warnings.

  • Kubernetes version:

    Output of kubectl version:

Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.11", GitCommit:"8cfcba0b15c343a8dc48567a74c29ec4844e0b9e", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:38Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.47.2
@sauravdey added the kind/bug label Oct 8, 2023
@afrittoli
Member

Thanks for the bug report.
Since v0.51 we handle InvalidImageName as a permanent error, which causes Tekton to delete the Pod.

ImagePullBackOff is treated in the same way: because Tekton workloads are not restartable (unlike plain Pods), users cannot edit the image in a step and restart the same TaskRun; they must create a new one.
In some cases ImagePullBackOff may be caused by a temporary infrastructure issue (network, rate limiting, and so on), but since we have no way to distinguish those cases, we always fail the TaskRun and kill the Pod on ImagePullBackOff.
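
Roughly speaking, the controller reacts to a step container stuck in a waiting state in the Pod status; an illustrative snippet of what that looks like (the container name and image here are made up):

status:
  containerStatuses:
  - name: step-git-clone              # hypothetical step container name
    state:
      waiting:
        reason: ImagePullBackOff
        message: Back-off pulling image "example.com/org/image:tag"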

From the issue description, it sounds like this is not what you're experiencing. It could be that the behaviour for ImagePullBackOff was introduced later than v0.47.2; I need to check in the code base.

About the "Failed to create subPath directory for volumeMount." error: do you have an example Pod YAML with the failure that you can share?

@sauravdey
Author

sauravdey commented Oct 10, 2023

@afrittoli sure, here is a small snippet of the TaskRun:

apiVersion: tekton.dev/v1beta1 # assumed API version; adjust to the one in use
kind: TaskRun
metadata:
  generateName: clone-test
spec:
  params:
  - name: repo-url
    value: [email protected]:test/test.git
  taskRef:
    name: tekton-git-clone
  podTemplate:
    nodeSelector:
      kubernetes.io/role: tekton-test
  serviceAccountName: tekton-test-svc
  timeout: 12h0m0s
  workspaces:
  - name: output
    persistentVolumeClaim:
      claimName: tekton-test-pvc
    subPath: builds/test/$(context.taskRun.name)

$(context.taskRun.name) resolves to the generated name, which is different every time, and this path is created when the pod starts.
tekton-test-pvc is a PVC backed by NFS.

Most of the time the subPath creation will pass. But in case it fails, Kubernetes handles it and recreates the path, yet the Tekton TaskRun fails with failed to create subPath directory for volumeMount "ws-24dfd" of container "test".
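
For reference, in the generated Pod this roughly translates into a volumeMount like the following (illustrative only; the resolved TaskRun name below is made up):

volumeMounts:
- name: ws-24dfd                      # workspace volume named in the error above
  mountPath: /workspace/output        # Tekton mounts workspaces under /workspace/<name>
  subPath: builds/test/clone-testxyz  # $(context.taskRun.name) resolved to the generated name

The kubelet has to create that subPath directory on the NFS volume when the container starts, and that creation is the step that intermittently fails.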

@pritidesai
Member

/help-wanted

@afrittoli to determine the exact error and document here, thanks!

@afrittoli
Member

Thanks @sauravdey for sharing the TaskRun.
I was hoping to see the exact error message the Pod actually has; that would help with the fix.

@sauravdey
Author

@afrittoli
There is only an event when the container starts:
failed to create subPath directory for volumeMount <volume-name> of container <container-name>

@sauravdey
Author

@afrittoli any update on this? Do you need more information?

@pritidesai
Member

pritidesai commented Feb 5, 2024

In some cases ImagePullBackOff may be caused by a temporary infrastructure issue (network, rate limiting, and so on), but since we have no way to distinguish those cases, we always fail the TaskRun and kill the Pod on ImagePullBackOff.

Hey @afrittoli, we are running into this problem in our infrastructure (see comment in issue #5987). How about introducing opt-in functionality to avoid treating ImagePullBackOff as a permanent error?

pritidesai added a commit to pritidesai/pipeline that referenced this issue Feb 13, 2024
We have implemented imagePullBackOff as fail fast. The issue with this approach
is that the node where the pod is scheduled often experiences registry rate
limiting. The image pull failure caused by the rate limit returns the same
warning (reason: Failed, message: ImagePullBackOff). The pod can potentially
recover after waiting long enough for the cap to expire. Kubernetes can then
successfully pull the image and bring the pod up.

Introducing a default configuration to specify a cluster-level timeout that
allows the imagePullBackOff to retry for a given duration. Once that duration
has passed, return a permanent failure.

tektoncd#5987
tektoncd#7184

Signed-off-by: Priti Desai <[email protected]>

wait for a given duration in case of imagePullBackOff

Signed-off-by: Priti Desai <[email protected]>
@vdemeester
Member

Given that #7666 is merged (see docs), I'll go ahead and close this.
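
For reference, the setting added by #7666 lives in the config-defaults ConfigMap; a rough sketch of what it looks like (the key name and value here are from memory, so please verify against the linked docs for your installed release):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: tekton-pipelines
data:
  # How long a TaskRun may stay in ImagePullBackOff before Tekton treats it
  # as a permanent failure (assumed key name; check the release docs).
  default-imagepullbackoff-timeout: "5m"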
