
taskruns will stay in Running(Pending) state if a non-existent PVC name is given to workspace #2994

Open
VeereshAradhya opened this issue Jul 23, 2020 · 23 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@VeereshAradhya

Expected Behavior

The pipelinerun should check whether the given PVC exists, and the pipelinerun should fail if the PVC is not present

Actual Behavior

The pipelinerun stays in the Running state and the taskruns stay in the Running(Pending) state

Steps to Reproduce the Problem

  1. Create a simple task that uses a workspace
  2. Create a pipeline using the above task
  3. Create a pipelinerun to run the above pipeline, or use tkn commands to run the pipeline

Additional Info

  • Kubernetes version:

    Output of kubectl version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean", BuildDate:"2020-01-29T21:26:39Z", GoVersion:"go1.14beta1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+8b0a82f", GitCommit:"8b0a82f", GitTreeState:"clean", BuildDate:"2020-07-10T05:34:00Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
 $ tkn version
Client version: dev
Pipeline version: v0.14.2
Triggers version: unknown

Command logs

$ tkn t ls
NAME                  DESCRIPTION   AGE
workspace-test-task                 29 minutes ago
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn p ls
NAME                      AGE              LAST RUN   STARTED   DURATION   STATUS
workspace-test-pipeline   28 minutes ago   ---        ---       ---        ---
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
No TaskRuns found
[varadhya@localhost workspace-testing]$ tkn p start workspace-test-pipeline 
Please give specifications for the workspace: shared-workspace 
? Name for the workspace : shared-workspace
? Value of the Sub Path :  
?  Type of the Workspace : pvc
? Value of Claim Name : simple-poc-non-existent
PipelineRun started: workspace-test-pipeline-run-xs5sw

In order to track the PipelineRun progress run:
tkn pipelinerun logs workspace-test-pipeline-run-xs5sw -f -n veeresh-testing
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                STARTED         DURATION   STATUS
workspace-test-pipeline-run-xs5sw   2 seconds ago   ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                STARTED         DURATION   STATUS
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv   6 seconds ago   ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 

Running using kubectl

[varadhya@localhost workspace-testing]$ kubectl apply -f test-run.yaml 
pipelinerun.tekton.dev/workspace-test-pipeline-run-test-invlaid created
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid   3 seconds ago   ---        Running
workspace-test-pipeline-run-xs5sw          1 minute ago    ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   4 seconds ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          1 minute ago    ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED          DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   54 seconds ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          2 minutes ago    ---        Running(Pending)
[varadhya@localhost workspace-testing]$ tkn pr ls
NAME                                       STARTED          DURATION   STATUS
workspace-test-pipeline-run-test-invlaid   58 seconds ago   ---        Running
workspace-test-pipeline-run-xs5sw          2 minutes ago    ---        Running
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   1 minute ago    ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          2 minutes ago   ---        Running(Pending)
[varadhya@localhost workspace-testing]$ 
[varadhya@localhost workspace-testing]$ tkn tr ls
NAME                                                       STARTED         DURATION   STATUS
workspace-test-pipeline-run-test-invlaid-list-repo-48n4h   5 minutes ago   ---        Running(Pending)
workspace-test-pipeline-run-xs5sw-list-repo-fzkzv          6 minutes ago   ---        Running(Pending)

task:

apiVersion: tekton.dev/v1beta1 
kind: Task 
metadata:
  name: workspace-test-task
spec:
  steps:
  - name: write-message
    image: ubuntu
    script: |
      #!/usr/bin/env bash
      set -xe
      echo hello! > $(workspaces.messages.path)/message
      cat $(workspaces.messages.path)/message
  workspaces:
  - name: messages
    description: The folder where we write the message to

pipeline:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: workspace-test-pipeline
spec:
  workspaces:
  - name: shared-workspace
  tasks:
  - name: list-repo
    taskRef:
      name: workspace-test-task
      kind: Task
    workspaces:
      - name: messages
        workspace: shared-workspace

pipelinerun:

apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: workspace-test-pipeline-run-test-invlaid
  namespace: veeresh-testing
spec:
  pipelineRef:
    name: workspace-test-pipeline
  timeout: 1h0m0s
  workspaces:
  - name: shared-workspace
    persistentVolumeClaim:
      claimName: simple-pvc-non-existent
@VeereshAradhya VeereshAradhya added the kind/bug Categorizes issue or PR as related to a bug. label Jul 23, 2020
@vdemeester
Member

/cc @jlpettersson @sbwsg

@VeereshAradhya VeereshAradhya changed the title from "the taskruns will stay in Running(Pending) state if non existing pvc name is given to workspace" to "taskruns will stay in Running(Pending) state if non existing pvc name is given to workspace" Jul 23, 2020
@ghost

ghost commented Jul 23, 2020

I'm not sure that I agree that a PipelineRun should fail if a PVC doesn't exist the moment a Pod runs. Isn't it possible that the volume is being provisioned by some other service and will appear some time after the Pod launches? Or will a PVC always exist prior to a Pod launching?

@jlpettersson
Member

Yes, this is the way Kubernetes works. It is hard to say whether we should do it differently.

Kubernetes is an eventually consistent system. E.g. you may create a PVC and a Pod/TaskRun in the same "kubectl create" command. But they are handled by different controllers that have different responsibilities, independent of each other.

If the taskrun controller were to be responsible for this, it would need to look up PVCs, and there could be a race within the same "kubectl create" command, so a missing PVC might be created just a few milliseconds after the taskrun controller does the PVC lookup.

I agree that the UX might be better if we surfaced this earlier. But that runs contrary to the loosely coupled, eventually consistent Kubernetes architecture that makes it very scalable and keeps controller responsibilities clearly bounded.
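For illustration only, a minimal client-go sketch (assumed names, not Tekton's actual code) of the kind of PVC lookup described above. Because of the race just mentioned, a NotFound result only means the claim is missing at this instant:

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pvcExists is a hypothetical helper: it reports whether the claim referenced
// by a workspace binding exists right now. A false result can be stale a few
// milliseconds later if the PVC is created by another controller or client.
func pvcExists(ctx context.Context, kc kubernetes.Interface, namespace, claimName string) (bool, error) {
	_, err := kc.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, fmt.Errorf("looking up PVC %s/%s: %w", namespace, claimName, err)
	}
	return true, nil
}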

@vdemeester
Member

@jlpettersson @sbwsg maybe we should at least update the Reason message to "say" that we are waiting for a PVC to be provisioned.

@GregDritschler
Contributor

GregDritschler commented Jul 23, 2020

I agree that this is Kubernetes behavior and I'd be cautious about changing it. This is just one manifestation of a non-existent resource. You can get the same behavior for example when referencing a configmap that doesn't yet exist.

I did look into improving the display status for issue #2268 but didn't get a lot of feedback. Part of the problem is that if there are multiple pipeline branches active, there can be multiple things going "wrong". It's hard to concisely summarize what's going on.

@ghost

ghost commented Jul 23, 2020

@jlpettersson @sbwsg maybe we should at least update the Reason message to "say" that we are waiting for a PVC to be provisioned.

This is an interesting idea. I wonder if we can detect from Pod status / events whether it is waiting on a PVC.
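As a rough sketch only (hypothetical helper, not an existing Tekton function), one way to probe this from the Pod's Events with client-go; the Reason and message strings vary by Kubernetes version, so the match below is purely illustrative:

package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podWaitingOnPVC lists the Events for a Pod and looks for a scheduling
// failure that mentions a persistentvolumeclaim, returning the event message
// so it could be surfaced in the TaskRun's Reason/Message.
func podWaitingOnPVC(ctx context.Context, kc kubernetes.Interface, namespace, podName string) (bool, string, error) {
	events, err := kc.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.kind=Pod,involvedObject.name=%s", podName),
	})
	if err != nil {
		return false, "", err
	}
	for _, ev := range events.Items {
		if ev.Reason == "FailedScheduling" && strings.Contains(ev.Message, "persistentvolumeclaim") {
			return true, ev.Message, nil
		}
	}
	return false, "", nil
}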

@ghost

ghost commented Jul 23, 2020

You can get the same behavior for example when referencing a configmap that doesn't yet exist.

Ah, good point, I hadn't considered the non-PVC scenarios either.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 21, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 20, 2020
@vdemeester
Member

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 7, 2020
@bobcatfish
Collaborator

This seems similar to #3563 in that it's a question of whether or not we should add PVC-specific behavior that differs from the k8s paradigm.

Since this issue went rotten and the discussion was pointing out that this is the standard k8s behavior (and the same question applies to other types like configmaps as @GregDritschler pointed out), I'm inclined to close this issue.

@vdemeester
Member

This seems similar to #3563 in that it's a question of whether or not we should add PVC-specific behavior that differs from the k8s paradigm.

I am not sure I fully agree on this. This issue is about what we should do when the PVC is not ready at the time we try to create the TaskRun/PipelineRun. Right now, we say we are pending because it might be available at some point in the future. But we never reconcile again (i.e., when it becomes available), so we always end up timing out (after 1h or more).

From a user perspective, I would either assume we are waiting because it will be reconciled at some point or it should fail directly. I think we need to decide on this.

@ghost

ghost commented Dec 8, 2020

But we never reconcile again (i.e., when it becomes available), so we always end up timing out (after 1h or more).

OH. That sounds like a bug, I 100% agree. I thought we'd only been discussing the messages when a PVC isn't available; I definitely did not understand that this was also an issue describing a bug where the reconcile doesn't occur again!

@jlpettersson
Member

From a user perspective, I would either assume we are waiting because it will be reconciled at some point or it should fail directly. I think we need to decide on this.

I don't think we should fail directly, as this is asynchronous (e.g. how Kubernetes works). This problem is very similar to #3378

@bobcatfish
Collaborator

Ah, OK, thanks for explaining that extra detail @vdemeester. I agree this is a bug.

I don't think we should fail directly, as this is asynchronous (e.g. how Kubernetes works).

I agree, I think we might want to subscribe to updates on the PVC if we can - does anyone know which part of the tekton code requires the PVC to exist?
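A minimal sketch of that idea, assuming a hypothetical enqueue callback into the reconciler (this is not how Tekton's controller is actually wired up): a shared informer on PersistentVolumeClaims whose Add handler re-triggers reconciliation when a claim appears.

package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchPVCs re-enqueues work whenever a PVC is created, so a TaskRun that was
// left pending on a missing claim gets reconciled again once the claim exists.
func watchPVCs(kc kubernetes.Interface, enqueue func(namespace, name string), stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(kc, 10*time.Minute)
	pvcInformer := factory.Core().V1().PersistentVolumeClaims().Informer()
	pvcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pvc := obj.(*corev1.PersistentVolumeClaim)
			enqueue(pvc.Namespace, pvc.Name)
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}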

@GregDritschler
Contributor

But we never reconcile again (i.e., when it becomes available), so we always end up timing out (after 1h or more).

OH. That sounds like a bug, I 100% agree. I thought we'd only been discussing the messages when a PVC isn't available; I definitely did not understand that this was also an issue describing a bug where the reconcile doesn't occur again!

This is news to me as well. I just tried it again, and when the missing PVC is created and bound to a PV, the pod runs and the TaskRun completes exactly as one would expect.

It is true that if the PVC is never created, the pipelinerun will time out. Again, that is what one would expect.

What is the scenario where the pipelinerun does not continue when the PVC is created and bound? I have not seen that reported before.

@jlpettersson
Member

jlpettersson commented Dec 8, 2020

does anyone know which part of the tekton code requires the PVC to exist?

A TaskRun could perhaps hold off on creating a Pod until the PVC exists? When creating a Pod from a TaskRun we know whether a PVC is passed or not. But wouldn't this prevent concurrent work? E.g. a node could pull the image concurrently while it waits for the PVC to be created.

We could perhaps have a shorter timeout if the Pod never starts to run? 1h is a bit long for that case.

What is the scenario where the pipelinerun does not continue when the PVC is created and bound? I have not seen that reported before.

Yes, there are many scenarios here if we are to handle them all. If we watch PVCs, Tekton can do more, e.g. if the PV is located in a different AZ the Pod cannot be scheduled either, and there are more reasons why a Pod might not be schedulable.
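As a sketch of the "hold off until the PVC exists" idea above (hypothetical helper, not Tekton code): before creating the Pod, the controller could check that the claim exists and is Bound, and otherwise leave the TaskRun pending for a later reconcile.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pvcReady reports whether the workspace's claim exists and is bound to a PV.
// A false result would mean "do not create the Pod yet", at the cost of losing
// the concurrent image pull mentioned above.
func pvcReady(ctx context.Context, kc kubernetes.Interface, namespace, claimName string) (bool, error) {
	pvc, err := kc.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, claimName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return pvc.Status.Phase == corev1.ClaimBound, nil
}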

@ghost

ghost commented Dec 8, 2020

I'm a bit confused. I feel like there are two different conversations happening at once here:

  1. There is apparently a bug in Tekton where a reconcile does not happen even when a PVC is created and bound successfully after the TaskRun is created. I cannot reproduce this using a simple TaskRun + PVC combo in Minikube. Definitely need more input here on specifically how to repro, @vdemeester.

  2. There is a broader question about how TaskRuns should report that they are waiting on PVCs.

I am much more immediately interested in fixing the possible bug described in (1) (from comment) than the messaging part of this. That sounds like a serious bug that needs to be fixed.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 7, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester vdemeester removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 16, 2023
@vdemeester vdemeester reopened this Nov 16, 2023
@vdemeester
Member

Re-opening this issue as I am not sure this has been fixed.
