
[REGRESSION] Unable to cancel pipeline run - error PipelineRunCouldntCancel reported on a pending custom task #5282

Closed
rafalbigaj opened this issue Aug 8, 2022 · 7 comments · Fixed by #5287
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@rafalbigaj
Contributor

Expected Behavior

A pipeline run containing an unscheduled custom task should be cancellable.

Regression: This scenario works correctly on v0.33.1.

Actual Behavior

When cancelling a pipeline run that contains an unscheduled custom task, the error PipelineRunCouldntCancel is reported.

Steps to Reproduce the Problem

  1. Install the wait-task custom task controller.
  2. Run a sample pipeline run with two tasks, one dependent on the other:
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: cancel-test-
spec:
  pipelineSpec:
    tasks:
    - name: wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
      params:
        - name: duration
          value: 1h
    - name: wait-2
      runAfter:
        - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
      params:
        - name: duration
          value: 10s
  serviceAccountName: pipeline-runner
  3. Cancel the pipeline run by setting spec.status to "Cancelled" (see the patch command sketched after the status below). The final PipelineRun status is:
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  creationTimestamp: "2022-08-08T09:09:13Z"
  generateName: cancel-test-
  generation: 2
  labels:
    tekton.dev/pipeline: cancel-test-qwndq
  name: cancel-test-qwndq
  namespace: wait-task
  resourceVersion: "10224"
  uid: 5cb9a96f-fbfe-4103-961f-81ff775c0a33
spec:
  pipelineSpec:
    tasks:
    - name: wait-1
      params:
      - name: duration
        value: 1h
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    - name: wait-2
      params:
      - name: duration
        value: 10s
      runAfter:
      - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
  serviceAccountName: pipeline-runner
  status: Cancelled
  timeout: 1h0m0s
status:
  conditions:
  - lastTransitionTime: "2022-08-08T09:10:18Z"
    message: 'PipelineRun "cancel-test-qwndq" was cancelled but had errors trying
      to cancel TaskRuns and/or Runs: Failed to patch Run `cancel-test-qwndq-wait-2`
      with cancellation: runs.tekton.dev "cancel-test-qwndq-wait-2" not found'
    reason: PipelineRunCouldntCancel
    status: Unknown
    type: Succeeded
  pipelineSpec:
    tasks:
    - name: wait-1
      params:
      - name: duration
        value: 1h
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
    - name: wait-2
      params:
      - name: duration
        value: 10s
      runAfter:
      - wait-1
      taskSpec:
        apiVersion: example.dev/v0
        kind: Wait
        metadata: {}
        spec: null
  runs:
    cancel-test-qwndq-wait-1:
      pipelineTaskName: wait-1
      status:
        extraFields: null
    cancel-test-qwndq-wait-2:
      pipelineTaskName: wait-2
  startTime: "2022-08-08T09:09:13Z"
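
For reference, the cancellation in step 3 can be requested with a patch along these lines (a sketch; the resource name is whatever generateName produced for your run):

kubectl patch pipelinerun cancel-test-qwndq -n wait-task \
  --type merge -p '{"spec":{"status":"Cancelled"}}'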

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:36:49Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
Client version: 0.25.0
Pipeline version: v0.38.2
rafalbigaj added the kind/bug label Aug 8, 2022
@afrittoli
Member

Thank you for the bug report @rafalbigaj - it sounds like we don't have test coverage for this.
Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios.
/cc @abayer @lbernick

@afrittoli
Member

@rafalbigaj Does the cancel happen right away?
Both runs are in the status, which is why the controller attempts to cancel both and fails; that may be the source of the issue.
Are you using minimal embedded status in your setup? Any other alpha flag enabled?
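
For reference, the embedded-status behaviour is controlled through the feature-flags ConfigMap in the tekton-pipelines namespace; a minimal sketch for this release line (custom tasks also need enable-custom-tasks, and embedded-status takes full, minimal, or both):

apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  enable-custom-tasks: "true"
  embedded-status: "full"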

@abayer
Contributor

abayer commented Aug 8, 2022

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

👍 Agreed!

@dibyom
Member

dibyom commented Aug 8, 2022

cc #5120

pritidesai added the priority/critical-urgent label Aug 8, 2022
@pritidesai
Member

pritidesai commented Aug 8, 2022

Perhaps we should consider installing a simple custom run controller in our e2e tests so we may test these scenarios

Would this require changing plumbing to deploy the wait-task controller, or can it be done via e2e-common (similar to the pipeline controller)? Should we bring the wait-task controller into the pipeline repo to decouple from what's available in the experimental repo?

@afrittoli
Member

I wrote a reproducer unit test, so we can test this even without e2e tests; nonetheless, it would be good to install the wait controller.

/*
Copyright 2022 The Tekton Authors

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package pipelinerun

import (
	"fmt"
	"testing"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	"github.com/tektoncd/pipeline/test"
	"github.com/tektoncd/pipeline/test/parse"
	corev1 "k8s.io/api/core/v1"
	logtesting "knative.dev/pkg/logging/testing"

	_ "knative.dev/pkg/system/testing" // Setup system.Namespace()
)

func TestReconcileTwoCustomTasks(t *testing.T) {
	pipelineRunName := "cancel-test-run"
	prs := []*v1beta1.PipelineRun{parse.MustParsePipelineRun(t, `metadata:
  name: cancel-test-run
  namespace: foo
spec:
  pipelineSpec:
    tasks:
      - name: wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 1h
      - name: wait-2
        runAfter:
          - wait-1
        taskSpec:
          apiVersion: example.dev/v0
          kind: Wait
          params:
            - name: duration
              value: 10s
`)}

	cms := []*corev1.ConfigMap{withCustomTasks(newFeatureFlagsConfigMap())}

	d := test.Data{
		PipelineRuns: prs,
		ConfigMaps:   cms,
	}
	prt := newPipelineRunTest(d, t)
	defer prt.Cancel()

	pr, clients := prt.reconcileRun("foo", pipelineRunName, []string{}, false)
	fmt.Printf("%v", pr)
	err := cancelPipelineRun(prt.TestAssets.Ctx, logtesting.TestLogger(t), pr, clients.Pipeline)
	if err != nil {
		t.Fatalf("Error found: %v", err)
	}
}

I used git bisect to find the first failing commit 9d60d0a

$ git bisect start v0.37.0 v0.36.0 --
$ git bisect run go test -v github.com/tektoncd/pipeline/pkg/reconciler/pipelinerun -run TestReconcileTwoCustomTasks
(...)
9d60d0adf41e74f78029a0eaab73140fa4f7206a is the first bad commit

On the surface it looks unrelated; I need to dig in to understand what happened.

@afrittoli
Member

afrittoli commented Aug 9, 2022

The issue happens only with full embedded status; with minimal embedded status it works fine, i.e. in the test, using:

	cms := []*corev1.ConfigMap{withCustomTasks(withEmbeddedStatus(newFeatureFlagsConfigMap(), config.MinimalEmbeddedStatus))}

Works on main.
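
For context, with minimal embedded status the children are recorded under status.childReferences instead of the status.taskRuns / status.runs maps, roughly like this (a sketch based on the ChildStatusReference fields):

status:
  childReferences:
  - apiVersion: tekton.dev/v1alpha1
    kind: Run
    name: cancel-test-qwndq-wait-1
    pipelineTaskName: wait-1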

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Aug 9, 2022
Fixes tektoncd#5282

When removing conditions, I accidentally made `GetRunsStatus` not skip including custom task `ResolvedPipelineTask`s without `Run`s in the `PipelineRun.Status.Runs` map. This resulted in cancellation failing, because the cancellation logic tries to patch each `Run` or `TaskRun` in the `PipelineRun` status (either in the `Runs` and `TaskRuns` maps for full embedded status, or `ChildReferences` for minimal or both embedded status). Unscheduled `PipelineTask`s for `Task`s were already properly excluded from the status map, and `ChildReferences` was fine either way, but due to the oversight I mentioned above, unscheduled `PipelineTask`s for custom tasks _were_ included in the full embedded status map, and cancellation would error due to that.

Signed-off-by: Andrew Bayer <[email protected]>
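
For illustration, a minimal sketch of the guard described above; the type and field names are simplified stand-ins, not the actual tektoncd/pipeline types:

package main

import "fmt"

// resolvedPipelineTask is a simplified stand-in for the resolved pipeline
// state the reconciler builds while reconciling a PipelineRun.
type resolvedPipelineTask struct {
	PipelineTaskName string
	CustomTask       bool
	RunName          string // empty when no Run has been created yet
}

// runsStatus sketches the skip that the fix restores: custom-task pipeline
// tasks without a Run must not be added to PipelineRun.Status.Runs, otherwise
// cancellation later tries to patch a Run that does not exist (NotFound).
func runsStatus(state []resolvedPipelineTask) map[string]string {
	runs := map[string]string{}
	for _, rpt := range state {
		if !rpt.CustomTask || rpt.RunName == "" {
			continue // not a custom task, or not scheduled yet: no status entry
		}
		runs[rpt.RunName] = rpt.PipelineTaskName
	}
	return runs
}

func main() {
	state := []resolvedPipelineTask{
		{PipelineTaskName: "wait-1", CustomTask: true, RunName: "cancel-test-qwndq-wait-1"},
		{PipelineTaskName: "wait-2", CustomTask: true}, // blocked on wait-1, no Run yet
	}
	fmt.Println(runsStatus(state)) // map[cancel-test-qwndq-wait-1:wait-1]
}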
afrittoli added a commit to afrittoli/pipeline that referenced this issue Aug 9, 2022
When a pipelinerun is cancelled, it may be that some of the resources
managed by the pipelinerun are missing (NotFound). In this case we
should not fail, and let the pipelinerun be cancelled.

Discussion about this was initially triggered by
tektoncd#5282, but this feature
makes sense regardless of the bug that exposed its need.

Signed-off-by: Andrea Frittoli <[email protected]>
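
For illustration, a minimal sketch (not the actual Tekton code; cancelChild and patchFn are hypothetical) of tolerating missing children during cancellation, where a NotFound error from the patch is treated as nothing left to cancel:

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// cancelChild patches one child resource (a TaskRun or Run) with the
// cancellation spec via patchFn and ignores NotFound errors.
func cancelChild(name string, patchFn func(name string) error) error {
	err := patchFn(name)
	if apierrors.IsNotFound(err) {
		// The child no longer exists (or was never created); nothing to cancel.
		return nil
	}
	if err != nil {
		return fmt.Errorf("failed to patch %q with cancellation: %w", name, err)
	}
	return nil
}

func main() {
	// Simulate the failure from this issue: the Run for wait-2 was never created.
	notFound := func(name string) error {
		return apierrors.NewNotFound(schema.GroupResource{Group: "tekton.dev", Resource: "runs"}, name)
	}
	// Before this change the error bubbled up as PipelineRunCouldntCancel; here it is tolerated.
	fmt.Println(cancelChild("cancel-test-qwndq-wait-2", notFound)) // <nil>
}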
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Aug 9, 2022
Fixes tektoncd#5282

When removing conditions, I accidentally made `GetRunsStatus` not skip including custom task `ResolvedPipelineTask`s without `Run`s in the `PipelineRun.Status.Runs` map. This resulted in cancellation failing, because the cancellation logic tries to patch each `Run` or `TaskRun` in the `PipelineRun` status (either in the `Runs` and `TaskRuns` maps for full embedded status, or `ChildReferences` for minimal or both embedded status). Unscheduled `PipelineTask`s for `Task`s were already properly excluded from the status map, and `ChildReferences` was fine either way, but due to the oversight I mentioned above, unscheduled `PipelineTask`s for custom tasks _were_ included in the full embedded status map, and cancellation would error due to that.

cherry-picked from tektoncd#5287, with some additional modifications to child reference population for custom tasks without a `Run` which have been fixed in v0.38.x.

Signed-off-by: Andrew Bayer <[email protected]>