
Pod creation issue on Task retry with workspace.<name>.volume settings #7886

Closed
l-qing opened this issue Apr 16, 2024 · 3 comments · Fixed by #7887
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@l-qing
Contributor

l-qing commented Apr 16, 2024

Expected Behavior

The pipeline should be able to retry normally.

Actual Behavior

When retrying, the pipeline is unable to create the Pod for the retry.

failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1" is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2". Maybe missing or invalid Task default/test-sidecar-workspace

Steps to Reproduce the Problem

  1. Create the following Task and TaskRun.
cat <<'EOF' | kubectl replace --force -f -
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: test-sidecar-workspace
spec:
  workspaces:
  - description: cache
    name: cache
  steps:
  - name: command
    image: alpine
    volumeMounts:
    - mountPath: /cache
      name: $(workspaces.cache.volume)   # <-- It's this configuration that caused the above problem.
    script: |
      #!/bin/sh
      set -ex
      echo "hello world"
      exit 1
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: test-sidecar-workspace-run
spec:
  taskRef:
    name: test-sidecar-workspace
  retries: 1
  workspaces:
  - name: cache
    emptyDir: {}
EOF
  2. Wait for the TaskRun to be processed.
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
    tekton.dev/task: test-sidecar-workspace
  name: test-sidecar-workspace-run
spec:
  retries: 1
  serviceAccountName: default
  taskRef:
    kind: Task
    name: test-sidecar-workspace
  timeout: 1h0m0s
  workspaces:
    - emptyDir: {}
      name: cache
status:
  completionTime: "2024-04-16T09:58:14Z"
  conditions:
    - lastTransitionTime: "2024-04-16T09:58:14Z"
      message:
        'failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1"
        is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2".
        Maybe missing or invalid Task default/test-sidecar-workspace'
      reason: PodCreationFailed
      status: "False"
      type: Succeeded
  podName: ""
  provenance:
    featureFlags:
      AwaitSidecarReadiness: true
      Coschedule: workspaces
      DisableAffinityAssistant: false
      DisableCredsInit: false
      EnableAPIFields: beta
      EnableArtifacts: false
      EnableCELInWhenExpression: false
      EnableKeepPodOnCancel: false
      EnableParamEnum: false
      EnableProvenanceInStatus: true
      EnableStepActions: false
      EnableTektonOCIBundles: false
      EnforceNonfalsifiability: none
      MaxResultSize: 4096
      RequireGitSSHSecretKnownHosts: false
      ResultExtractionMethod: termination-message
      RunningInEnvWithInjectedSidecars: true
      ScopeWhenExpressionsToTask: false
      SendCloudEventsForRuns: false
      SetSecurityContext: false
      VerificationNoMatchPolicy: ignore
  retriesStatus:
    - completionTime: "2024-04-16T09:58:14Z"
      conditions:
        - lastTransitionTime: "2024-04-16T09:58:14Z"
          message: '"step-command" exited with code 1'
          reason: Failed
          status: "False"
          type: Succeeded
      podName: test-sidecar-workspace-run-pod
      provenance:
        featureFlags:
          AwaitSidecarReadiness: true
          Coschedule: workspaces
          DisableAffinityAssistant: false
          DisableCredsInit: false
          EnableAPIFields: beta
          EnableArtifacts: false
          EnableCELInWhenExpression: false
          EnableKeepPodOnCancel: false
          EnableParamEnum: false
          EnableProvenanceInStatus: true
          EnableStepActions: false
          EnableTektonOCIBundles: false
          EnforceNonfalsifiability: none
          MaxResultSize: 4096
          RequireGitSSHSecretKnownHosts: false
          ResultExtractionMethod: termination-message
          RunningInEnvWithInjectedSidecars: true
          ScopeWhenExpressionsToTask: false
          SendCloudEventsForRuns: false
          SetSecurityContext: false
          VerificationNoMatchPolicy: ignore
      startTime: "2024-04-16T09:58:06Z"
      steps:
        - container: step-command
          imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
          name: command
          terminated:
            containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
            exitCode: 1
            finishedAt: "2024-04-16T09:58:13Z"
            reason: Error
            startedAt: "2024-04-16T09:58:13Z"
      taskSpec:
        steps:
          - image: alpine
            name: command
            resources: {}
            script: |
              #!/bin/sh
              set -ex
              echo "hello world"
              exit 1
            volumeMounts:
              - mountPath: /cache
                name: ws-jchh2
        workspaces:
          - description: cache
            name: cache
  startTime: "2024-04-16T09:58:14Z"
  steps:
    - container: step-command
      imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
      name: command
      terminated:
        containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
        exitCode: 1
        finishedAt: "2024-04-16T09:58:13Z"
        reason: Error
        startedAt: "2024-04-16T09:58:13Z"
  taskSpec:
    steps:
      - image: alpine
        name: command
        resources: {}
        script: |
          #!/bin/sh
          set -ex
          echo "hello world"
          exit 1
        volumeMounts:
          - mountPath: /cache
            name: ws-jchh2
    workspaces:
      - description: cache
        name: cache

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.36.0
Pipeline version: v0.58.0

Analysis

0. Summary

  1. During the first reconciliation, the variable $(workspaces.<name>.volume) in the taskSpec is replaced with a randomly generated volume name.
  2. The substituted taskSpec is then stored in taskRun.status.taskSpec.
  3. On retry, the already-substituted taskSpec from the status is reused directly.
  4. However, CreateVolumes generates a new random volume name for the workspace on every reconcile.
  5. The newly generated volume name no longer matches the one baked into the stored taskSpec, which causes the error above (see the sketch after this list).
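
To make steps 4 and 5 concrete, here is a minimal, purely illustrative sketch (toy helper and hard-coded workspace, not Tekton code) of why the mount name stored in status.taskSpec and the volume name generated for the retry pod cannot match:

package main

import (
    "fmt"
    "math/rand"
)

// toy stand-in for names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("ws")
func randomVolumeName() string {
    const letters = "abcdefghijklmnopqrstuvwxyz0123456789"
    b := make([]byte, 5)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }
    return "ws-" + string(b)
}

func main() {
    // Reconcile #1: a random volume name is substituted into the taskSpec and
    // the substituted spec is persisted in status.taskSpec.
    storedMountName := randomVolumeName() // e.g. "ws-jchh2"

    // Retry reconcile: the stored spec (with its old mount name) is reused,
    // but CreateVolumes runs again and draws a fresh name for the same binding.
    retryVolumeName := randomVolumeName() // e.g. "ws-x4k9p"

    if storedMountName != retryVolumeName {
        // This mismatch is what the apiserver rejects:
        // spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2"
        fmt.Printf("volumeMount %q not found among pod volumes [%q]\n",
            storedMountName, retryVolumeName)
    }
}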

1. reconcile -> CreateVolumes && applyParamsContextsResultsAndWorkspaces && createPod

// Get the randomized volume names assigned to workspace bindings
workspaceVolumes := workspace.CreateVolumes(tr.Spec.Workspaces)

ts, err := applyParamsContextsResultsAndWorkspaces(ctx, tr, rtr, workspaceVolumes)
if err != nil {
    logger.Errorf("Error updating task spec parameters, contexts, results and workspaces: %s", err)
    return err
}
tr.Status.TaskSpec = ts
if len(tr.Status.TaskSpec.Steps) > 0 {
    logger.Debugf("set taskspec for %s/%s - script: %s", tr.Namespace, tr.Name, tr.Status.TaskSpec.Steps[0].Script)
}

if pod == nil {
    pod, err = c.createPod(ctx, ts, tr, rtr, workspaceVolumes)
    if err != nil {
        newErr := c.handlePodCreationError(tr, err)
        logger.Errorf("Failed to create task run pod for taskrun %q: %v", tr.Name, newErr)

2.1 CreateVolumes -> RestrictLengthWithRandomSuffix

// CreateVolumes will return a dictionary where the keys are the names of the workspaces bound in
// wb and the value is a newly-created Volume to use. If the same Volume is bound twice, the
// resulting volumes will both have the same name to prevent the same Volume from being attached
// to a pod twice. The names of the returned volumes will be a short random string starting "ws-".
func CreateVolumes(wb []v1.WorkspaceBinding) map[string]corev1.Volume {
    pvcs := map[string]corev1.Volume{}
    v := make(nameVolumeMap)
    for _, w := range wb {
        name := names.SimpleNameGenerator.RestrictLengthWithRandomSuffix(volumeNameBase)
        switch {
        case w.PersistentVolumeClaim != nil:
            // If it's a PVC, we need to check if we've encountered it before so we avoid mounting it twice
            if vv, ok := pvcs[w.PersistentVolumeClaim.ClaimName]; ok {
                v[w.Name] = vv
            } else {
                pvc := *w.PersistentVolumeClaim
                v.setVolumeSource(w.Name, name, corev1.VolumeSource{PersistentVolumeClaim: &pvc})
                pvcs[pvc.ClaimName] = v[w.Name]
            }
        case w.EmptyDir != nil:
            ed := *w.EmptyDir
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{EmptyDir: &ed})
        case w.ConfigMap != nil:
            cm := *w.ConfigMap
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{ConfigMap: &cm})
        case w.Secret != nil:
            s := *w.Secret
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{Secret: &s})
        case w.Projected != nil:
            s := *w.Projected
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{Projected: &s})
        case w.CSI != nil:
            csi := *w.CSI
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{CSI: &csi})
        }
    }
    return v
}

2.2 RestrictLengthWithRandomSuffix -> random string

// RestrictLengthWithRandomSuffix takes a base name and returns a potentially shortened version of that name with
// a random suffix, with the whole string no longer than 63 characters.
func (simpleNameGenerator) RestrictLengthWithRandomSuffix(base string) string {
    if len(base) > maxGeneratedNameLength {
        base = base[:maxGeneratedNameLength]
    }
    return fmt.Sprintf("%s-%s", base, utilrand.String(randomLength))
}
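
For reference, utilrand here is k8s.io/apimachinery/pkg/util/rand. A quick illustration (the base name and suffix length below are placeholders for this sketch, not the real constants) shows why each call yields a different name:

package main

import (
    "fmt"

    utilrand "k8s.io/apimachinery/pkg/util/rand"
)

func main() {
    // Two calls with the same base never coordinate their suffixes, so every
    // reconcile that runs CreateVolumes gets a brand-new "ws-" name.
    fmt.Println("ws-" + utilrand.String(5)) // e.g. ws-jchh2
    fmt.Println("ws-" + utilrand.String(5)) // e.g. ws-x4k9p
}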

3.1 applyParamsContextsResultsAndWorkspaces -> ApplyWorkspaces

// applyParamsContextsResultsAndWorkspaces applies paramater, context, results and workspace substitutions to the TaskSpec.
func applyParamsContextsResultsAndWorkspaces(ctx context.Context, tr *v1.TaskRun, rtr *resources.ResolvedTask, workspaceVolumes map[string]corev1.Volume) (*v1.TaskSpec, error) {
    ts := rtr.TaskSpec.DeepCopy()
    var defaults []v1.ParamSpec
    if len(ts.Params) > 0 {
        defaults = append(defaults, ts.Params...)
    }
    // Apply parameter substitution from the taskrun.
    ts = resources.ApplyParameters(ts, tr, defaults...)
    // Apply context substitution from the taskrun
    ts = resources.ApplyContexts(ts, rtr.TaskName, tr)
    // Apply task result substitution
    ts = resources.ApplyResults(ts)
    // Apply step exitCode path substitution
    ts = resources.ApplyStepExitCodePath(ts)
    // Apply workspace resource substitution
    // propagate workspaces from taskrun to task.
    twn := []string{}
    for _, tw := range ts.Workspaces {
        twn = append(twn, tw.Name)
    }
    for _, trw := range tr.Spec.Workspaces {
        skip := false
        for _, tw := range twn {
            if tw == trw.Name {
                skip = true
                break
            }
        }
        if !skip {
            ts.Workspaces = append(ts.Workspaces, v1.WorkspaceDeclaration{Name: trw.Name})
        }
    }
    ts = resources.ApplyWorkspaces(ctx, ts, ts.Workspaces, tr.Spec.Workspaces, workspaceVolumes)
    return ts, nil
}

3.2 ApplyWorkspaces -> ApplyReplacements

// ApplyWorkspaces applies the substitution from paths that the workspaces in declarations mounted to, the
// volumes that bindings are realized with in the task spec and the PersistentVolumeClaim names for the
// workspaces.
func ApplyWorkspaces(ctx context.Context, spec *v1.TaskSpec, declarations []v1.WorkspaceDeclaration, bindings []v1.WorkspaceBinding, vols map[string]corev1.Volume) *v1.TaskSpec {
    stringReplacements := map[string]string{}
    bindNames := sets.NewString()
    for _, binding := range bindings {
        bindNames.Insert(binding.Name)
    }
    for _, declaration := range declarations {
        prefix := fmt.Sprintf("workspaces.%s.", declaration.Name)
        if declaration.Optional && !bindNames.Has(declaration.Name) {
            stringReplacements[prefix+"bound"] = "false"
            stringReplacements[prefix+"path"] = ""
        } else {
            stringReplacements[prefix+"bound"] = "true"
            spec = applyWorkspaceMountPath(prefix+"path", spec, declaration)
        }
    }
    for name, vol := range vols {
        stringReplacements[fmt.Sprintf("workspaces.%s.volume", name)] = vol.Name
    }
    for _, binding := range bindings {
        if binding.PersistentVolumeClaim != nil {
            stringReplacements[fmt.Sprintf("workspaces.%s.claim", binding.Name)] = binding.PersistentVolumeClaim.ClaimName
        } else {
            stringReplacements[fmt.Sprintf("workspaces.%s.claim", binding.Name)] = ""
        }
    }
    return ApplyReplacements(spec, stringReplacements, map[string][]string{}, map[string]map[string]string{})
}

3.3 workspaces.%s.volume

for name, vol := range vols {
    stringReplacements[fmt.Sprintf("workspaces.%s.volume", name)] = vol.Name
}
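
Concretely, for the repro above the replacement map built on the first reconcile contains roughly the following entries (a hedged reconstruction; the exact key set may differ, and "ws-jchh2" is simply whatever name CreateVolumes drew that time):

package main

import "fmt"

func main() {
    stringReplacements := map[string]string{
        "workspaces.cache.bound":  "true",
        "workspaces.cache.volume": "ws-jchh2", // random name from CreateVolumes
        "workspaces.cache.claim":  "",
    }
    // $(workspaces.cache.volume) in the step's volumeMounts is replaced with this
    // literal value, and the substituted spec is then written to status.taskSpec
    // (step 4 below), freezing "ws-jchh2" into the stored taskSpec.
    fmt.Println(stringReplacements["workspaces.cache.volume"])
}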

4. Set replaced taskSpec to TaskRun Status

tr.Status.TaskSpec = ts

5. reconcile -> prepare -> GetTaskFuncFromTaskRun

// GetTaskFuncFromTaskRun is a factory function that will use the given TaskRef as context to return a valid GetTask function.
// It also requires a kubeclient, tektonclient, namespace, and service account in case it needs to find that task in
// cluster or authorize against an external repositroy. It will figure out whether it needs to look in the cluster or in
// a remote image to fetch the reference. It will also return the "kind" of the task being referenced.
// OCI bundle and remote resolution tasks will be verified by trusted resources if the feature is enabled
func GetTaskFuncFromTaskRun(ctx context.Context, k8s kubernetes.Interface, tekton clientset.Interface, requester remoteresource.Requester, taskrun *v1.TaskRun, verificationPolicies []*v1alpha1.VerificationPolicy) GetTask {
    // if the spec is already in the status, do not try to fetch it again, just use it as source of truth.
    // Same for the RefSource field in the Status.Provenance.
    if taskrun.Status.TaskSpec != nil {
        return func(_ context.Context, name string) (*v1.Task, *v1.RefSource, *trustedresources.VerificationResult, error) {
            var refSource *v1.RefSource
            if taskrun.Status.Provenance != nil {
                refSource = taskrun.Status.Provenance.RefSource
            }
            return &v1.Task{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      name,
                    Namespace: taskrun.Namespace,
                },
                Spec: *taskrun.Status.TaskSpec,
            }, refSource, nil, nil
        }
    }
    return GetTaskFunc(ctx, k8s, tekton, requester, taskrun, taskrun.Spec.TaskRef, taskrun.Name, taskrun.Namespace, taskrun.Spec.ServiceAccountName, verificationPolicies)
}

As the comment says, "if the spec is already in the status, do not try to fetch it again, just use it as source of truth." On the retry reconcile this returns the stored, already-substituted taskSpec (still referencing ws-jchh2), while CreateVolumes has just drawn a fresh random volume name, so the pod's volumeMounts and volumes can no longer agree.

Solution

1. Store the original taskSpec content in taskRun.status.taskSpec, rather than the values after variable substitution.

Advantages: It's very versatile; there is no need to worry about retries being affected even if similar random variables are introduced in the future.

Disadvantages: The information in status.taskSpec is no longer intuitive; it's impossible to directly see the actual variable values used during execution.

2. Fetch the taskSpec definition from the remote source each time, rather than reusing the one in the status.

Disadvantages: It would need to be fetched on every reconciliation, which is not very practical.

3. Change the method of generating volume names

Instead of being completely random, derive the volume name by hashing the original workspace name, so that the name generated each time is the same and is not duplicated within the current workspaces.

In the current taskSpec, only the value of $(workspaces.<name>.volume) may differ between reconciles; the values of all other variables are fixed and do not change.

I personally prefer to fix it with this solution; a minimal sketch of the idea follows.
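
A minimal sketch of the hashed-name idea, assuming a simple SHA-256 of the workspace name (this only illustrates the approach, not necessarily what tektoncd#7887 implements):

package main

import (
    "crypto/sha256"
    "fmt"
)

// hashedVolumeName derives the "ws-" name from the workspace name itself, so
// every reconcile of the same TaskRun computes the same value while distinct
// workspaces still get distinct names.
func hashedVolumeName(workspaceName string) string {
    sum := sha256.Sum256([]byte(workspaceName))
    return fmt.Sprintf("ws-%x", sum[:4]) // short, stable, DNS-label friendly
}

func main() {
    fmt.Println(hashedVolumeName("cache")) // same output on every call
    fmt.Println(hashedVolumeName("cache")) // so the retry pod's volumes match the stored spec
}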

4. Remove support for $(workspaces.<name>.volume)

@l-qing l-qing added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@l-qing
Contributor Author

l-qing commented Apr 16, 2024

/assign

l-qing added a commit to l-qing/pipeline that referenced this issue Apr 16, 2024
…e>.volume`

fix tektoncd#7886

Change the naming method of the workspace volume from completely random to
hashed, ensuring that the name generated during a single taskRun lifecycle is
consistent each time, and is unique within all current workspaces.

This way, we can reuse the logic of retrieving the taskSpec from the status,
and also store the content after variable expansion in the taskSpec of the
status for easy debugging; it will also not affect the reconstruction of the
pod when retrying.
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 16, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 23, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 24, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 25, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 25, 2024
l-qing added a commit to katanomi/pipeline that referenced this issue Apr 29, 2024
@tricktron

@l-qing I am facing this issue as well. Is there any manual workaround? Will this be backported to older stable versions?

@l-qing
Contributor Author

l-qing commented May 30, 2024

I am facing this issue as well: Is there any manual workaround? Will this be backported to older stable versions?

I didn't find a manual workaround.
I thought it was a low-probability issue, so I didn't consider backporting the fix to older stable versions.
If you think it's necessary, you can ask @vdemeester for help.
