
Pod creation issue on Task retry with workspace.<name>.volume settings #7886

Closed
l-qing opened this issue Apr 16, 2024 · 3 comments · Fixed by #7887
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@l-qing
Contributor

l-qing commented Apr 16, 2024

Expected Behavior

The pipeline should be able to retry normally.

Actual Behavior

When retrying, the pipeline is unable to create the Pod for the retry.

failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1" is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2". Maybe missing or invalid Task default/test-sidecar-workspace

Steps to Reproduce the Problem

  1. Create the following Task and TaskRun.
cat <<'EOF' | kubectl replace --force -f -
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: test-sidecar-workspace
spec:
  workspaces:
  - description: cache
    name: cache
  steps:
  - name: command
    image: alpine
    volumeMounts:
    - mountPath: /cache
      name: $(workspaces.cache.volume)   # <-- It's this configuration that caused the above problem.
    script: |
      #!/bin/sh
      set -ex
      echo "hello world"
      exit 1
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: test-sidecar-workspace-run
spec:
  taskRef:
    name: test-sidecar-workspace
  retries: 1
  workspaces:
  - name: cache
    emptyDir: {}
EOF
  2. Wait for the TaskRun to be processed.
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generation: 1
  labels:
    app.kubernetes.io/managed-by: tekton-pipelines
    tekton.dev/task: test-sidecar-workspace
  name: test-sidecar-workspace-run
spec:
  retries: 1
  serviceAccountName: default
  taskRef:
    kind: Task
    name: test-sidecar-workspace
  timeout: 1h0m0s
  workspaces:
    - emptyDir: {}
      name: cache
status:
  completionTime: "2024-04-16T09:58:14Z"
  conditions:
    - lastTransitionTime: "2024-04-16T09:58:14Z"
      message:
        'failed to create task run pod "test-sidecar-workspace-run": Pod "test-sidecar-workspace-run-pod-retry1"
        is invalid: spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2".
        Maybe missing or invalid Task default/test-sidecar-workspace'
      reason: PodCreationFailed
      status: "False"
      type: Succeeded
  podName: ""
  provenance:
    featureFlags:
      AwaitSidecarReadiness: true
      Coschedule: workspaces
      DisableAffinityAssistant: false
      DisableCredsInit: false
      EnableAPIFields: beta
      EnableArtifacts: false
      EnableCELInWhenExpression: false
      EnableKeepPodOnCancel: false
      EnableParamEnum: false
      EnableProvenanceInStatus: true
      EnableStepActions: false
      EnableTektonOCIBundles: false
      EnforceNonfalsifiability: none
      MaxResultSize: 4096
      RequireGitSSHSecretKnownHosts: false
      ResultExtractionMethod: termination-message
      RunningInEnvWithInjectedSidecars: true
      ScopeWhenExpressionsToTask: false
      SendCloudEventsForRuns: false
      SetSecurityContext: false
      VerificationNoMatchPolicy: ignore
  retriesStatus:
    - completionTime: "2024-04-16T09:58:14Z"
      conditions:
        - lastTransitionTime: "2024-04-16T09:58:14Z"
          message: '"step-command" exited with code 1'
          reason: Failed
          status: "False"
          type: Succeeded
      podName: test-sidecar-workspace-run-pod
      provenance:
        featureFlags:
          AwaitSidecarReadiness: true
          Coschedule: workspaces
          DisableAffinityAssistant: false
          DisableCredsInit: false
          EnableAPIFields: beta
          EnableArtifacts: false
          EnableCELInWhenExpression: false
          EnableKeepPodOnCancel: false
          EnableParamEnum: false
          EnableProvenanceInStatus: true
          EnableStepActions: false
          EnableTektonOCIBundles: false
          EnforceNonfalsifiability: none
          MaxResultSize: 4096
          RequireGitSSHSecretKnownHosts: false
          ResultExtractionMethod: termination-message
          RunningInEnvWithInjectedSidecars: true
          ScopeWhenExpressionsToTask: false
          SendCloudEventsForRuns: false
          SetSecurityContext: false
          VerificationNoMatchPolicy: ignore
      startTime: "2024-04-16T09:58:06Z"
      steps:
        - container: step-command
          imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
          name: command
          terminated:
            containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
            exitCode: 1
            finishedAt: "2024-04-16T09:58:13Z"
            reason: Error
            startedAt: "2024-04-16T09:58:13Z"
      taskSpec:
        steps:
          - image: alpine
            name: command
            resources: {}
            script: |
              #!/bin/sh
              set -ex
              echo "hello world"
              exit 1
            volumeMounts:
              - mountPath: /cache
                name: ws-jchh2
        workspaces:
          - description: cache
            name: cache
  startTime: "2024-04-16T09:58:14Z"
  steps:
    - container: step-command
      imageID: docker.io/library/alpine@sha256:c5b1261d6d3e43071626931fc004f70149baeba2c8ec672bd4f27761f8e1ad6b
      name: command
      terminated:
        containerID: containerd://1beacfc9852b3a0c80eb30494bf80691d67533136e35f599d6bb62fe416bdfd1
        exitCode: 1
        finishedAt: "2024-04-16T09:58:13Z"
        reason: Error
        startedAt: "2024-04-16T09:58:13Z"
  taskSpec:
    steps:
      - image: alpine
        name: command
        resources: {}
        script: |
          #!/bin/sh
          set -ex
          echo "hello world"
          exit 1
        volumeMounts:
          - mountPath: /cache
            name: ws-jchh2
    workspaces:
      - description: cache
        name: cache

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: v1.29.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

Client version: 0.36.0
Pipeline version: v0.58.0

Analysis

0. Summary

  1. During the first reconciliation, the variable $(workspaces.<name>.volume) in the taskSpec is replaced with a randomly generated volume name.
  2. The substituted taskSpec is then stored in taskRun.status.taskSpec.
  3. On retry, the already-substituted taskSpec from the status is reused directly.
  4. However, CreateVolumes generates a new random volume name for the workspace on every reconcile.
  5. The newly generated volume name no longer matches the one baked into the stored taskSpec, which causes the error above (see the sketch after this list).
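
To make steps 4 and 5 concrete, here is a minimal, purely illustrative sketch (toy helper and hard-coded workspace, not Tekton code) of why the mount name stored in status.taskSpec and the volume name generated for the retry pod cannot match:

package main

import (
    "fmt"
    "math/rand"
)

// toy stand-in for names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("ws")
func randomVolumeName() string {
    const letters = "abcdefghijklmnopqrstuvwxyz0123456789"
    b := make([]byte, 5)
    for i := range b {
        b[i] = letters[rand.Intn(len(letters))]
    }
    return "ws-" + string(b)
}

func main() {
    // Reconcile #1: a random volume name is substituted into the taskSpec and
    // the substituted spec is persisted in status.taskSpec.
    storedMountName := randomVolumeName() // e.g. "ws-jchh2"

    // Retry reconcile: the stored spec (with its old mount name) is reused,
    // but CreateVolumes runs again and draws a fresh name for the same binding.
    retryVolumeName := randomVolumeName() // e.g. "ws-x4k9p"

    if storedMountName != retryVolumeName {
        // This mismatch is what the apiserver rejects:
        // spec.containers[0].volumeMounts[0].name: Not found: "ws-jchh2"
        fmt.Printf("volumeMount %q not found among pod volumes [%q]\n",
            storedMountName, retryVolumeName)
    }
}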

1. reconcile -> CreateVolumes && applyParamsContextsResultsAndWorkspaces && createPod

// Get the randomized volume names assigned to workspace bindings
workspaceVolumes := workspace.CreateVolumes(tr.Spec.Workspaces)

ts, err := applyParamsContextsResultsAndWorkspaces(ctx, tr, rtr, workspaceVolumes)
if err != nil {
    logger.Errorf("Error updating task spec parameters, contexts, results and workspaces: %s", err)
    return err
}
tr.Status.TaskSpec = ts
if len(tr.Status.TaskSpec.Steps) > 0 {
    logger.Debugf("set taskspec for %s/%s - script: %s", tr.Namespace, tr.Name, tr.Status.TaskSpec.Steps[0].Script)
}

if pod == nil {
    pod, err = c.createPod(ctx, ts, tr, rtr, workspaceVolumes)
    if err != nil {
        newErr := c.handlePodCreationError(tr, err)
        logger.Errorf("Failed to create task run pod for taskrun %q: %v", tr.Name, newErr)

2.1 CreateVolumes -> RestrictLengthWithRandomSuffix

// CreateVolumes will return a dictionary where the keys are the names of the workspaces bound in
// wb and the value is a newly-created Volume to use. If the same Volume is bound twice, the
// resulting volumes will both have the same name to prevent the same Volume from being attached
// to a pod twice. The names of the returned volumes will be a short random string starting "ws-".
func CreateVolumes(wb []v1.WorkspaceBinding) map[string]corev1.Volume {
    pvcs := map[string]corev1.Volume{}
    v := make(nameVolumeMap)
    for _, w := range wb {
        name := names.SimpleNameGenerator.RestrictLengthWithRandomSuffix(volumeNameBase)
        switch {
        case w.PersistentVolumeClaim != nil:
            // If it's a PVC, we need to check if we've encountered it before so we avoid mounting it twice
            if vv, ok := pvcs[w.PersistentVolumeClaim.ClaimName]; ok {
                v[w.Name] = vv
            } else {
                pvc := *w.PersistentVolumeClaim
                v.setVolumeSource(w.Name, name, corev1.VolumeSource{PersistentVolumeClaim: &pvc})
                pvcs[pvc.ClaimName] = v[w.Name]
            }
        case w.EmptyDir != nil:
            ed := *w.EmptyDir
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{EmptyDir: &ed})
        case w.ConfigMap != nil:
            cm := *w.ConfigMap
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{ConfigMap: &cm})
        case w.Secret != nil:
            s := *w.Secret
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{Secret: &s})
        case w.Projected != nil:
            s := *w.Projected
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{Projected: &s})
        case w.CSI != nil:
            csi := *w.CSI
            v.setVolumeSource(w.Name, name, corev1.VolumeSource{CSI: &csi})
        }
    }
    return v
}

2.2 RestrictLengthWithRandomSuffix -> random string

// RestrictLengthWithRandomSuffix takes a base name and returns a potentially shortened version of that name with
// a random suffix, with the whole string no longer than 63 characters.
func (simpleNameGenerator) RestrictLengthWithRandomSuffix(base string) string {
    if len(base) > maxGeneratedNameLength {
        base = base[:maxGeneratedNameLength]
    }
    return fmt.Sprintf("%s-%s", base, utilrand.String(randomLength))
}
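
For reference, utilrand here is k8s.io/apimachinery/pkg/util/rand. A quick illustration (the base name and suffix length below are placeholders for this sketch, not the real constants) shows why each call yields a different name:

package main

import (
    "fmt"

    utilrand "k8s.io/apimachinery/pkg/util/rand"
)

func main() {
    // Two calls with the same base never coordinate their suffixes, so every
    // reconcile that runs CreateVolumes gets a brand-new "ws-" name.
    fmt.Println("ws-" + utilrand.String(5)) // e.g. ws-jchh2
    fmt.Println("ws-" + utilrand.String(5)) // e.g. ws-x4k9p
}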

3.1 applyParamsContextsResultsAndWorkspaces -> ApplyWorkspaces

// applyParamsContextsResultsAndWorkspaces applies paramater, context, results and workspace substitutions to the TaskSpec.
func applyParamsContextsResultsAndWorkspaces(ctx context.Context, tr *v1.TaskRun, rtr *resources.ResolvedTask, workspaceVolumes map[string]corev1.Volume) (*v1.TaskSpec, error) {
    ts := rtr.TaskSpec.DeepCopy()
    var defaults []v1.ParamSpec
    if len(ts.Params) > 0 {
        defaults = append(defaults, ts.Params...)
    }
    // Apply parameter substitution from the taskrun.
    ts = resources.ApplyParameters(ts, tr, defaults...)
    // Apply context substitution from the taskrun
    ts = resources.ApplyContexts(ts, rtr.TaskName, tr)
    // Apply task result substitution
    ts = resources.ApplyResults(ts)
    // Apply step exitCode path substitution
    ts = resources.ApplyStepExitCodePath(ts)
    // Apply workspace resource substitution
    // propagate workspaces from taskrun to task.
    twn := []string{}
    for _, tw := range ts.Workspaces {
        twn = append(twn, tw.Name)
    }
    for _, trw := range tr.Spec.Workspaces {
        skip := false
        for _, tw := range twn {
            if tw == trw.Name {
                skip = true
                break
            }
        }
        if !skip {
            ts.Workspaces = append(ts.Workspaces, v1.WorkspaceDeclaration{Name: trw.Name})
        }
    }
    ts = resources.ApplyWorkspaces(ctx, ts, ts.Workspaces, tr.Spec.Workspaces, workspaceVolumes)
    return ts, nil
}

3.2 ApplyWorkspaces -> ApplyReplacements

// ApplyWorkspaces applies the substitution from paths that the workspaces in declarations mounted to, the
// volumes that bindings are realized with in the task spec and the PersistentVolumeClaim names for the
// workspaces.
func ApplyWorkspaces(ctx context.Context, spec *v1.TaskSpec, declarations []v1.WorkspaceDeclaration, bindings []v1.WorkspaceBinding, vols map[string]corev1.Volume) *v1.TaskSpec {
    stringReplacements := map[string]string{}
    bindNames := sets.NewString()
    for _, binding := range bindings {
        bindNames.Insert(binding.Name)
    }
    for _, declaration := range declarations {
        prefix := fmt.Sprintf("workspaces.%s.", declaration.Name)
        if declaration.Optional && !bindNames.Has(declaration.Name) {
            stringReplacements[prefix+"bound"] = "false"
            stringReplacements[prefix+"path"] = ""
        } else {
            stringReplacements[prefix+"bound"] = "true"
            spec = applyWorkspaceMountPath(prefix+"path", spec, declaration)
        }
    }
    for name, vol := range vols {
        stringReplacements[fmt.Sprintf("workspaces.%s.volume", name)] = vol.Name
    }
    for _, binding := range bindings {
        if binding.PersistentVolumeClaim != nil {
            stringReplacements[fmt.Sprintf("workspaces.%s.claim", binding.Name)] = binding.PersistentVolumeClaim.ClaimName
        } else {
            stringReplacements[fmt.Sprintf("workspaces.%s.claim", binding.Name)] = ""
        }
    }
    return ApplyReplacements(spec, stringReplacements, map[string][]string{}, map[string]map[string]string{})
}

3.3 workspaces.%s.volume

for name, vol := range vols {
    stringReplacements[fmt.Sprintf("workspaces.%s.volume", name)] = vol.Name
}
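
Concretely, for the repro above the replacement map built on the first reconcile contains roughly the following entries (a hedged reconstruction; the exact key set may differ, and "ws-jchh2" is simply whatever name CreateVolumes drew that time):

package main

import "fmt"

func main() {
    stringReplacements := map[string]string{
        "workspaces.cache.bound":  "true",
        "workspaces.cache.volume": "ws-jchh2", // random name from CreateVolumes
        "workspaces.cache.claim":  "",
    }
    // $(workspaces.cache.volume) in the step's volumeMounts is replaced with this
    // literal value, and the substituted spec is then written to status.taskSpec
    // (step 4 below), freezing "ws-jchh2" into the stored taskSpec.
    fmt.Println(stringReplacements["workspaces.cache.volume"])
}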

4. Set replaced taskSpec to TaskRun Status

tr.Status.TaskSpec = ts

5. reconcile -> prepare -> GetTaskFuncFromTaskRun

// GetTaskFuncFromTaskRun is a factory function that will use the given TaskRef as context to return a valid GetTask function.
// It also requires a kubeclient, tektonclient, namespace, and service account in case it needs to find that task in
// cluster or authorize against an external repositroy. It will figure out whether it needs to look in the cluster or in
// a remote image to fetch the reference. It will also return the "kind" of the task being referenced.
// OCI bundle and remote resolution tasks will be verified by trusted resources if the feature is enabled
func GetTaskFuncFromTaskRun(ctx context.Context, k8s kubernetes.Interface, tekton clientset.Interface, requester remoteresource.Requester, taskrun *v1.TaskRun, verificationPolicies []*v1alpha1.VerificationPolicy) GetTask {
    // if the spec is already in the status, do not try to fetch it again, just use it as source of truth.
    // Same for the RefSource field in the Status.Provenance.
    if taskrun.Status.TaskSpec != nil {
        return func(_ context.Context, name string) (*v1.Task, *v1.RefSource, *trustedresources.VerificationResult, error) {
            var refSource *v1.RefSource
            if taskrun.Status.Provenance != nil {
                refSource = taskrun.Status.Provenance.RefSource
            }
            return &v1.Task{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      name,
                    Namespace: taskrun.Namespace,
                },
                Spec: *taskrun.Status.TaskSpec,
            }, refSource, nil, nil
        }
    }
    return GetTaskFunc(ctx, k8s, tekton, requester, taskrun, taskrun.Spec.TaskRef, taskrun.Name, taskrun.Namespace, taskrun.Spec.ServiceAccountName, verificationPolicies)
}

As the comment says, "if the spec is already in the status, do not try to fetch it again, just use it as source of truth." On the retry reconcile this returns the stored, already-substituted taskSpec (still referencing ws-jchh2), while CreateVolumes has just drawn a fresh random volume name, so the pod's volumeMounts and volumes can no longer agree.

Solution

1. Store the original taskSpec content in taskRun.status.taskSpec, rather than the values after variable substitution.

Advantages: It's very versatile; there is no need to worry about retries being affected even if similar random variables are introduced in the future.

Disadvantages: The information in status.taskSpec is no longer intuitive; it's impossible to directly see the actual variable values used during execution.

2. Fetch the taskSpec definition from the remote source each time, rather than reusing the one in the status.

Disadvantages: It would need to be fetched on every reconciliation, which is not very practical.

3. Change the method of generating volume names

Instead of being completely random, derive the volume name by hashing the original workspace name, so that the name generated each time is the same and is not duplicated within the current workspaces.

In the current taskSpec, only the value of $(workspaces.<name>.volume) may differ between reconciles; the values of all other variables are fixed and do not change.

I personally prefer to fix it with this solution; a minimal sketch of the idea follows.
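
A minimal sketch of the hashed-name idea, assuming a simple SHA-256 of the workspace name (this only illustrates the approach, not necessarily what tektoncd#7887 implements):

package main

import (
    "crypto/sha256"
    "fmt"
)

// hashedVolumeName derives the "ws-" name from the workspace name itself, so
// every reconcile of the same TaskRun computes the same value while distinct
// workspaces still get distinct names.
func hashedVolumeName(workspaceName string) string {
    sum := sha256.Sum256([]byte(workspaceName))
    return fmt.Sprintf("ws-%x", sum[:4]) // short, stable, DNS-label friendly
}

func main() {
    fmt.Println(hashedVolumeName("cache")) // same output on every call
    fmt.Println(hashedVolumeName("cache")) // so the retry pod's volumes match the stored spec
}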

4. Remove support for $(workspaces.<name>.volume)

@l-qing l-qing added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2024
@l-qing
Contributor Author

l-qing commented Apr 16, 2024

/assign

l-qing added a commit to l-qing/pipeline that referenced this issue Apr 16, 2024
…e>.volume`

fix tektoncd#7886

Change the naming method of the workspace volume from completely random to
hashed, ensuring that the name generated during a single taskRun lifecycle is
consistent each time, and is unique within all current workspaces.

This way, we can reuse the logic of retrieving the taskSpec from the status,
and also store the content after variable expansion in the taskSpec of the
status for easy debugging; it will also not affect the reconstruction of the
pod when retrying.
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 16, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 23, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 24, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 25, 2024
l-qing added a commit to l-qing/pipeline that referenced this issue Apr 25, 2024
l-qing added a commit to katanomi/pipeline that referenced this issue Apr 29, 2024
@tricktron

@l-qing I am facing this issue as well. Is there any manual workaround? Will this be backported to older stable versions?

@l-qing
Contributor Author

l-qing commented May 30, 2024

I am facing this issue as well: Is there any manual workaround? Will this be backported to older stable versions?

I didn't find a manual workaround.
I thought it was a low-probability issue, so I didn't consider backporting the fix to older stable versions.
If you think it's necessary, you can ask @vdemeester for help.
