Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use kmeta.ChildName for child resources #4361

Merged
merged 1 commit into from
Nov 17, 2021

Conversation

afrittoli
Copy link
Member

@afrittoli afrittoli commented Nov 4, 2021

Changes

When a reconciler generates a child resource, it risks creating a
duplicate resource because of stale informer cache. This may happen
for instance when two reconciliations of the same resource happen
almost at the same time.
To avoid this risk, children resources should use a name that is uniquely
associated to that of the parent, so that any attempt of recreation would
fail. Such failure, when it happens, is considered as a transient error,
and it does not cause the TaskRun or the PipelineRun to fail.

We use kmeta.ChildName to generate the names. This applies to Pods
generated by TaskRuns and TaskRuns and Runs generated by PipelineRuns.
ChildName combines a base name with a suffix, and if the resulting name is
too long it shortens it by truncating it and appending an hash of the
original name: https://pkg.go.dev/github.com/knative/pkg/kmeta#ChildName

Resources that are not k8s resources, such as steps or scripts, are not
affected by this change.

Fixes #4358

Signed-off-by: Andrea Frittoli [email protected]

/kind bug

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Name of `Pods` and  `TaskRuns` created by Tekton controllers are now generated with [`kmeta.ChildName`](https://pkg.go.dev/github.com/knative/pkg/kmeta#ChildName):

> ChildName generates a name for the resource based upon the parent resource and suffix. If the concatenated name is longer than K8s permits the name is hashed and truncated to permit construction of the resource, but still keeps it unique. If the suffix itself is longer than 31 characters, then the whole string will be hashed and `parent|hash|suffix` will be returned, where parent and suffix will be trimmed to fit (prefix of parent at most of length 31, and prefix of suffix at most length 30).

The name of the `Pods` owned by a `TaskRun`  is univocally associated to the owning resource. If a `TaskRun` resource is deleted and created with the same name, the child `Pod` will be created with the same name as before.
The base format of the name is `<taskrun-name>-pod`. The name may vary as described above if it becomes too long.
In case of retries, starting from the first retry, the base format of the name is `<taskrun-name>-retry<N>` where `N` is the retry number.

The name of the `TaskRuns` and `Runs` owned by a `PipelineRun`  are univocally associated to the owning resource. If a `PipelineRun` resource is deleted and created with the same name, the child `TaskRuns` will be created with the same name as before.
The base format of the name is `<pipelinerun-name>-<pipelinetask-name>`. The name may vary as described above if it becomes too long.

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Nov 4, 2021
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 4, 2021
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 93.0% 91.8% -1.2

@imjasonh
Copy link
Member

imjasonh commented Nov 4, 2021

This looks great! What's WIP about it? (aside from unit tests 😈 )

@afrittoli
Copy link
Member Author

This looks great! What's WIP about it? (aside from unit tests 😈 )

Only that I need to fix the tests first :P

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 5, 2021
@afrittoli
Copy link
Member Author

There may be something to check about retries:

        status:
          completionTime: "2021-11-05T12:16:21Z"
          conditions:
          - lastTransitionTime: "2021-11-05T12:16:21Z"
            message: 'failed to create task run pod "retry-pipeline-retry-me": pods "retry-pipeline-retry-me-pod"
              already exists. Maybe invalid TaskSpec'
            reason: CouldntGetTask
            status: "False"
            type: Succeeded

Also I need to make sure that failures like this are considered transient and do not cause taskruns or pipelineruns to fail.

@afrittoli afrittoli force-pushed the issues/4358 branch 2 times, most recently from 1ae95c5 to 15732f2 Compare November 5, 2021 14:55
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3

@afrittoli
Copy link
Member Author

There may be something to check about retries:

        status:
          completionTime: "2021-11-05T12:16:21Z"
          conditions:
          - lastTransitionTime: "2021-11-05T12:16:21Z"
            message: 'failed to create task run pod "retry-pipeline-retry-me": pods "retry-pipeline-retry-me-pod"
              already exists. Maybe invalid TaskSpec'
            reason: CouldntGetTask
            status: "False"
            type: Succeeded

Also I need to make sure that failures like this are considered transient and do not cause taskruns or pipelineruns to fail.

For retries, the PipelineRun controller saves the original TaskRun status in the list of retries, and it clears the TaskRun status, so that the TaskRun controllers sees a TaskRun that needs to be reconciled, and triggers a new Pod creation. The Pod name must include the number of retry, at least starting from the first retry on.

@afrittoli afrittoli changed the title WIP Use kmeta.ChildName for child resources Use kmeta.ChildName for child resources Nov 5, 2021
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 5, 2021
@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-alpha-integration-tests

@afrittoli
Copy link
Member Author

I think the failures on pull-tekton-pipeline-alpha-integration-tests may be flakes.
/cc @imjasonh @mattmoor @vdemeester

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-alpha-integration-tests

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 8, 2021
@afrittoli afrittoli added this to the Pipelines v0.30 milestone Nov 8, 2021
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3

@ghost
Copy link

ghost commented Nov 8, 2021

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 8, 2021
@ghost
Copy link

ghost commented Nov 8, 2021

The run-steps-as-non-root Example failed with:

        >>> Container step-show-user-1001:
        Signal 23 (URG) caught by ps (3.3.16).
        ps:ps/display.c:66: please report this bug

It looks like this has been recently fixed in ps by ignoring the signal so probably nothing for us to do here. Re-running.

/test pull-tekton-pipeline-integration-tests

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @afrittoli 🎉

added a comment documenting the change to Runs as well

@@ -572,8 +573,7 @@ func getRunName(runsStatus map[string]*v1beta1.PipelineRunRunStatus, ptName, prN
return k
}
}

return names.SimpleNameGenerator.RestrictLengthWithRandomSuffix(fmt.Sprintf("%s-%s", prName, ptName))
return kmeta.ChildName(prName, fmt.Sprintf("-%s", ptName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like you've made the same change to Runs as well, please documentation for that like you've done for TaskRuns and Pods

adding that detail to the commit message and pr description would be great too!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thank you. Fixed.

@pritidesai
Copy link
Member

Thanks @afrittoli for this PR 🙏 I think this also fixes (both were closed but I just reopened them since the issues are still seen with recent releases) :

#3126
#3557

@pritidesai
Copy link
Member

I might be missing something but can we please add some tests around the release note if they don't exist already:

> ChildName generates a name for the resource based upon the parent resource and suffix. If the concatenated name is longer than K8s permits the name is hashed and truncated to permit construction of the resource, but still keeps it unique. If the suffix itself is longer than 31 characters, then the whole string will be hashed and `parent|hash|suffix` will be returned, where parent and suffix will be trimmed to fit (prefix of parent at most of length 31, and prefix of suffix at most length 30).

Documenting some samples would be ideal for the folks creating pipelineRuns 🙏 .

@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2021
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 93.0% 93.8% 0.8

@afrittoli
Copy link
Member Author

I might be missing something but can we please add some tests around the release note if they don't exist already:

> ChildName generates a name for the resource based upon the parent resource and suffix. If the concatenated name is longer than K8s permits the name is hashed and truncated to permit construction of the resource, but still keeps it unique. If the suffix itself is longer than 31 characters, then the whole string will be hashed and `parent|hash|suffix` will be returned, where parent and suffix will be trimmed to fit (prefix of parent at most of length 31, and prefix of suffix at most length 30).

Thank you - the logic of the ChildName function is tested on knative side, which is why I did not add tests.
I added some tests for that now so that if the logic is changed on the library side we will notice it and have a chance to update docs along with the tests accordingly.

Documenting some samples would be ideal for the folks creating pipelineRuns 🙏 .

Good idea, done!

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

examples_test.go:144: exit status 1 Output: Error from server (InternalError): error when creating "STDIN": Internal error occurred: failed calling webhook "webhook.pipeline.tekton.dev": Post "https://tekton-pipelines-webhook.tekton-pipelines.svc:443/defaulting?timeout=10s": context deadline exceeded

/test pull-tekton-pipeline-alpha-integration-tests

docs/pipelineruns.md Outdated Show resolved Hide resolved
Copy link
Member

@pritidesai pritidesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you @afrittoli

a couple of minor nits but otherwise looks great, thank you for fixing this 🙏

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pritidesai

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 16, 2021
When a reconciler generates a child resource, it risks creating a
duplicate resource because of stale informer cache. This may happen
for instance when two reconciliations of the same resource happen
almost at the same time.
To avoid this risk, children resources should use a name that is uniquely
associated to that of the parent, so that any attempt of recreation would
fail. Such failure, when it happens, is considered as a transient error,
and it does not cause the TaskRun or the PipelineRun to fail.

We use kmeta.ChildName to generate the names. This applies to Pods
generated by TaskRuns and TaskRuns and Runs generated by PipelineRuns.
ChildName combines a base name with a suffix, and if the resulting name is
too long it shortens it by truncating it and appending an hash of the
original name: https://pkg.go.dev/github.com/knative/pkg/kmeta#ChildName

Resources that are not k8s resources, such as steps or scripts, are not
affected by this change.

Fixes tektoncd#4358

Signed-off-by: Andrea Frittoli <[email protected]>
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/pod/pod.go 88.5% 88.8% 0.3
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 93.0% 93.8% 0.8

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2021
@tekton-robot tekton-robot merged commit 277dd9e into tektoncd:main Nov 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicated pod for a single TaskRun
7 participants