Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0069: Support retries for custom task in a pipeline. #4135

Closed
wants to merge 1 commit into from

Conversation

ScrapCodes
Copy link
Contributor

@ScrapCodes ScrapCodes commented Aug 3, 2021

Changes

  1. Add field Retries to RunSpec, an integer count which acts as a FYI to
    custom task controller.
  2. Add a new RunRetry, in addition to RunCancelled status to RunSpecStatus
    i.e. v1alpha1.RunSpecStatusRetry
  3. Add a field RetriesStatus to RunStatusFields, to maintain the retry
    history for a Run, similar to v1beta1.TaskRunStatusFields.RetriesStatus
  4. Add a config map entry (default-short-timeout-seconds) to config-defaults in
    order to make short timeout configurable.

Proposed algorithm for performing a retry for custom task.

  • Step 1. A pipelineTask consisting of a custom task X, is configured with
    retries count.

  • Step 2. On failure of task X, pipelinerun controller sees a request for a
    retry. It then communicates the same to custom task Run by patching
    /spec/status with a v1alpha1.RunSpecStatusRetry i.e. RunRetry. Similar
    to request a custom task to cancel.

  • Step 3. In addition to patching the pipelinerun controller also enqueue a timer
    time.After(default-short-timeout-seconds) (default: 30 seconds).
    On completion of timeout (i.e. 30s), it checks if /spec/status is RunRetry,
    then it assumes that custom task does not support retry.

    • a) if custom task does not supports retry as above, It sets no. of retry done
      to the retries count configured - i.e. exhaust all retries.
    • b) if custom task does support retry, update retry history.
  • Step 4. The custom task that wants to support the retry, has to update

    • a) status.conditions to indicate it is Running.
    • b) clear /spec/status if it is RunRetry.

A task may retry and immediately fail, so controller cannot fully rely on
status.conditions.

Docs changes

  1. Removed the limitation that retries are not supported for custom task.
  2. Add a short user guide for custom task developer.

/kind feature

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Docs included if any changes are user facing
  • Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Release notes block below has been filled in or deleted (only if no user facing changes)

Release Notes

Custom task can be configured with retry. e.g.

            spec:
              pipelineSpec:
                tasks:
                  - name: task-loop-with-retry
                    retries: 2 # retries parameter of a PipelineTask in Pipeline definition.
                    runAfter:
                      - first-task
                    params:
                      - name: word
                        value:
                          - jump
                          - land
                          - roll
                      - name: suffix
                        value: ing
                    taskRef:
                      apiVersion: custom.tekton.dev/v1alpha1
                      kind: TaskLoop
                      name: simpletaskloop

 1. Add field `Retries` to `RunSpec`, an integer count which acts as a FYI to
    custom task controller.
 2. Add a new `RunRetry`, in addition to `RunCancelled` status to `RunSpecStatus`
    i.e. `v1alpha1.RunSpecStatusRetry`
 3. Add a field `RetriesStatus` to `RunStatusFields`, to maintain the retry
    history for a `Run`, similar to `v1beta1.TaskRunStatusFields.RetriesStatus`
 4. Add a config map entry (default-short-timeout-seconds) to `config-defaults` in
order to make short timeout configurable.

 Proposed algorithm for performing a retry for custom task.

 - Step 1. A `pipelineTask` consisting of a custom task X, is configured with
   `retries` count.

 - Step 2. On failure of task X, `pipelinerun` controller sees a request for a
   retry. It then communicates the same to custom task `Run` by patching
   `/spec/status` with a `v1alpha1.RunSpecStatusRetry` i.e. `RunRetry`. Similar
   to request a custom task to cancel.

 - Step 3. In addition to patching the `pipelinerun` controller also enqueue a timer
   `time.After(default-short-timeout-seconds)` (default: 30 seconds).
   On completion of timeout (i.e. 30s), it checks if `/spec/status` is `RunRetry`,
   then it assumes that custom task does not support retry.
     - a) if custom task does not supports retry as above, It sets no. of `retry done`
     to the `retries` count configured - i.e. exhaust all retries.
     - b) if custom task does support retry, update retry history.

 - Step 4. The custom task that wants to support the retry, has to update
   - a) `status.conditions` to indicate it is `Running`.
   - b) clear `/spec/status` if it is `RunRetry`.

 _A task may retry and immediately fail, so controller cannot fully rely on
 `status.conditions`._

 Docs changes
 1. Removed the limitation that retries are not supported for custom task.
 2. Add a short user guide for custom task developer.
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Aug 3, 2021
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign imjasonh
You can assign the PR to them by writing /assign @imjasonh in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 3, 2021
@ScrapCodes ScrapCodes marked this pull request as draft August 3, 2021 13:46
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/config/default.go 82.8% 73.5% -9.2
pkg/apis/pipeline/v1alpha1/run_types.go 43.8% 47.1% 3.3
pkg/apis/pipeline/v1beta1/pipeline_types.go 95.9% 95.8% -0.1
pkg/reconciler/pipelinerun/pipelinerun.go 83.4% 83.8% 0.4
pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go 92.2% 92.3% 0.1
pkg/reconciler/pipelinerun/resources/pipelinerunstate.go 73.4% 73.5% 0.0

@ScrapCodes
Copy link
Contributor Author

Currently, I am using time.AfterFunc in this PR. Which is not so ideal performance wise, once the PR #4131 is merged, I will be able to switch to NewRequeueKey .

@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2021
@tekton-robot
Copy link
Collaborator

@ScrapCodes: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ScrapCodes ScrapCodes closed this Sep 16, 2021
@afrittoli
Copy link
Member

Thanks @ScrapCodes - I look forward to the new PR then!

@ScrapCodes
Copy link
Contributor Author

Yeah @afrittoli, I have proposed a new approach see tektoncd/community#491

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants