From 2729f73438521949029f1114695dd8aafaeff175 Mon Sep 17 00:00:00 2001
From: Priti Desai
Date: Thu, 7 Jan 2021 12:11:38 -0800
Subject: [PATCH] tep to ignore step error

Proposing a tep to ignore step error and provide an option to continue
after capturing the non-zero exit code. Also document the container
termination state to access it after the pipeline execution finishes.
---
 teps/0040-ignore-step-error.md | 176 +++++++++++++++++++++++++++++++++
 teps/README.md                 |   1 +
 2 files changed, 177 insertions(+)
 create mode 100644 teps/0040-ignore-step-error.md

diff --git a/teps/0040-ignore-step-error.md b/teps/0040-ignore-step-error.md
new file mode 100644
index 000000000..1b51153d8
--- /dev/null
+++ b/teps/0040-ignore-step-error.md
@@ -0,0 +1,176 @@
+---
+status: proposed
+title: 'Ignore Step Error'
+creation-date: '2021-01-06'
+last-updated: '2021-02-02'
+authors:
+- '@pritidesai'
+- '@afrittoli'
+---
+
+# TEP-0040: Ignore Step Error
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Requirements](#requirements)
+  - [Use Cases](#use-cases)
+- [References](#references)
+
+
+## Summary
+
+Tekton tasks are defined as a collection of steps in which each step can specify a container image to run.
+Steps are executed in the order in which they are specified. A single step failure results in a task failure,
+i.e. once a step fails, the rest of the steps are not executed. When a container exits with a
+non-zero exit code, the step results in an error:
+
+```yaml
+$ kubectl get tr failing-taskrun-hw5xj -o json | jq .status.steps
+[
+  {
+    "container": "step-failing-step",
+    "imageID": "...",
+    "name": "failing-step",
+    "terminated": {
+      "containerID": "...",
+      "exitCode": 244,
+      "finishedAt": "2021-02-02T18:27:46Z",
+      "reason": "Error",
+      "startedAt": "2021-02-02T18:27:46Z"
+    }
+  }
+]
+```
+
+A `TaskRun` with such a step error stops executing subsequent steps and results in a failure:
+
+```yaml
+$ kubectl get tr failing-taskrun-hw5xj -o json | jq .status.conditions
+[
+  {
+    "lastTransitionTime": "2021-02-02T18:27:47Z",
+    "message": "\"step-failing-step\" exited with code 244 (image: \"...\"); for logs run: kubectl -n default logs failing-taskrun-hw5xj-pod-wj6vn -c step-failing-step\n",
+    "reason": "Failed",
+    "status": "False",
+    "type": "Succeeded"
+  }
+]
+```
+
+If such a task with a failing step is part of a pipeline, the `PipelineRun` stops executing subsequent steps in that task
+(similar to the `TaskRun`), stops executing any other task in the pipeline, and results in a pipeline failure.
+
+```yaml
+$ kubectl get pr pipelinerun-with-failing-step-csmjr -o json | jq .status.conditions
+[
+  {
+    "lastTransitionTime": "2021-02-02T18:51:15Z",
+    "message": "Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 3",
+    "reason": "Failed",
+    "status": "False",
+    "type": "Succeeded"
+  }
+]
+```
+
+Many common tasks require that a step failure must not stop the execution of the rest of the steps.
+In order to continue executing subsequent steps, task authors have the flexibility of wrapping an image and
+exiting that step with success. This turns the failing step into a success and does not block further
+execution. But this is a workaround and only works with images that can be wrapped:
+
+```yaml
+    steps:
+    - image: docker.io/library/golang:latest
+      name: ignore-unit-test-failure
+      script: |
+        go test .
+        TEST_EXIT_CODE=$?
+        if [ "$TEST_EXIT_CODE" -ne 0 ]; then
+          exit 0
+        fi
+```
+
+This workaround does not apply to off-the-shelf container images.
+
+Similarly, many pipelines are required to continue executing the rest of their tasks by preventing the
+failure of such a task from failing the pipeline.
+
+As a pipeline execution engine, we want to support an off-the-shelf container image as a step and provide
+an option to ignore such a step error. The task author can choose to continue execution, capture the original non-zero
+exit code, and make it available to the rest of the steps in that task. We also want to give a pipeline author the option
+to continue executing the rest of the tasks by ignoring a step failure, and to access the original non-zero exit code of
+that step from the rest of the tasks.
+
+Issue: [tektoncd/pipeline#2800](https://github.com/tektoncd/pipeline/issues/2800)
+
+
+## Motivation
+
+It should be possible to easily use off-the-shelf (OTS) images as steps in Tekton tasks. A task author has no
+control over the image but may want to ignore an error and continue executing the rest of the steps.
+
+One more motivation for this proposal is to expose step-level failure at the pipeline level to support tasks from
+the catalog. Allowing step-level failures to be configured at pipeline authoring time makes it possible for
+the pipeline author to use the catalog even though the author has no control over the tasks in it.
+
+**Note:** Both motivations might bring separate API changes (the former at the task level and the latter at the pipeline level),
+but the changes must be compatible with each other.
+
+### Goals
+
+Design a step failure strategy so that the task author can control the behaviour of an image and decide to
+continue executing the rest of the steps in the task.
+
+Prevent a task from failing when a step fails.
+
+Store the container termination state or error state and make it accessible to the rest of the steps in the task.
+
+Keep the container termination state accessible after the task finishes execution.
+
+This proposal must be applicable to any container image, including custom images and off-the-shelf images.
+
+### Non-Goals
+
+This design is limited to a step within a task and does not apply to pipeline tasks.
+
+## Requirements
+
+* Users should be able to use prebuilt images as-is without having to do one or more of the following
+  (see also [TEP-0011](https://github.com/tektoncd/community/blob/master/teps/0011-redirecting-step-output-streams.md)):
+  * Investigate how they are built to understand whether they contain a shell, and possibly override the entrypoint
+  * Build and maintain their own images (i.e. add the required shell or other binaries) from those images
+
+* It should be possible to know that a step was allowed to fail by observing the status of the `TaskRun`
+  (and `PipelineRun` if applicable), e.g. to show a "warning" or display a "yellow" status in a UI.
+
+* When a step is allowed to fail, the exit code of the process that failed should not be lost and should at a minimum be
+  available in the status of the `TaskRun` (and `PipelineRun` if applicable).
+
+
+### Use Cases
+
+* As a task author, I would like to design a task with multiple steps. One of the steps runs an
+enterprise image to run unit tests, and the next step needs to report the test results even if the previous
+step fails due to test failures.
+
+* Allow migrating scripts and automations from other CI/CD systems that allowed image failures.
+
+* A [platform team](https://github.com/tektoncd/community/blob/master/user-profiles.md#1-pipeline-and-task-authors)
+  wants to share a `Task` with their team which runs the following steps in sequence:
+  * Run unit tests (which may fail)
+  * Apply a mutation to the test results (e.g. convert them to a certain format such as JUnit)
+  * Upload the results to a central location used by all the teams
+
+* As a pipeline author, I would like to use a shared `Task` (which may result in a step error) and configure the pipeline
+  to ignore such a step error.
+
+
+## References
+
+* [Capture Exit Code, tektoncd/pipeline#2800](https://github.com/tektoncd/pipeline/issues/2800)
+* [Add a field to Step that allows it to ignore failed prior Steps within the same Task, tektoncd/pipeline#1559](https://github.com/tektoncd/pipeline/issues/1559)
+* [Scott's Changes to allow steps to run regardless of previous step errors](https://github.com/tektoncd/pipeline/pull/1573)
+* [Christie's Notes](https://docs.google.com/document/d/11wygsRe2d4G-wTJMddIdBgSOB5TpsWCqGGACSXusy_U/edit?resourcekey=0-skOAYQiz0xIktxYxCm-SFg) - Thank You, Christie!
diff --git a/teps/README.md b/teps/README.md
index 43635cc10..17d7529f3 100644
--- a/teps/README.md
+++ b/teps/README.md
@@ -148,4 +148,5 @@ This is the complete list of Tekton teps:
 |[TEP-0035](0035-document-tekton-position-around-policy-authentication-authorization.md) | document-tekton-position-around-policy-authentication-authorization | implementable | 2020-12-09 |
 |[TEP-0036](0036-start-measuring-tekton-pipelines-performance.md) | Start Measuring Tekton Pipelines Performance | proposed | 2020-11-20 |
 |[TEP-0037](0037-remove-gcs-fetcher.md) | Remove `gcs-fetcher` image | implementing | 2021-01-27 |
+|[TEP-0040](0040-ignore-step-error.md) | Ignore Step Error | proposed | 2021-02-02 |
 |[TEP-0045](0045-whenexpressions-in-finally-tasks.md) | WhenExpressions in Finally Tasks | implementable | 2021-01-28 |