Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP 0097: Breakpoints for TaskRun and PipelineRuns #572

Merged
merged 1 commit into from
Jul 12, 2022

Conversation

waveywaves
Copy link
Member

@waveywaves waveywaves commented Dec 9, 2021

This proposal builds on top of TEP-0042. It outlines the implementation details for attaching breakpoints against steps in TaskRuns and breakpoints against Tasks in PipelineRuns.

todo:

  • Add details on PipelineRun timeout
  • Implications for Dependent TaskRuns
  • Expound on the usage of /tekton/run

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 9, 2021
@waveywaves waveywaves force-pushed the update/debug-pipelinerun branch 2 times, most recently from f57840e to 64f4873 Compare December 10, 2021 02:47
@waveywaves
Copy link
Member Author

/assign vdemeester

@jerop
Copy link
Member

jerop commented Dec 16, 2021

/assign

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed proposal @waveywaves!

Data Structure

As discussed with @bobcatfish, our main concern is that the API change is encoded into Strings. Prefixing the name or index of a Step with before or after specializes some strings and restricts their use. Instead, we can use the appropriate data structure that supports the type needed here. I see that in TEP-0042: TaskRun Breakpoint on Failure of Step, there was an alternative that has the data structure to support this (we can explore other alternatives too):

debug:
  breakpoint: 
    onFailure: true
    beforeStep: []
    afterStep: []

Names vs Indexes

Another concern that came up in an API WG discussion was the use of Step indexes which may change if a Task is updated without bumping its version (bumping policy to be addressed in tektoncd/catalog#784). In the meantime, can we please scope down this TEP to use the name of the Steps only (not the indexes)? Given that there are some Steps that are unnamed, we can make it explicit that this functionality is only supported in named Steps -- and we can explore supporting unnamed Steps in future work (preferably after we establish the version bumping policy).

@dibyom dibyom added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Dec 20, 2021
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 21, 2021
@waveywaves
Copy link
Member Author

waveywaves commented Dec 31, 2021

@jerop Thank you for the review.

Prefixing the name or index of a Step with before or after specializes some strings and restricts their use. Instead, we can use the appropriate data structure that supports the type needed here. I see that in TEP-0042: TaskRun Breakpoint on Failure of Step, there was an alternative that has the data structure to support this (we can explore other alternatives too):

debug:
 breakpoint: 
   onFailure: true
   beforeStep: []
   afterStep: []

I agree. Prefixing stepNames with "before" or "after" for breakpointing would disallow the user from using certain strings as step/task name eg: before-build etc. +1 for using the proposed alternative of the breakpoint datastructure.

Another concern that came up in an API WG discussion was the use of Step indexes which may change if a Task is updated without bumping its version (bumping policy to be addressed in tektoncd/catalog#784). In the meantime, can we please scope down this TEP to use the name of the Steps only (not the indexes)? Given that there are some Steps that are unnamed, we can make it explicit that this functionality is only supported in named Steps -- and we can explore supporting unnamed Steps in future work (preferably after we establish the version bumping policy).

That makes sense. Using indexes does add some ambiguity in the mix of which step/task might be getting used (especially if the tasks are getting unversioned updates). +1 for scoping this proposal down to referencing step/task names using just their names initially.

@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2022
@waveywaves
Copy link
Member Author

@jerop I have updated the proposal let me know what you think

waveywaves added a commit to waveywaves/tekton-pipeline that referenced this pull request Jan 20, 2022
based on
tektoncd/community#572 (review)
---
the onFailure breakpoint is updated to be a bool to avoid considering
"onFailure" as reserved string.
waveywaves added a commit to waveywaves/tekton-pipeline that referenced this pull request Jan 20, 2022
based on
tektoncd/community#572 (review)
---
the onFailure breakpoint is updated to be a bool to avoid considering
"onFailure" as reserved string.
waveywaves added a commit to waveywaves/tekton-pipeline that referenced this pull request Jan 24, 2022
based on
tektoncd/community#572 (review)
---
the onFailure breakpoint is updated to be a bool to avoid considering
"onFailure" as reserved string.
waveywaves added a commit to waveywaves/tekton-pipeline that referenced this pull request Jan 24, 2022
based on
tektoncd/community#572 (review)
---
the onFailure breakpoint is updated to be a bool to avoid considering
"onFailure" as reserved string.
jerop added a commit to jerop/catalog that referenced this pull request Jan 31, 2022
Today, we don't always bump versions of resources when updating them.
The guideline has been to bump versions only when behavior changes,
but it's hard to figure out when the behavior has changed (a change
that could be trivial to one user could be meaningful to another).

Not bumping resource versions when changing them causes issues where
the resource definition becomes dependent on the time when it was applied
 by the user - which causes unexpected failures as described in
tektoncd#784.

This issue also came up as an issue where users cannot depend on the
Step indices because they can change:
tektoncd/community#572 (review).

In TEP-0003, we already proposed that a policy for versioning of resources:
https://github.com/tektoncd/community/blob/main/teps/0003-tekton-catalog-organization.md#versioning-resources

In Catalog Working Group on 01/13/2022, we revisted that policy and:
- agreed to follow the versioning policy
- make it easier to bump resources (see tektoncd/plumbing#994)

This change documents the versioning policy.

Fxes tektoncd#784
jerop added a commit to jerop/catalog that referenced this pull request Jan 31, 2022
Today, we don't always bump versions of resources when updating them.
The guideline has been to bump versions only when behavior changes,
but it's hard to figure out when the behavior has changed (a change
that could be trivial to one user could be meaningful to another).

Not bumping resource versions when changing them causes issues where
the resource definition becomes dependent on the time when it was applied
 by the user - which causes unexpected failures as described in
tektoncd#784.

This issue also came up as an issue where users cannot depend on the
Step indices because they can change:
tektoncd/community#572 (review).

In TEP-0003, we already proposed that a policy for versioning of resources:
https://github.com/tektoncd/community/blob/main/teps/0003-tekton-catalog-organization.md#versioning-resources

In Catalog Working Group on 01/13/2022, we revisted that policy and:
- agreed to follow the versioning policy
- make it easier to bump resources (see tektoncd/plumbing#994)

This change documents the versioning policy in the contributions guide.

Fixes tektoncd#784
jerop added a commit to jerop/catalog that referenced this pull request Jan 31, 2022
Today, we don't always bump versions of resources when updating them.
The guideline has been to bump versions only when behavior changes,
but it's hard to figure out when the behavior has changed (a change
that could be trivial to one user could be meaningful to another).

Not bumping resource versions when changing them causes issues where
the resource definition becomes dependent on the time when it was applied
 by the user - which causes unexpected failures as described in
tektoncd#784.

This issue also came up as an issue where users cannot depend on the
Step indices because they can change:
tektoncd/community#572 (review).

In TEP-0003, we already proposed that a policy for versioning of resources:
https://github.com/tektoncd/community/blob/main/teps/0003-tekton-catalog-organization.md#versioning-resources

In Catalog Working Group on 01/13/2022, we revisited that policy and:
- agreed to follow the versioning policy
- make it easier to bump resources (see tektoncd/plumbing#994)

This change documents the versioning policy in the contributions guide.

Fixes tektoncd#784
@jerop jerop mentioned this pull request Jan 31, 2022
3 tasks
@tekton-robot tekton-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 1, 2022
Copy link
Member

@afrittoli afrittoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, and sorry it took so long to review on my side 😓

I think the proposal looks solid, I have a few comments but they are more related to the format of the TEP than to the essence of the design.

From the TEP it looks like debug is entirely new functionality, while some part of it is already implemented, so maybe you could distinguish the two parts in the TEP so the scope is more clear.

I see no mention about the timeout implications, do you plan to add those in a follow-up PR?

/approve

- [CLI integration](#cli-integration)
- [Environment access](#environment-access)
- [Alternatives](#alternatives)
-<!-- /toc -->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: empty -

Comment on lines 53 to 78
```yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
generateName: quarante-deux-deux
spec:
debug:
breakpoint:
onFailure: true
beforeStep: ["task-build-archive"]
afterStep: ["deploy-frontend"]
```

```yaml
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
generateName: quarante-deux
spec:
debug:
breakpoint:
onFailure: true
beforeStep: ["test-js"]
afterStep: ["push-to-registry"]
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more like the proposed API design than the goal :)
Maybe you could move this to a section further down.

You can use goals and non-goals to define the scope you have in mind for the TEP.


## Requirements

* To enable breakpoint on failure of a step, it would be necessary to update the Entrypoint to support lifecycle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: the steps required for implementation should be further down in the proposal.
Requirements could be something like:

  • being able to set a breakpoint on failure for a step
  • being able to set a breakpoint before a step

etc...

The `breakpoint` spec would contain the `beforeStep` and `afterStep` specs which are arrays containing names of steps
which the user plans on debugging.

_This functionality will only support named steps_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 - I think this is a fair assumption. We had a similar discussion about resources, and we want to document setting the name for steps as the recommended best practice.

Comment on lines 62 to 64
beforeStep: ["task-build-archive"]
afterStep: ["deploy-frontend"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this before|afterStep or before|afterTask ?

Comment on lines 109 to 79
beforeStep: ["test-js"]
afterStep: ["push-to-registry"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both before and after be specified, and they're both lists, correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe renaming them to beforeSteps and afterSteps could make it clear that they are lists?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done !

Comment on lines +166 to +151
`/tekton/debug/scripts/debug-continue` : Mark the step as completed with success by writing to `/tekton/run/<step-no.>/`. eg: User wants to mark
failed step 0 as a success. Running this script would create `/tekton/run/0/out`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, so even in case of breakpoint on failure, the user can simulate success and continue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep !

`/tekton/debug/info/<n>` : Contains information about the step. Single EmptyDir shared between all step containers, but renamed
to reflect step number. eg: Step 0 will have `/tekton/debug/info/0`, Step 1 will have `/tekton/debug/info/1` etc.

#### Debug Scripts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the scripts below talk about a failed step 0 - are these scripts only meant to be used for the case of breakpoint on failure? Can they be used for other breakpoint types as well, with similar effects?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be possible to use them in other scenarios as well.

afterTask: ["deploy-frontend"]
```

### Breakpoint before or after a TaskRun
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :)

Since a pipeline may be running several tasks in parallel, in case of breakpoints, multiple ones could be hit in parallel, and the user will have the opportunity of interacting with all the tasks under debug.
Tasks that do not depend on any task being debugged will continue to run and be scheduled normally, correct?

Copy link
Member Author

@waveywaves waveywaves Jul 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tasks that do not depend on any task being debugged will continue to run and be scheduled normally, correct?

Yes. Those tasks will continue running normally.

Copy link
Member Author

@waveywaves waveywaves Jul 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added your comment as a statement under Dependent and Independent TaskRuns in this proposal.

for the TaskRun to stop executing (due to failure) and go into the limbo state which would be leveraged for debugging.
The CLI will open a shell to the step container which would be a reimplementation of `kubectl exec -it`.

## Alternatives
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice to see alternatives but there's no pros and cons specified for them, so it's not clear why you picked the other option? I think the proposed option looks good, but it would be good to call out pros / cons for these, or if none is available mention it in a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll add these.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done ! ✔️

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 3, 2022
@tekton-robot tekton-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2022
@waveywaves waveywaves force-pushed the update/debug-pipelinerun branch 2 times, most recently from 9c79355 to c961121 Compare July 10, 2022 11:01
@waveywaves
Copy link
Member Author

Finally updated this TEP after ages.
@jerop @vdemeester @afrittoli If you would be so kind would love to have your eyes on this one. I have added the timeout implications, changed breakpoint to breakpoints to not break previous API behavior.

@waveywaves waveywaves force-pushed the update/debug-pipelinerun branch 3 times, most recently from 6578c0b to 3a3bcf3 Compare July 10, 2022 11:47
Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waveywaves thanks for the updates - the API is much clearer 🎉

left some comments to clarify that we're adding support for debugging pipelineruns - but overall looks good to me

Comment on lines 44 to 47
Debugging TaskRuns can be tiresome. Re-running Tasks to figure out which part of a particular step is to blame would tax productivity.
By enabling breakpoint on failure for a TaskRun we should be able to halt a TaskRun at the failing step and get access to the
step environment to analyze cause of the failure. Unlike legacy systems, this allows the user to debug TaskRuns while the TaskRun
is still running, hence improving developer productivity and Pipeline debuggability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waveywaves this summary is focused on taskruns - which was addressed in https://github.com/tektoncd/community/blob/main/teps/0042-taskrun-breakpoint-on-failure.md - seems copy pasted from there?

we need to update it to reflect that this TEP proposes debugging for pipelineruns

Comment on lines 59 to 60
This enhancement can allow a more interactive way of working with TaskRuns. CLI or IDE extensions could allow users to run a
Task with breakpoint-on-failure and while the logs are being streamed, the user can be dropped to a shell in the breakpoint
failed step and allow the user to debug as they develop.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and pipelineruns?

The `breakpoint` spec would contain the `beforeStep` and `afterStep` specs which are arrays containing names of steps
which the user plans on debugging.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `breakpoint` spec would contain the `beforeStep` and `afterStep` specs which are arrays containing names of steps
which the user plans on debugging.
The `breakpoint` spec would contain the `beforeSteps` and `afterSteps` specs which are arrays containing names of steps
which the user plans on debugging.
Suggested change
The `breakpoint` spec would contain the `beforeStep` and `afterStep` specs which are arrays containing names of steps
which the user plans on debugging.
The `breakpoint` spec would contain the `beforeStep` and `afterStep` specs which are arrays containing names of steps
which the user plans on debugging.

authors:
- "@waveywaves"
creation-date: 2021-12-10
last-updated: 2021-12-10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please update this date

- "@waveywaves"
creation-date: 2021-12-10
last-updated: 2021-12-10
status: proposed
Copy link
Member

@jerop jerop Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can mark the tep as implementable given that the design is included in this change

@pritidesai
Copy link
Member

API WG - ready to merge this offline after the TEP is updated, we have enough approval!

@waveywaves waveywaves force-pushed the update/debug-pipelinerun branch 3 times, most recently from 37ffbf7 to b0a115b Compare July 12, 2022 05:15
@waveywaves
Copy link
Member Author

@jerop I have added the updates based on your comments. The summary indeed was a copy 😝 I have added a more relevant one and updated the use cases to reflect changes introduced in this PR and pipelineruns. Changed before|afterSteps -> before|afterStep. Updated the date to today's and status to implementable.

@jerop
Copy link
Member

jerop commented Jul 12, 2022

@jerop I have added the updates based on your comments. The summary indeed was a copy 😝 I have added a more relevant one and updated the use cases to reflect changes introduced in this PR and pipelineruns. Changed before|afterSteps -> before|afterStep. Updated the date to today's and status to implementable.

@waveywaves thanks for making the updates - i intended the change to be before|afterStep -> before|afterSteps, the same way that we have before|afterTasks, because it's an array

@waveywaves
Copy link
Member Author

@jerop just updated. I misread your comment earlier then.

Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@waveywaves thank you!

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2022
@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, jerop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This proposal builds on top of TEP-0042. It outlines the implementation
details for attaching breakpoints against steps in TaskRuns and
breakpoints against Tasks in PipelineRuns.
@tekton-robot tekton-robot removed the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2022
@jerop
Copy link
Member

jerop commented Jul 12, 2022

/lgtm

@wuhuizuo
Copy link

wuhuizuo commented Nov 1, 2023

Is it implemented? tektoncd/pipeline#5184 was closed by bot without any pull requests.

@vdemeester
Copy link
Member

@wuhuizuo not yet probably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Status: Implementable
Development

Successfully merging this pull request may close these issues.

9 participants