Design and implement timeouts #222

bobcatfish · 2018-11-03T00:36:38Z

Expected Behavior

A user should be able to specify timeouts at the:

Pipeline level
Task level

If a user does not specify timeouts, then some default should be used and it should be clear what that is:

Maybe infinite for a Pipeline
Tasks can use the same default as Builds

Actual Behavior

Currently this is possible via the build spec for Tasks only.

Additional Info

@abayer

dwnusbaum · 2018-11-05T15:29:15Z

In Jenkins Pipeline, users are able to specify whether a timeout is absolute or relative to the last log activity (activity parameter here). I think the relative option is mostly used for long builds where an absolute timeout would be too long to be useful, but the build is expected to produce log output very consistently. In that case, a timeout after 5 minutes of no log activity can be helpful to make sure builds are not stuck doing nothing for hours before they fail.

I'm not necessarily convinced that relative timeouts are necessary, or if log output is the best way to determine activity, but it seems like something worth discussing.

abayer · 2018-11-05T15:32:39Z

I'm personally not a huge fan of the relative timeouts in Jenkins - it seems like a useful thing, but I find that in practice, the use cases for "timeout after inactivity" that aren't adequately covered by "timeout after absolute amount of time" are pretty negligible, and probably not worth the extra effort to implement/support. But that's just my two cents.

abayer · 2018-11-20T17:03:11Z

Now that I've spent a bit more time digging into the actual code - looks like this already exists for Task, via Build, so we'll just need to add the same thing to Pipeline, yes?

vdemeester · 2018-11-27T15:15:52Z

/assign @abayer

knative-prow-robot · 2018-11-27T15:15:54Z

@vdemeester: GitHub didn't allow me to assign the following users: abayer.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @abayer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abayer · 2018-11-27T15:41:33Z

So from discussion in Slack, it feels like the first step is a more general "cancel this running PipelineRun" mechanism, which would be utilized both for timing out the PipelineRun and for manually canceling a running PipelineRun without deleting the PipelineRun/TaskRun/etc.

This would probably entail a new Canceled (or Cancelled, I ain't gonna get into US/UK spelling religious wars here!) Reason status for all of PipelineRun, TaskRun, and Build. When the PipelineRun times out or is manually canceled, its status Reason goes to Canceled, and the PipelineRun reconciler would proceed to set all its TaskRuns to the Canceled Reason status as well. And then the TaskRun controller would set its Build's Reason status to Canceled, and finally the Build controller would see the new Canceled Reason status and terminate its pod, same as it does currently for Build's Timeout configuration.

So putting aside the implementation details there, I'm also trying to determine how this would work in PipelineRun's reconciler - if the Reason status was set to Canceled manually, then, sure, we just go do all that stuff. But for timing out the PipelineRun, we'd need to calculate the PipelineRun's status before we check for whether we've timed out - since we don't want to timeout an actually completed PipelineRun. That'd just be goofy. But we do want to do the timeout check before we get the next task, since what's the point in getting a task we're just going to cancel anyway, right? As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

abayer · 2018-11-27T15:43:50Z

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

vdemeester · 2018-11-27T16:21:12Z

But for timing out the PipelineRun, we'd need to calculate the PipelineRun's status before we check for whether we've timed out - since we don't want to timeout an actually completed PipelineRun. That'd just be goofy.

Yes, I'm pretty sure we already do sthg along those lines in knative/build.

But we do want to do the timeout check before we get the next task, since what's the point in getting a task we're just going to cancel anyway, right?

So, I think the pipelinerun controller should update all already created taskrun status to cancelled (except the one already cancelled). Then the taskrun controller can act on that (and thus not schedule anything if a taskrun has cancelled status). But yeah, I think the timeout check should be one of the first check done by the pipelinerun reconcilier.

As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

🤔

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

I think it "still" make sense to make the change in knative/build too 👼

abayer · 2018-11-27T17:39:17Z

Ok, I've only barely looked at knative/build's code, so I'm a bit wary of making changes there myself. =)

I think we would want to only update the status to Canceled on TaskRuns that have not already completed - it's still worthwhile to be able to distinguish "this task ran to completion and succeeded/failed" from "this task started and then was canceled before completing".

bobcatfish · 2018-11-27T23:33:03Z

When the PipelineRun times out or is manually canceled, its status Reason goes to Canceled, and the PipelineRun reconciler would proceed to set all its TaskRuns to the Canceled Reason status as well.

I'd like to plug having two separate Reasons here:

Canceled for being manually interrupted
Timeout for having timed out

Seems worthwhile to be able to tell which was the cause.

So putting aside the implementation details there, I'm also trying to determine how this would work in PipelineRun's reconciler - if the Reason status was set to Canceled manually, then, sure, we just go do all that stuff.

I don't think we should be expecting users to modify anything in the status section (see these docs on k8s api statuses) - whatever the user changes afaik should be in spec, it should be the controller that should update the status reason.

As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

Maybe something like:

Check status of all tasks that have been started
Check if timeout has occurred and some tasks have not yet finished
If not timed out or cancelled, look for next task(s) to run

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

Contrary to @vdemeester 's advice I'd implement it here first, maybe only here 😇 something like:

Add cancellation to PipelineRuns
Add cancellation to Pipelines
Add timeouts to Pipelines
(We should probably be adding Timeouts to Tasks as part of Break Pipelines' dependency on Build #252)

So the only feature to add to Builds potentially would be cancellation?

@imjasonh correct me if I'm wrong but I don't think we want to end up in a world where folks have to add features to both Task + Build

vdemeester · 2018-11-28T08:25:55Z

So the only feature to add to Builds potentially would be cancellation?

@bobcatfish yes that's the only thing I suggest to implement in build — sorry if it wasn't clear. And I'm definitely ok for it to be implemented here first 😛

Add cancellation to PipelineRuns

Add cancellation to Pipelines

Add timeouts to Pipelines

Yes 👍 Should we split this into 2 issues (cancellation & timeout) ?

abayer · 2018-11-28T15:32:11Z

/assign @abayer

abayer · 2018-11-28T15:35:11Z

@vdemeester - yeah, it sounds reasonable to split into two issues. We can do the hackish approach for cancelling Builds when the PipelineRun times out for now (i.e., modifying the Build CRD instance on the fly to force it to timeout), and then add true cancellation support to Build as part of the Pipeline/PipelineRun cancellation work. Once real cancellation for Builds is in place, we can then switch the PipelineRun timeout behavior to use that instead of the hack.

This still needs work - for example, it can't kill running Builds yet, but I'm waiting for cancellation (i.e., tektoncd#272) to be a thing before actually implementing that part. Fixes tektoncd#222

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

Fixes tektoncd#222

Fixes #222

bobcatfish added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. design This task is about creating and discussing a design labels Nov 3, 2018

bobcatfish added this to the 0.0.1 Alpha release milestone Nov 3, 2018

bobcatfish removed this from the 0.0.1 Alpha release milestone Nov 10, 2018

knative-prow-robot assigned abayer Nov 28, 2018

This was referenced Nov 28, 2018

Add the ability to cancel/abort a running PipelineRun without deleting the PipelineRun itself. #272

Closed

Once a Run finishes, it should never be invoked again #299

Closed

abayer mentioned this issue Dec 27, 2018

Add Pipeline-level timeouts #355

Merged

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 23, 2019

Add Pipeline-level timeouts

b945146

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 24, 2019

Add Pipeline-level timeouts

d5437f3

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 25, 2019

Add Pipeline-level timeouts

cc1c191

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 25, 2019

Add Pipeline-level timeouts

5675667

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 28, 2019

Add Pipeline-level timeouts

67ea3a6

Rebased and squashed the initial work, more to come. Fixes tektoncd#222

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 29, 2019

Add Pipeline-level timeouts, implement Task-level timeouts

ce606dd

Fixes tektoncd#222

knative-prow-robot closed this as completed in #355 Jan 30, 2019

knative-prow-robot pushed a commit that referenced this issue Jan 30, 2019

Add Pipeline-level timeouts, implement Task-level timeouts

23a2b91

Fixes #222

mchmarny unassigned abayer Mar 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design and implement timeouts #222

Design and implement timeouts #222

bobcatfish commented Nov 3, 2018

dwnusbaum commented Nov 5, 2018 •

edited

Loading

abayer commented Nov 5, 2018

abayer commented Nov 20, 2018

vdemeester commented Nov 27, 2018

knative-prow-robot commented Nov 27, 2018

abayer commented Nov 27, 2018 •

edited

Loading

abayer commented Nov 27, 2018

vdemeester commented Nov 27, 2018

abayer commented Nov 27, 2018

bobcatfish commented Nov 27, 2018

vdemeester commented Nov 28, 2018

abayer commented Nov 28, 2018

abayer commented Nov 28, 2018

Design and implement timeouts #222

Design and implement timeouts #222

Comments

bobcatfish commented Nov 3, 2018

Expected Behavior

Actual Behavior

Additional Info

dwnusbaum commented Nov 5, 2018 • edited Loading

abayer commented Nov 5, 2018

abayer commented Nov 20, 2018

vdemeester commented Nov 27, 2018

knative-prow-robot commented Nov 27, 2018

abayer commented Nov 27, 2018 • edited Loading

abayer commented Nov 27, 2018

vdemeester commented Nov 27, 2018

abayer commented Nov 27, 2018

bobcatfish commented Nov 27, 2018

vdemeester commented Nov 28, 2018

abayer commented Nov 28, 2018

abayer commented Nov 28, 2018

dwnusbaum commented Nov 5, 2018 •

edited

Loading

abayer commented Nov 27, 2018 •

edited

Loading