Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design and implement timeouts #222

Closed
bobcatfish opened this issue Nov 3, 2018 · 13 comments
Closed

Design and implement timeouts #222

bobcatfish opened this issue Nov 3, 2018 · 13 comments
Labels
design This task is about creating and discussing a design help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@bobcatfish
Copy link
Collaborator

Expected Behavior

A user should be able to specify timeouts at the:

  • Pipeline level
  • Task level

If a user does not specify timeouts, then some default should be used and it should be clear what that is:

  • Maybe infinite for a Pipeline
  • Tasks can use the same default as Builds

Actual Behavior

Currently this is possible via the build spec for Tasks only.

Additional Info

@bobcatfish bobcatfish added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. design This task is about creating and discussing a design labels Nov 3, 2018
@bobcatfish bobcatfish added this to the 0.0.1 Alpha release milestone Nov 3, 2018
@dwnusbaum
Copy link
Contributor

dwnusbaum commented Nov 5, 2018

In Jenkins Pipeline, users are able to specify whether a timeout is absolute or relative to the last log activity (activity parameter here). I think the relative option is mostly used for long builds where an absolute timeout would be too long to be useful, but the build is expected to produce log output very consistently. In that case, a timeout after 5 minutes of no log activity can be helpful to make sure builds are not stuck doing nothing for hours before they fail.

I'm not necessarily convinced that relative timeouts are necessary, or if log output is the best way to determine activity, but it seems like something worth discussing.

@abayer
Copy link
Contributor

abayer commented Nov 5, 2018

I'm personally not a huge fan of the relative timeouts in Jenkins - it seems like a useful thing, but I find that in practice, the use cases for "timeout after inactivity" that aren't adequately covered by "timeout after absolute amount of time" are pretty negligible, and probably not worth the extra effort to implement/support. But that's just my two cents.

@bobcatfish bobcatfish removed this from the 0.0.1 Alpha release milestone Nov 10, 2018
@abayer
Copy link
Contributor

abayer commented Nov 20, 2018

Now that I've spent a bit more time digging into the actual code - looks like this already exists for Task, via Build, so we'll just need to add the same thing to Pipeline, yes?

@vdemeester
Copy link
Member

/assign @abayer

@knative-prow-robot
Copy link

@vdemeester: GitHub didn't allow me to assign the following users: abayer.

Note that only knative members and repo collaborators can be assigned.
For more information please see the contributor guide

In response to this:

/assign @abayer

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abayer
Copy link
Contributor

abayer commented Nov 27, 2018

So from discussion in Slack, it feels like the first step is a more general "cancel this running PipelineRun" mechanism, which would be utilized both for timing out the PipelineRun and for manually canceling a running PipelineRun without deleting the PipelineRun/TaskRun/etc.

This would probably entail a new Canceled (or Cancelled, I ain't gonna get into US/UK spelling religious wars here!) Reason status for all of PipelineRun, TaskRun, and Build. When the PipelineRun times out or is manually canceled, its status Reason goes to Canceled, and the PipelineRun reconciler would proceed to set all its TaskRuns to the Canceled Reason status as well. And then the TaskRun controller would set its Build's Reason status to Canceled, and finally the Build controller would see the new Canceled Reason status and terminate its pod, same as it does currently for Build's Timeout configuration.

So putting aside the implementation details there, I'm also trying to determine how this would work in PipelineRun's reconciler - if the Reason status was set to Canceled manually, then, sure, we just go do all that stuff. But for timing out the PipelineRun, we'd need to calculate the PipelineRun's status before we check for whether we've timed out - since we don't want to timeout an actually completed PipelineRun. That'd just be goofy. But we do want to do the timeout check before we get the next task, since what's the point in getting a task we're just going to cancel anyway, right? As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

@abayer
Copy link
Contributor

abayer commented Nov 27, 2018

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

@vdemeester
Copy link
Member

But for timing out the PipelineRun, we'd need to calculate the PipelineRun's status before we check for whether we've timed out - since we don't want to timeout an actually completed PipelineRun. That'd just be goofy.

Yes, I'm pretty sure we already do sthg along those lines in knative/build.

But we do want to do the timeout check before we get the next task, since what's the point in getting a task we're just going to cancel anyway, right?

So, I think the pipelinerun controller should update all already created taskrun status to cancelled (except the one already cancelled). Then the taskrun controller can act on that (and thus not schedule anything if a taskrun has cancelled status). But yeah, I think the timeout check should be one of the first check done by the pipelinerun reconcilier.

As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

🤔

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

I think it "still" make sense to make the change in knative/build too 👼

@abayer
Copy link
Contributor

abayer commented Nov 27, 2018

Ok, I've only barely looked at knative/build's code, so I'm a bit wary of making changes there myself. =)

I think we would want to only update the status to Canceled on TaskRuns that have not already completed - it's still worthwhile to be able to distinguish "this task ran to completion and succeeded/failed" from "this task started and then was canceled before completing".

@bobcatfish
Copy link
Collaborator Author

When the PipelineRun times out or is manually canceled, its status Reason goes to Canceled, and the PipelineRun reconciler would proceed to set all its TaskRuns to the Canceled Reason status as well.

I'd like to plug having two separate Reasons here:

  1. Canceled for being manually interrupted
  2. Timeout for having timed out

Seems worthwhile to be able to tell which was the cause.

So putting aside the implementation details there, I'm also trying to determine how this would work in PipelineRun's reconciler - if the Reason status was set to Canceled manually, then, sure, we just go do all that stuff.

I don't think we should be expecting users to modify anything in the status section (see these docs on k8s api statuses) - whatever the user changes afaik should be in spec, it should be the controller that should update the status reason.

As of now, we calculate the full status after we get the next task - would we end up needing to do that twice, once before the timeout check and again after we get the next task (assuming we haven't timed out)?

Maybe something like:

  1. Check status of all tasks that have been started
  2. Check if timeout has occurred and some tasks have not yet finished
  3. If not timed out or cancelled, look for next task(s) to run

Oh, and that all implicitly requires changes in Build's reconciler - which raises the question of whether to worry about doing that now or just waiting until Build has been folded into here.

Contrary to @vdemeester 's advice I'd implement it here first, maybe only here 😇 something like:

  1. Add cancellation to PipelineRuns
  2. Add cancellation to Pipelines
  3. Add timeouts to Pipelines
    (We should probably be adding Timeouts to Tasks as part of Break Pipelines' dependency on Build #252)

So the only feature to add to Builds potentially would be cancellation?

@imjasonh correct me if I'm wrong but I don't think we want to end up in a world where folks have to add features to both Task + Build

@vdemeester
Copy link
Member

So the only feature to add to Builds potentially would be cancellation?

@bobcatfish yes that's the only thing I suggest to implement in build — sorry if it wasn't clear. And I'm definitely ok for it to be implemented here first 😛

  1. Add cancellation to PipelineRuns
  2. Add cancellation to Pipelines
  3. Add timeouts to Pipelines

Yes 👍 Should we split this into 2 issues (cancellation & timeout) ?

@abayer
Copy link
Contributor

abayer commented Nov 28, 2018

/assign @abayer

@abayer
Copy link
Contributor

abayer commented Nov 28, 2018

@vdemeester - yeah, it sounds reasonable to split into two issues. We can do the hackish approach for cancelling Builds when the PipelineRun times out for now (i.e., modifying the Build CRD instance on the fly to force it to timeout), and then add true cancellation support to Build as part of the Pipeline/PipelineRun cancellation work. Once real cancellation for Builds is in place, we can then switch the PipelineRun timeout behavior to use that instead of the hack.

abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Dec 27, 2018
This still needs work - for example, it can't kill running Builds yet,
but I'm waiting for cancellation (i.e., tektoncd#272) to be a thing before
actually implementing that part.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 22, 2019
This still needs work - for example, it can't kill running Builds yet,
but I'm waiting for cancellation (i.e., tektoncd#272) to be a thing before
actually implementing that part.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 23, 2019
Rebased and squashed the initial work, more to come.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 24, 2019
Rebased and squashed the initial work, more to come.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 25, 2019
Rebased and squashed the initial work, more to come.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 25, 2019
Rebased and squashed the initial work, more to come.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 28, 2019
Rebased and squashed the initial work, more to come.

Fixes tektoncd#222
abayer added a commit to abayer/tektoncd-pipeline that referenced this issue Jan 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design This task is about creating and discussing a design help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
Projects
None yet
Development

No branches or pull requests

5 participants