
orchestrator: Use task history to evaluate restart policy #2332

Merged: 8 commits into master, Jul 27, 2017

Conversation

aaronlehmann (Collaborator):

Previously, restart conditions other than "OnAny" were honored on a best-effort basis. A service-level reconciliation, for example after a leader election, would see that not enough tasks were running, and start replacement tasks regardless of the restart policy. This limited the usefulness of the other restart conditions.

This change is similar to #2327, but instead of adding a DontRestart flag, the orchestrator figures out on the fly whether a task should be restarted or not, using historic tasks for context.
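
As a rough illustration (a minimal sketch assuming swarmkit's api package; shouldReplace is a made-up name, and the real decision also counts prior restarts in the slot's history against MaxAttempts and Window), the reconciliation check this enables looks roughly like:

    // Hedged sketch, not the PR's exact code: decide whether a slot with no
    // running task should get a replacement, based on the most recent
    // historic task instead of restarting unconditionally.
    // Assumes: import "github.com/docker/swarmkit/api"
    func shouldReplace(t *api.Task) bool {
        condition := api.RestartOnAny
        if t.Spec.Restart != nil {
            condition = t.Spec.Restart.Condition
        }
        switch condition {
        case api.RestartOnNone:
            return false
        case api.RestartOnFailure:
            // Only replace a task that failed; a completed task stays done.
            return t.Status.State != api.TaskStateCompleted
        default: // api.RestartOnAny
            return true
        }
    }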

In addition to changing the reconciliation logic, this adds manager-side timestamps to TaskStatus, and changes the task reaper to preserve at least MaxAttempts + 1 tasks. It has the side effect of allowing the restart policy to be applied correctly across leader elections (previously, this state was tracked locally).

It's an alternative to both #2290 and #2327.

ping @aluzzardi @cyli

codecov bot commented Jul 26, 2017

Codecov Report

Merging #2332 into master will decrease coverage by 0.07%.
The diff coverage is 83.07%.

@@            Coverage Diff             @@
##           master    #2332      +/-   ##
==========================================
- Coverage   60.35%   60.28%   -0.08%     
==========================================
  Files         128      128              
  Lines       26002    26048      +46     
==========================================
+ Hits        15694    15702       +8     
- Misses       8910     8944      +34     
- Partials     1398     1402       +4

instanceTuple := orchestrator.SlotTuple{
    Slot:      t.Slot,
    ServiceID: t.ServiceID,
}

// Instance is not meaningful for "global" tasks, so they need to be
Contributor:

Non-blocking: "Slot ID" instead of "Instance" in this comment, maybe?


var next *list.Element
for e := restartInfo.restartedInstances.Front(); e != nil; e = next {
    next = e.Next()

    if e.Value.(restartedInstance).timestamp.After(lookback) {
        for e2 := restartInfo.restartedInstances.Back(); e2 != nil; e2 = e2.Prev() {
Contributor:

Apologies, I don't know this super well, so I would like to verify my understanding:

  1. restartInfo.restartedInstances is a doubly-linked list where the front is the oldest restarted instance, and the back is the newest restarted instance?
  2. This block does 2 things simultaneously: clearing any recorded history that happened before lookback, and also identifying the number of restarts between lookback and the timestamp for this task?

If so, it might be slightly easier to read if this were broken into 2 separate loops - one that reaps restart history, and the other that ignores any restarts that happened after the task status timestamp.

But if it truncates the restart history relative to the timestamp of the task, wouldn't calling ShouldRestart on a later task truncate history that might be needed for an earlier task? Or would that history never be needed again?

Also, what would cause restartedInstances to have more restarts than the last status update of the task?

Collaborator Author:

> restartInfo.restartedInstances is a doubly-linked list where the front is the oldest restarted instance, and the back is the newest restarted instance?

Correct

> This block does 2 things simultaneously: clearing any recorded history that happened before lookback, and also identifying the number of restarts between lookback and the timestamp for this task?

Yeah, this is a mess.

> If so, it might be slightly easier to read if this were broken into 2 separate loops - one that reaps restart history, and the other that ignores any restarts that happened after the task status timestamp.

I agree. There's no reason for them to be nested. Thanks for the suggestion. I've moved the inner loop after the outer one. This actually revealed a bug, because the wrong iteration variable was being used :(.

> But if it truncates the restart history relative to the timestamp of the task, wouldn't calling ShouldRestart on a later task truncate history that might be needed for an earlier task? Or would that history never be needed again?

That's right that the history would never be needed again. We would only ever make a restart decision for the latest task in a slot.

I just discovered a minor case where we do expect to have the history for older tasks, and I've pushed a fix for it.

> Also, what would cause restartedInstances to have more restarts than the last status update of the task?

ShouldRestart gets called on older tasks as well, for example when updatableAndDeadSlots calls IsTaskUpdatable. That said, I took a closer look at this and the logic is pretty bogus. An older task could get updated by the updater, which doesn't really make sense - it should only consider the most recent task. I've pushed another commit that redoes this. Now shouldRestart should only be called on the most recent task. I've left the loop that ignores restarts which happened after the task was restarted, because it seems most correct to have it in place, but it shouldn't be necessary anymore, and I'm fine with removing it.
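
For readers following the refactor, a hedged sketch of what the two separated passes might look like (recentRestarts and statusTimestamp are illustrative names, not taken from the PR; restartedInstance is assumed to carry a timestamp field as in the snippet above):

    // Rough sketch, assuming restartedInstances (a container/list.List of
    // restartedInstance values) is ordered oldest (front) to newest (back).
    // Assumes: import "container/list" and "time".
    func recentRestarts(restartedInstances *list.List, lookback, statusTimestamp time.Time) int {
        // Pass 1: reap history that falls outside the lookback window.
        var next *list.Element
        for e := restartedInstances.Front(); e != nil; e = next {
            next = e.Next()
            if e.Value.(restartedInstance).timestamp.After(lookback) {
                break // remaining entries are newer; stop reaping
            }
            restartedInstances.Remove(e)
        }

        // Pass 2: ignore restarts recorded after this task's own status
        // timestamp, walking back from the newest entry.
        numRestarts := restartedInstances.Len()
        for e := restartedInstances.Back(); e != nil; e = e.Prev() {
            if !e.Value.(restartedInstance).timestamp.After(statusTimestamp) {
                break
            }
            numRestarts--
        }
        return numRestarts
    }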

Contributor:

> I've left the loop that ignores restarts which happened after the task was restarted, because it seems most correct to have it in place, but it shouldn't be necessary anymore, and I'm fine with removing it.

I'm good with leaving it in - as you said, it's not wrong.


// AppliedBy gives the node ID of the manager that applied this task
// status update to the Task object.
string applied_by = 7;
Contributor:

Non-blocking question - more information is always helpful for debugging, but what is this intended to be used for?

Collaborator Author:

The thinking is that since we're storing a timestamp, it's useful to know what frame of reference that timestamp comes from, in case we later add logic to handle clock skew.

I'm fine with removing this for now. I suppose it could be useful for debugging, but that's not a very strong argument for having it.
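
As a purely hypothetical illustration of that frame of reference (stampStatus, AppliedAt, and managerNodeID are assumed names, not quoted from the PR), the manager applying an agent-reported update could record both its identity and its own clock:

    // Hypothetical sketch: stamp the status update with the applying
    // manager's node ID and local time, so consumers know whose clock the
    // timestamp comes from if clock-skew handling is added later.
    // Assumes: import "time", "github.com/docker/swarmkit/api", and
    // gogotypes "github.com/gogo/protobuf/types".
    func stampStatus(status *api.TaskStatus, managerNodeID string) error {
        appliedAt, err := gogotypes.TimestampProto(time.Now())
        if err != nil {
            return err
        }
        status.AppliedAt = appliedAt     // manager-local time of application
        status.AppliedBy = managerNodeID // whose clock AppliedAt comes from
        return nil
    }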

Contributor:

That'd be a cool thing to have eventually - would we store historical skew data between the manager nodes for the adjustment?

@aluzzardi (Member) left a comment:

LGTM

@@ -284,10 +284,6 @@ func testUpdaterRollback(t *testing.T, rollbackFailureAction api.UpdateConfig_Fa
    assert.Equal(t, observedTask.Status.State, api.TaskStateNew)
    assert.Equal(t, observedTask.Spec.GetContainer().Image, "image1")

    observedTask = testutils.WatchTaskCreate(t, watchCreate)
Contributor:

Why does the rollback only create 2 of the old tasks in this case?

Collaborator Author:

I had an elaborate explanation ready to go for this, but I went to verify it, and it turned out to be incorrect. The updater actually does roll back all three tasks. I've restored the check.

This timestamp will be useful for tracking restart history, since the
manager clocks may be more trustworthy than agent clocks.

Signed-off-by: Aaron Lehmann <[email protected]>

Previously, restart conditions other than "OnAny" were honored on a
best-effort basis. A service-level reconciliation, for example after a
leader election, would see that not enough tasks were running, and start
replacement tasks regardless of the restart policy. This limited the
usefulness of the other restart conditions.

This change makes the orchestrators check historic tasks to figure out
if a task should be restarted or not, on service reconciliation.

Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>

…iliation decisions

Only look at the most recent task to see if it should be restarted.

Signed-off-by: Aaron Lehmann <[email protected]>
Signed-off-by: Aaron Lehmann <[email protected]>
aaronlehmann force-pushed the restart-based-on-task-history branch from 50f3d19 to 1b7b99d on July 27, 2017 at 20:37
@aaronlehmann (Collaborator Author):

This PR passes integration-cli tests.

@cyli (Contributor) commented Jul 27, 2017:

LGTM, thanks for all the explanations!

@aaronlehmann (Collaborator Author):

Let's go for it then. Hopefully this doesn't cause any regressions - it's a fairly involved change. But I think we found a good model and we'll fix a longstanding orchestration issue.
