orchestrator: Use task history to evaluate restart policy #2332
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2332 +/- ##
==========================================
- Coverage 60.35% 60.28% -0.08%
==========================================
Files 128 128
Lines 26002 26048 +46
==========================================
+ Hits 15694 15702 +8
- Misses 8910 8944 +34
- Partials 1398 1402 +4
serviceID: t.ServiceID,
instanceTuple := orchestrator.SlotTuple{
	Slot:      t.Slot,
	ServiceID: t.ServiceID,
}

// Instance is not meaningful for "global" tasks, so they need to be
Non-blocking: "Slot ID" instead of "Instance" in this comment, maybe?
var next *list.Element
for e := restartInfo.restartedInstances.Front(); e != nil; e = next {
	next = e.Next()

	if e.Value.(restartedInstance).timestamp.After(lookback) {
		for e2 := restartInfo.restartedInstances.Back(); e2 != nil; e2 = e2.Prev() {
Apologies, I don't know this super well, so I would like to verify my understanding:

- `restartInfo.restartedInstances` is a doubly-linked list where the front is the oldest restarted instance, and the back is the newest restarted instance?
- This block does 2 things simultaneously: clearing any recorded history that happened before `lookback`, and also identifying the number of restarts between `lookback` and the timestamp for this task?

If so, it might be slightly easier to read if this were broken into 2 separate loops - one that reaps restart history, and the other that ignores any restarts that happened after the task status timestamp.

But if it truncates the restart history relative to the timestamp of the task, wouldn't calling `ShouldRestart` on a later task truncate history that might be needed for an earlier task? Or would that history never be needed again?

Also, what would cause `restartedInstances` to have more restarts than the last status update of the task?
> restartInfo.restartedInstances is a doubly-linked list where the front is the oldest restarted instance, and the back is the newest restarted instance?

Correct.

> This block does 2 things simultaneously: clearing any recorded history that happened before lookback, and also identifying the number of restarts between lookback and the timestamp for this task?

Yeah, this is a mess.

> If so, it might be slightly easier to read if this were broken into 2 separate loops - one that reaps restart history, and the other that ignores any restarts that happened after the task status timestamp.

I agree. There's no reason for them to be nested. Thanks for the suggestion. I've moved the inner loop after the outer one. This actually revealed a bug, because the wrong iteration variable was being used :(.

> But if it truncates the restart history relative to the timestamp of the task, wouldn't calling ShouldRestart on a later task truncate history that might be needed for an earlier task? Or would that history never be needed again?

That's right, the history would never be needed again. We would only ever make a restart decision for the latest task in a slot.

I just discovered a minor case where we do expect to have the history for older tasks, and I've pushed a fix for it.

> Also, what would cause restartedInstances to have more restarts than the last status update of the task?

`ShouldRestart` gets called on older tasks as well, for example when `updatableAndDeadSlots` calls `IsTaskUpdatable`. That said, I took a closer look at this and the logic is pretty bogus. An older task could get updated by the updater, which doesn't really make sense - it should only consider the most recent task. I've pushed another commit that redoes this. Now `shouldRestart` should only be called on the most recent task. I've left the loop that ignores restarts which happened after the task was restarted, because it seems most correct to have it in place, but it shouldn't be necessary anymore, and I'm fine with removing it.
> I've left the loop that ignores restarts which happened after the task was restarted, because it seems most correct to have it in place, but it shouldn't be necessary anymore, and I'm fine with removing it.

I'm good with leaving it in - as you said, it's not wrong.
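For reference, here is a minimal, self-contained sketch of the two-loop structure discussed above. The `restartedInstance` and `restartInfo` types and field names are simplified assumptions for illustration, not the actual swarmkit code.

```go
package main

import (
	"container/list"
	"fmt"
	"time"
)

type restartedInstance struct {
	timestamp time.Time
}

type restartInfo struct {
	restartedInstances *list.List // front = oldest, back = newest
}

// countRecentRestarts first reaps history older than lookback, then counts
// restarts that happened at or before statusTimestamp. Keeping the loops
// separate avoids the nested form that hid the iteration-variable bug.
func countRecentRestarts(info *restartInfo, lookback, statusTimestamp time.Time) int {
	// Loop 1: reap entries that fall outside the lookback window.
	var next *list.Element
	for e := info.restartedInstances.Front(); e != nil; e = next {
		next = e.Next()
		if e.Value.(restartedInstance).timestamp.After(lookback) {
			break // list is ordered oldest -> newest, so we can stop here
		}
		info.restartedInstances.Remove(e)
	}

	// Loop 2: count remaining restarts, ignoring any that happened after
	// the task's status timestamp (walking back from the newest entry).
	count := 0
	for e := info.restartedInstances.Back(); e != nil; e = e.Prev() {
		if e.Value.(restartedInstance).timestamp.After(statusTimestamp) {
			continue
		}
		count++
	}
	return count
}

func main() {
	now := time.Now()
	info := &restartInfo{restartedInstances: list.New()}
	for _, d := range []time.Duration{-10 * time.Minute, -3 * time.Minute, -1 * time.Minute} {
		info.restartedInstances.PushBack(restartedInstance{timestamp: now.Add(d)})
	}
	// With a 5-minute window, the 10-minute-old entry is reaped and two restarts remain.
	fmt.Println(countRecentRestarts(info, now.Add(-5*time.Minute), now)) // 2
}
```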
// AppliedBy gives the node ID of the manager that applied this task
// status update to the Task object.
string applied_by = 7;
Non-blocking question - more information is always helpful for debugging, but what is this intended to be used for?
The thinking is that since we're storing a timestamp, it's useful to know what frame of reference that timestamp comes from, in case we later add logic to handle clock skew.
I'm fine with removing this for now. I suppose it could be useful for debugging, but that's not a very strong argument for having it.
That'd be a cool thing to have eventually - would we store historical skew data between the manager nodes for the adjustment?
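To make the idea concrete, here is a hedged illustration (not swarmkit's actual API) of how a manager-side timestamp plus the applying node's ID could feed a future skew adjustment. All types, field names, and the `adjustForSkew` helper are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// taskStatus mirrors the idea behind applied_by plus a manager-side
// timestamp: AppliedAt is stamped by the manager that applied the update,
// and AppliedBy records which manager's clock produced it.
type taskStatus struct {
	AppliedBy string    // node ID of the applying manager
	AppliedAt time.Time // timestamp from that manager's clock
}

// adjustForSkew shifts a status timestamp into the local manager's frame of
// reference, given an estimated per-manager clock offset (e.g. learned from
// heartbeats). This is purely speculative logic for the discussion above.
func adjustForSkew(s taskStatus, skewByManager map[string]time.Duration) time.Time {
	if offset, ok := skewByManager[s.AppliedBy]; ok {
		return s.AppliedAt.Add(-offset)
	}
	return s.AppliedAt
}

func main() {
	skew := map[string]time.Duration{"manager-2": 30 * time.Second}
	s := taskStatus{AppliedBy: "manager-2", AppliedAt: time.Now()}
	fmt.Println(adjustForSkew(s, skew))
}
```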
LGTM
@@ -284,10 +284,6 @@ func testUpdaterRollback(t *testing.T, rollbackFailureAction api.UpdateConfig_Fa
	assert.Equal(t, observedTask.Status.State, api.TaskStateNew)
	assert.Equal(t, observedTask.Spec.GetContainer().Image, "image1")

	observedTask = testutils.WatchTaskCreate(t, watchCreate)
Why does the rollback only create 2 of the old tasks in this case?
I had an elaborate explanation ready to go for this, but I went to verify it, and it turned out to be incorrect. The updater actually does roll back all three tasks. I've restored the check.
This timestamp will be useful for tracking restart history, since the manager clocks may be more trustworthy than agent clocks. Signed-off-by: Aaron Lehmann <[email protected]>
Previously, restart conditions other than "OnAny" were honored on a best-effort basis. A service-level reconciliation, for example after a leader election, would see that not enough tasks were running, and start replacement tasks regardless of the restart policy. This limited the usefulness of the other restart conditions. This change makes the orchestrators check historic tasks to figure out if a task should be restarted or not, on service reconciliation. Signed-off-by: Aaron Lehmann <[email protected]>
…iliation decisions Only look at the most recent task to see if it should be restarted. Signed-off-by: Aaron Lehmann <[email protected]>
This PR passes integration-cli tests.
LGTM, thanks for all the explanations!
Let's go for it then. Hopefully this doesn't cause any regressions - it's a fairly involved change. But I think we found a good model and we'll fix a longstanding orchestration issue.
Previously, restart conditions other than "OnAny" were honored on a best-effort basis. A service-level reconciliation, for example after a leader election, would see that not enough tasks were running, and start replacement tasks regardless of the restart policy. This limited the usefulness of the other restart conditions.

This change is similar to #2327, but instead of adding a `DontRestart` flag, the orchestrator figures out on the fly whether a task should be restarted or not, using historic tasks for context.

In addition to changing the reconciliation logic, this adds manager-side timestamps to `TaskStatus`, and changes the task reaper to preserve at least `MaxAttempts + 1` tasks. It has the side effect of allowing the restart policy to be applied correctly across leader elections (previously, this state was tracked locally).

It's an alternative to both #2290 and #2327.
ping @aluzzardi @cyli