taskreaper: Wait for tasks to stop running #1948

Merged

Conversation

aaronlehmann
Collaborator

Currently, the task reaper only waits for a task's desired state to go
past "running" before deleting it. This can cause problems with rolling
updates when task-history-limit is set to 1. The old task can get
deleted by the reaper before the agent has a chance to update its actual
state to "shutdown", so the update never sees that the task has
completed, and has to wait awhile for a timeout.

This fixes it by making the task reaper only delete tasks whose desired
and actual states have both moved past "running". It's also necessary to
keep slots in the "dirty" list until there is only one task in that slot
with desired state or actual state <= "running", so that the old task
still gets cleaned up once its actual state moves past "running". Finally,
"deleteTasks" is changed to a map so that a task which is both part of a
dirty slot and orphaned won't cause two delete attempts (one of which
would fail).
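
For illustration, here is a minimal sketch of the two checks described above
(not the actual diff; the package and helper names are made up, but the task
fields and the TaskStateRunning constant follow swarmkit's api package):

    package reaper // illustrative package name, not the real location

    import "github.com/docker/swarmkit/api"

    // shouldReap reports whether a task can be deleted: both its desired
    // state and its observed (actual) state must have moved past "running".
    func shouldReap(t *api.Task) bool {
        return t.DesiredState > api.TaskStateRunning &&
            t.Status.State > api.TaskStateRunning
    }

    // slotStillDirty reports whether a slot should stay in the "dirty" set.
    // The slot is kept until at most one of its tasks still has desired
    // state or actual state <= "running", so the old task is still cleaned
    // up once its actual state finally advances.
    func slotStillDirty(slotTasks []*api.Task) bool {
        live := 0
        for _, t := range slotTasks {
            if t.DesiredState <= api.TaskStateRunning ||
                t.Status.State <= api.TaskStateRunning {
                live++
            }
        }
        return live > 1
    }

The deleteTasks collection itself becomes a map keyed by task ID, so a task
that is both part of a dirty slot and orphaned is only queued for deletion
once.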

Note that this means tasks on unavailable nodes will stay around for a
while, until the "orphaned" state is reached.

Tested by vendoring into docker and using the repro steps from moby/moby#28291.

cc @aluzzardi @dongluochen

@dongluochen
Contributor

Flaky test?

--- FAIL: TestDemoteToSingleManager (89.78s)
	Error Trace:	integration_test.go:118
			integration_test.go:351
	Error:		Received unexpected error worker node mrqdmyhbrpofzeljref7wt4v6 should not have manager status, returned &ManagerStatus{RaftID:8991989376972981745,Addr:127.0.0.1:33826,Leader:false,Reachability:UNREACHABLE,}
			github.com/docker/swarmkit/manager/state/raft/testutils.PollFuncWithTimeout
				/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/manager/state/raft/testutils/testutils.go:76: polling failed

@aaronlehmann
Collaborator Author

Yeah, definitely unrelated.

#1939 might help.

@codecov-io

codecov-io commented Feb 11, 2017

Codecov Report

Merging #1948 into master will increase coverage by 0.46%.

@@            Coverage Diff             @@
##           master    #1948      +/-   ##
==========================================
+ Coverage   54.05%   54.51%   +0.46%     
==========================================
  Files         108      108              
  Lines       18547    18548       +1     
==========================================
+ Hits        10026    10112      +86     
+ Misses       7289     7198      -91     
- Partials     1232     1238       +6

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ccdf5e0...2d575ef. Read the comment docs.

@dongluochen
Contributor

LGTM

@aluzzardi
Member

LGTM
