[manager/dispatcher] Fix deadlock in dispatcher #2744

Merged
merged 1 commit into from
Sep 10, 2018

Conversation

anshulpundir
Contributor

There was a rare case where the dispatcher could end up deadlocked when
calling Stop, which would cause the whole leadership change procedure to
go sideways, the dispatcher to pile up with goroutines, and the node to
crash.

In a nutshell, calls to the Session RPC end up in a (*Cond).Wait(),
waiting for a Broadcast that, once Stop is called, may never come. To
avoid that case, Stop, after being called and canceling the Dispatcher
context, does one final Broadcast to wake the sleeping waiters.

However, because the rpcRW lock, which prevents Stop from proceeding until
all RPCs have returned, was previously obtained BEFORE the call to
Broadcast, Stop would never reach that final Broadcast: it would wait on
the Session RPCs to release the rpcRW lock, which they could not do
until Broadcast was called. Hence, deadlock.

To fix this, we simply have to move this final Broadcast above the
attempt to acquire the rpcRW lock, allowing everything to proceed
correctly.

Signed-off-by: Drew Erny [email protected]
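
For illustration, a minimal, self-contained Go sketch of the corrected ordering in Stop. Only the rpcRW name is taken from the PR; the struct, the condMu/cond pair, and newDispatcher are simplified stand-ins invented for this sketch, not the dispatcher's real fields.

package dispatcher

import (
	"context"
	"sync"
)

// Simplified model of the dispatcher; only rpcRW is taken from the PR,
// the other fields are illustrative stand-ins.
type dispatcher struct {
	ctx    context.Context
	cancel context.CancelFunc

	rpcRW sync.RWMutex // read-held by every in-flight RPC, including Session

	condMu sync.Mutex
	cond   *sync.Cond // Session RPCs park in cond.Wait here
}

func newDispatcher() *dispatcher {
	ctx, cancel := context.WithCancel(context.Background())
	d := &dispatcher{ctx: ctx, cancel: cancel}
	d.cond = sync.NewCond(&d.condMu)
	return d
}

// Stop with the corrected ordering: cancel the context, Broadcast to wake
// any parked waiters, and only then block on the rpcRW write lock until all
// in-flight RPCs have drained.
func (d *dispatcher) Stop() {
	d.cancel()

	// Wake Session RPCs sleeping in cond.Wait so they can observe the
	// canceled context, return, and release their rpcRW read locks. With the
	// old ordering (rpcRW.Lock() before Broadcast), Stop blocked on the write
	// lock while the sleepers waited for a Broadcast that never came.
	d.condMu.Lock()
	d.cond.Broadcast()
	d.condMu.Unlock()

	// Now it is safe to wait for every in-flight RPC to return.
	d.rpcRW.Lock()
	d.rpcRW.Unlock()
}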

@anshulpundir anshulpundir changed the title Fix deadlock in dispatcher [manager/dispatcher] Fix deadlock in dispatcher Sep 10, 2018
@@ -342,6 +342,19 @@ func (d *Dispatcher) Stop() error {
d.cancel()
d.mu.Unlock()

d.processUpdatesLock.Lock()
// when we called d.cancel(), there may have been waiters currently
// waiting. they would be forever sleeping if we didn't call Broadcast here
@anshulpundir
Contributor Author

they => They

@@ -342,6 +342,19 @@ func (d *Dispatcher) Stop() error {
d.cancel()
d.mu.Unlock()

d.processUpdatesLock.Lock()
// when we called d.cancel(), there may have been waiters currently
@anshulpundir
Contributor Author

there may have been waiters

Maybe also describe who the waiters are?

// more waits will start after this Broadcast, because before waiting they
// check if the context is canceled.
//
// if that context cancelation check were not present, it would be possible
@anshulpundir
Contributor Author

It's not clear from this description that the waiters actually go into wait() holding the rpcRW. Do we want to clarify that?
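
To make that point concrete, here is a hedged sketch of the waiter side, extending the simplified types from the sketch above (this is not swarmkit's actual Session implementation): the RPC takes the rpcRW read lock before it parks in Wait, and it re-checks the canceled context each time it wakes, so Stop's single final Broadcast is enough to flush it out.

// Illustrative waiter over the simplified dispatcher above; not the real
// Session RPC, but it shows the two properties the comment relies on.
func (d *dispatcher) Session() error {
	d.rpcRW.RLock()         // held for the whole RPC, including while parked in Wait
	defer d.rpcRW.RUnlock() // Stop's rpcRW.Lock() blocks until this runs

	d.condMu.Lock()
	defer d.condMu.Unlock()
	for {
		// Checking the dispatcher context before sleeping guarantees that no
		// new Wait starts after Stop's cancel()+Broadcast, which would sleep
		// forever waiting for a Broadcast that will never come.
		if err := d.ctx.Err(); err != nil {
			return err
		}
		d.cond.Wait() // woken by normal update traffic or by Stop's final Broadcast
	}
}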

@codecov

codecov bot commented Sep 10, 2018

Codecov Report

Merging #2744 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2744      +/-   ##
==========================================
+ Coverage   61.68%   61.71%   +0.02%     
==========================================
  Files         134      134              
  Lines       21888    21888              
==========================================
+ Hits        13502    13508       +6     
+ Misses       6926     6917       -9     
- Partials     1460     1463       +3

@anshulpundir
Contributor Author

LGTM!

@anshulpundir anshulpundir merged commit e24c2a4 into moby:master Sep 10, 2018
tiborvass pushed a commit to tiborvass/docker that referenced this pull request Sep 22, 2018
docker-jenkins pushed a commit to docker-archive/docker-ce that referenced this pull request Sep 22, 2018
This also brings in these PRs from swarmkit:
- moby/swarmkit#2691
- moby/swarmkit#2744
- moby/swarmkit#2732
- moby/swarmkit#2729
- moby/swarmkit#2748

Signed-off-by: Tibor Vass <[email protected]>
Upstream-commit: cce1763d57b5c8fc446b0863517bb5313e7e53be
Component: engine
@antonybichon17

How is this tested?

@anshulpundir
Contributor Author

For regressions only, using unit/e2e tests, and that's what we'll recommend too. It's a race condition, so it can't be predictably reproduced. @antonybichon17
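
As a rough illustration only (not the project's actual tests): with the simplified types sketched earlier, the ordering bug can at least be exercised by running Stop against a batch of parked waiters in a loop under go test -race, even though, as noted above, the bad interleaving cannot be forced deterministically.

package dispatcher

import (
	"sync"
	"testing"
	"time"
)

// Hypothetical stress test over the simplified sketch above; run with
// `go test -race`. It gives the scheduler many chances to interleave Stop
// with parked Session calls, but it cannot force the bad interleaving.
func TestStopDoesNotDeadlock(t *testing.T) {
	for i := 0; i < 100; i++ {
		d := newDispatcher()

		var wg sync.WaitGroup
		for j := 0; j < 8; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				_ = d.Session() // parks until Stop's cancel+Broadcast
			}()
		}

		done := make(chan struct{})
		go func() {
			d.Stop()
			wg.Wait()
			close(done)
		}()

		select {
		case <-done:
		case <-time.After(10 * time.Second):
			t.Fatal("Stop appears deadlocked with parked Session waiters")
		}
	}
}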
