[bump_v17.06] backport fix deadlock in dispatcher #2753

anshulpundir · 2018-09-19T22:42:11Z

Backport of #2744
cherry-pick was clean.

There was a rare case where the dispatcher could end up deadlocked when
calling stop, which would cause the whole leadership change procedure to
go sideways, the dispatcher to pile up with goroutines, and the node to
crash.

In a nutshell, calls to the Session RPC end up in a (*Cond).Wait(),
waiting for a Broadcast that, once Stop is called, may never come. To
avoid that case, Stop, after being called and canceling the Dispatcher
context, does one final Broadcast to wake the sleeping waiters.

However, because the rpcRW lock, which stops Stop from proceeding until
all RPCs have returned, was previously obtained BEFORE the call to
Broadcast, Stop would never reach this final Broadcast call, waiting on
the Session RPCs to release the rpcRW lock, which they could not do
until Broadcast was called. Hence, deadlock.

To fix this, we simply have to move this final Broadcast above the
attempt to acquire the rpcRW lock, allowing everything to proceed
correctly.

Signed-off-by: Drew Erny <[email protected]>

dperny · 2018-09-20T15:50:04Z

@anshulpundir there's a race detector failure in the test...

codecov · 2018-09-20T16:46:49Z

Codecov Report

Merging #2753 into bump_v17.06 will decrease coverage by 0.18%.
The diff coverage is 100%.

@@              Coverage Diff               @@
##           bump_v17.06   #2753      +/-   ##
==============================================
- Coverage        61.28%   61.1%   -0.19%     
==============================================
  Files              121     121              
  Lines            20215   20172      -43     
==============================================
- Hits             12389   12326      -63     
- Misses            6452    6491      +39     
+ Partials          1374    1355      -19

anshulpundir requested review from thaJeztah and dperny September 19, 2018 22:42

wk8 approved these changes Sep 20, 2018

dperny merged commit b0a9eab into moby:bump_v17.06 Sep 20, 2018

thaJeztah changed the title ~~Fix deadlock in dispatcher~~ [bump_v17.06] backport fix deadlock in dispatcher Oct 1, 2018
