[manager/dispatcher] Fix deadlock in dispatcher #2744
Conversation
manager/dispatcher/dispatcher.go
Outdated
@@ -342,6 +342,19 @@ func (d *Dispatcher) Stop() error {
	d.cancel()
	d.mu.Unlock()

	d.processUpdatesLock.Lock()
	// when we called d.cancel(), there may have been waiters currently
	// waiting. they would be forever sleeping if we didn't call Broadcast here
they => They
manager/dispatcher/dispatcher.go
Outdated
@@ -342,6 +342,19 @@ func (d *Dispatcher) Stop() error {
	d.cancel()
	d.mu.Unlock()

	d.processUpdatesLock.Lock()
	// when we called d.cancel(), there may have been waiters currently
there may have been waiters
Maybe also describe who the waiters are?
manager/dispatcher/dispatcher.go
Outdated
	// more waits will start after this Broadcast, because before waiting they
	// check if the context is canceled.
	//
	// if that context cancelation check were not present, it would be possible
It's not clear from this description that the waiters actually go into wait() holding the rpcRW. Do we want to clarify that?
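To make the reviewer's point concrete, here is a minimal sketch of the waiter side, not the actual swarmkit code: the names rpcRW and processUpdatesLock come from the diff and discussion above, while processUpdatesCond, updatesPending, and the overall struct shape are assumptions made purely for illustration. The key detail is that the RPC handler is still holding rpcRW's read lock when it parks in Wait.

```go
package dispatcher

import (
	"context"
	"sync"
)

// dispatcher is a simplified, hypothetical stand-in for the real Dispatcher.
type dispatcher struct {
	ctx                context.Context
	rpcRW              sync.RWMutex // read-held for the duration of every RPC
	processUpdatesLock sync.Mutex
	processUpdatesCond *sync.Cond // Wait/Broadcast guarded by processUpdatesLock
	updatesPending     bool       // stand-in for "work is still being processed"
}

// session stands in for the Session RPC handler. It takes rpcRW's read lock
// for the lifetime of the call, so Stop cannot acquire the write lock until
// this goroutine returns -- which it cannot do while parked in Wait.
func (d *dispatcher) session() {
	d.rpcRW.RLock()
	defer d.rpcRW.RUnlock()

	d.processUpdatesLock.Lock()
	defer d.processUpdatesLock.Unlock()

	// Checking cancellation before each Wait is what guarantees that no new
	// waiters can park after Stop's final Broadcast.
	for d.updatesPending && d.ctx.Err() == nil {
		d.processUpdatesCond.Wait()
	}
}
```

If that cancellation check were missing, a waiter could park just after Stop's final Broadcast and sleep forever, which is the scenario the quoted comment is guarding against.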
Force-pushed from c4f5c5a to 4f15251
Codecov Report
@@            Coverage Diff             @@
##           master    #2744      +/-   ##
==========================================
+ Coverage   61.68%   61.71%   +0.02%
==========================================
  Files         134      134
  Lines       21888    21888
==========================================
+ Hits        13502    13508       +6
+ Misses       6926     6917       -9
- Partials     1460     1463       +3
LGTM!
This also brings in these PRs from swarmkit: - moby/swarmkit#2691 - moby/swarmkit#2744 - moby/swarmkit#2732 - moby/swarmkit#2729 - moby/swarmkit#2748 Signed-off-by: Tibor Vass <[email protected]>
How is this tested?
Only for regressions, using unit/e2e tests, and that's what we'll recommend too. It's a race condition, so it can't be predictably reproduced @antonybichon17
There was a rare case where the dispatcher could end up deadlocked when
calling Stop, which would cause the whole leadership change procedure to
go sideways, the dispatcher to pile up with goroutines, and the node to
crash.
In a nutshell, calls to the Session RPC end up in a (*Cond).Wait(),
waiting for a Broadcast that, once Stop is called, may never come. To
avoid that case, Stop, after being called and canceling the Dispatcher
context, does one final Broadcast to wake the sleeping waiters.
However, because the rpcRW lock, which stops Stop from proceeding until
all RPCs have returned, was previously obtained BEFORE the call to
Broadcast, Stop would never reach this final Broadcast call, waiting on
the Session RPCs to release the rpcRW lock, which they could not do
until Broadcast was called. Hence, deadlock.
To fix this, we simply have to move this final Broadcast above the
attempt to acquire the rpcRW lock, allowing everything to proceed
correctly.
Signed-off-by: Drew Erny <[email protected]>
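Read as code, the ordering the commit message describes looks roughly like the sketch below. It continues the earlier hypothetical dispatcher struct (same assumed file and field names) and is illustrative only, not the actual Stop implementation; the point is simply that the Broadcast must happen before Stop tries to take the rpcRW write lock.

```go
// stop sketches the fixed ordering. cancelDispatcher stands in for d.cancel().
func (d *dispatcher) stop(cancelDispatcher context.CancelFunc) {
	cancelDispatcher() // the dispatcher context is now done

	// Wake every Session RPC parked in Wait *before* taking the rpcRW write
	// lock. Each waiter re-checks the canceled context, returns, and releases
	// its read lock; because waiters check cancellation before parking, no
	// new ones can appear after this Broadcast.
	d.processUpdatesLock.Lock()
	d.processUpdatesCond.Broadcast()
	d.processUpdatesLock.Unlock()

	// Before the fix, this write lock was acquired first: Stop blocked here
	// waiting for the RPCs' read locks, while the RPCs blocked in Wait
	// waiting for the Broadcast that Stop had not yet reached -- a deadlock.
	d.rpcRW.Lock()
	d.rpcRW.Unlock()
}
```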