Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix conflict between handleReload and antiEntropy critical sections #1235

Merged
merged 3 commits into from
Sep 17, 2015

Conversation

wuub
Copy link
Contributor

@wuub wuub commented Sep 11, 2015

see: #1173 #1173

Reasoning: somewhere during consul development Pause()/Resume() and
PauseSync()/ResumeSync() were added to protect larger changes to
agent's localState. A few of the places that it tries to protect are:

  • (a *Agent) AddService(...) # part of the method

  • (c *Command) handleReload(...) # almost the whole method

  • (l *localState) antiEntropy(...)# isPaused() prevents syncChanges()

    The main problem is, that in the middle of handleReload(...)'s
    critical section it indirectly (loadServices()) calls AddService(...).
    AddService() in turn calls Pause() to protect itself against
    syncChanges(). At the end of AddService() a defered call to Resume() is
    made.

    With the current implementation, this releases
    isPaused() "lock" in the middle of handleReload() allowing antiEntropy
    to kick in while configuration reload is still in progress.
    Specifically almost all services and probably all check are unloaded
    when syncChanges() is allowed to run.

    This in turn can causes massive service/check de-/re-registration,
    and since checks are by default registered in the critical state,
    a majority of services on a node can be marked as failing.
    It's made worse with automation, often calling consul reload in close
    proximity on many nodes in the cluster.

    This change basically turns Pause()/Resume() into P()/V() of
    a garden-variety semaphore. Allowing Pause() to be called multiple times,
    and releasing isPaused() only after all matching/defered Resumes() are
    called as well.

    TODO/NOTE: as with many semaphore implementations, it might be reasonable
    to panic() if l.paused ever becomes negativ

see: hashicorp#1173 hashicorp#1173

Reasoning: somewhere during consul development Pause()/Resume() and
PauseSync()/ResumeSync() were added to protect larger changes to
agent's localState.  A few of the places that it tries to protect are:

- (a *Agent) AddService(...)      # part of the method
- (c *Command) handleReload(...)  # almost the whole method
- (l *localState) antiEntropy(...)# isPaused() prevents syncChanges()

The main problem is, that in the middle of handleReload(...)'s
critical section it indirectly (loadServices()) calls  AddService(...).
AddService() in turn calls Pause() to protect itself against
syncChanges(). At the end of AddService() a defered call to Resume() is
made.

With the current implementation, this releases
isPaused() "lock" in the middle of handleReload() allowing antiEntropy
to kick in while configuration reload is still in progress.
Specifically almost all services and probably all check are unloaded
when syncChanges() is allowed to run.

This in turn can causes massive service/check de-/re-registration,
and since checks are by default registered in the critical state,
a majority of services on a node can be marked as failing.
It's made worse with automation, often calling `consul reload` in close
proximity on many nodes in the cluster.

This change basically turns Pause()/Resume() into P()/V() of
a garden-variety semaphore. Allowing Pause() to be called multiple times,
and releasing isPaused() only after all matching/defered Resumes() are
called as well.

TODO/NOTE: as with many semaphore implementations, it might be reasonable
to panic() if l.paused ever becomes negative.
@slackpad
Copy link
Contributor

This looks good. I don't think my patches under 19e10e7 are necessary since the lock is already there to protect critical sections.

Want to take a crack at preventing multiple reloads as well?

}

// Resume is used to resume state synchronization
func (l *localState) Resume() {
atomic.StoreInt32(&l.paused, 0)
atomic.AddInt32(&l.paused, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this returns the new value, I think I would put the panic in here to immediately catch any mismatch of these if it underflows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proper check added in the c4537ed.
I'm more concerned about overflows TBH, there is no way to detect them at the moment and one missing defer to .Resume() can block antiEntropy forever.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this as well. We could enforce an arbitrary upper limit on the semaphore to catch the case where someone is walking it up, which is probably the most likely programming error, but it won't catch an oddball bad path that doesn't get called often.

@wuub wuub changed the title make Pause()/Resume()/isPaused() behave more like a semaphore fix conflict between handleReload and antiEntropy critical sections Sep 17, 2015
@wuub
Copy link
Contributor Author

wuub commented Sep 17, 2015

@slackpad

Want to take a crack at preventing multiple reloads as well?

I'm pretty sure handleReload() is only called from handleSignals(), and AFAICT handleSignals is fully synchronous (https://github.com/hashicorp/consul/blob/master/command/agent/command.go#L797) and as such doesn't need a mutex to prevent multiple reloads. (/cc: @ryanuber ?)

It seems that all the interference was caused by handleReload and antiEntropy fighting with each other NOT multiple configuration reloads running simultaneously. It's just easier to trigger a Pause/Resume conflict with antiEntropy if you force consul into contant reload mode by spaming SIGHUP for a while.

slackpad added a commit that referenced this pull request Sep 17, 2015
fix conflict between handleReload and antiEntropy critical sections
@slackpad slackpad merged commit 0b05dbe into hashicorp:master Sep 17, 2015
@slackpad
Copy link
Contributor

Sounds good. Thanks for your fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants