
Adds "try" and "monitor-retry" options to consul lock command #1567

Merged: 7 commits into master from f-lock-try, Jan 6, 2016

Conversation

slackpad (Contributor) commented Jan 6, 2016

The "try" option should close #780 and the "monitor-retry" option should close #1162.

Previously, it would try once "up to" the timeout, but in practice it would just fall through. This modifies the behavior to block until the timeout has been reached.
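
For reference, a rough sketch of how the new flags might be invoked (the KV prefix and the wrapped command below are placeholders, not taken from this PR):

    # Give up if the lock cannot be acquired within 10 seconds, and tolerate
    # a few transient 500 errors from the agent while the lock is held.
    consul lock -try=10s -monitor-retry=3 service/my-job/lock /usr/local/bin/my-job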
slackpad (Contributor, Author) commented Jan 6, 2016

Tweaked things in that last commit so it always waits for the timeout period trying to get the lock vs. trying once "up to" the timeout, which usually just falls through right away. This should better match what users expect.
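
As a rough illustration of that behavior (this is a sketch, not the actual command/lock.go code; attempt stands in for a hypothetical single-shot acquisition call):

    package lockutil // hypothetical helper package, not part of Consul

    import (
        "errors"
        "time"
    )

    // tryLock keeps retrying acquisition until the wait duration elapses,
    // instead of making a single attempt that may fall through right away.
    func tryLock(wait, retryInterval time.Duration, attempt func() (bool, error)) error {
        deadline := time.Now().Add(wait)
        for {
            held, err := attempt()
            if err != nil {
                return err
            }
            if held {
                return nil
            }
            if time.Now().After(deadline) {
                return errors.New("timeout trying to acquire lock")
            }
            time.Sleep(retryInterval)
        }
    }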

pairs, meta, err := kv.List(s.opts.Prefix, opts)
if err != nil {
    // TODO (slackpad) - Make a real error type here instead of using
    // a string check.
    const serverError = "Unexpected response code: 500"
Member commented on the snippet above:
Maybe at least move this out of the retry loop? Maybe we could use like an IsServerError(error) since I think we do this in a bunch of other places too IIRC.

slackpad (Contributor, Author) replied:
Haha, I was pasting. Good call.
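
For illustration, a minimal version of the helper suggested above might look like this (the name and matched string follow the snippet under review; Consul's actual implementation may differ):

    package lockutil // hypothetical helper package, not part of Consul

    import "strings"

    // IsServerError reports whether an error returned by the HTTP API
    // corresponds to a 500 response, based on the error string checked above.
    func IsServerError(err error) bool {
        return err != nil && strings.Contains(err.Error(), "Unexpected response code: 500")
    }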

ryanuber (Member) commented Jan 6, 2016

Looks like you already addressed the minor comments. This LGTM otherwise 🚢

slackpad added a commit that referenced this pull request Jan 6, 2016
Adds "try" and "monitor-retry" options to `consul lock` command
@slackpad slackpad merged commit 06926f3 into master Jan 6, 2016
@slackpad slackpad deleted the f-lock-try branch January 6, 2016 19:50
issacg commented Jan 17, 2016

Question: Is it possible to use the new monitor-retry with a (slightly) larger timeout to facilitate restarting consul servers without damaging the quorum? I'm not sure at first glance whether there would be unintended side-effects, but it seems that if you gave a large enough value to defaultMonitorRetry (say 60), one could use a configuration management tool to wrap something like service consul restart on all servers in each datacenter simultaneously, without needing to time the rolling update. If it takes more than 60 seconds for the restart to complete and release the lock, then something is likely wrong with your setup anyway. It's a bit naive, but it seems the code should allow this, no?
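
For concreteness, the kind of wrapper being described might look roughly like this (the KV prefix is a placeholder, and whether this is actually safe is exactly the question):

    # Run on every server at once; each node blocks until it holds the lock,
    # then restarts its local Consul. A large -monitor-retry is meant to keep
    # the lock monitor alive while the local agent bounces.
    consul lock -monitor-retry=60 locks/consul-restart 'service consul restart'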

Also, the naming of DefaultMonitorRetryTime implies that it ought to be overridable, but based on the usage of defaultMonitorRetry and monitor-retry it seems intended to be hardcoded at 1s intervals. I'm just wondering what the folks who wrote the code intended.

Thanks!

slackpad (Contributor, Author) commented:
Hi @IsaacG, we didn't design this feature with this use case in mind, but it seems like it could be used as part of a solution for this. One problem I can see, though, is that being able to take the lock doesn't necessarily mean it's safe to take a server down, so doing this safely would require more logic. For example, if one server out of three died during the upgrade and lost the lock, a second one would be able to get it and restart itself, putting the cluster into an outage condition. You'd still want to confirm things like last_log_index on the newly-added server before rolling another one - https://www.consul.io/docs/guides/servers.html.

For your second question, I named DefaultMonitorRetryTime that way in case we decided to make it configurable later, but opted for a fixed 1-second interval with just a configurable number of retries to keep the number of config knobs down. The interval between retries isn't as important as the total time spent retrying, so exposing only the retry count seemed like the right tradeoff. I suppose if you had tons of nodes trying locks you might want to make the interval larger to avoid a thundering herd, but that's probably not a super common use case for locks.

issacg commented Jan 24, 2016

@slackpad thanks for the comments!

In some very initial toying around, I already discovered that the 1-second DefaultMonitorRetryTime is a problem when trying to do a rolling upgrade, because consul lock died really fast. I don't have the logs in front of me ATM, but IIRC the problem was that the consul client was unable to talk to the agent, and thus tried to immediately release the lock, which, again, it couldn't do. I haven't had time to delve into the source yet (out of round tuits for now 😞), but I hypothesized that increasing the retry time might avoid this.

Also, as you pointed out, there are a lot of rough edges that I'm admittedly completely ignoring for now. I just wanted to start with something as a PoC.

Successfully merging this pull request may close these issues:

consul lock behaviour on leader election
Add consul lock -try

3 participants