consul lock may spawn the watched process at shutdown #1155

Closed
mfischer-zd opened this issue Aug 4, 2015 · 1 comment

@mfischer-zd
Contributor

We saw this unusual chain of events reported by consul lock after sending it a SIGTERM today:

2015-08-03_22:39:06.20297 consul:Setting up lock at path: xxx
2015-08-03_22:39:06.20308 consul:Attempting lock acquisition
# Lock was never acquired because another candidate held it

# SIGTERM was then sent
2015-08-04_19:56:20.23976 consul:Shutdown triggered, killing child
2015-08-04_19:56:20.23988 consul:No child process to kill
# Correct, but then...
2015-08-04_19:56:20.23992 consul:Starting handler 'myapp'
# Wait, what?
2015-08-04_19:56:20.25634 consul:Cleanup succeeded

After this, consul exited and myapp was orphaned and reparented to PID 1.

@mfischer-zd
Contributor Author

I've investigated this matter further and it appears to be a race condition that can occur when a SIGTERM is sent to multiple consul lock processes at approximately the same time.

The short story is this:

  1. host1 (current lock holder) receives SIGTERM and releases the lock
  2. host2 acquires the lock and runs LockCommand.startChild() in a goroutine
  3. host2: in startChild(), cmd.Start() is reached, forks and execs app
  4. host2 receives SIGTERM
  5. host2 calls LockCommand.killChild()
  6. host2: killChild() tries to evaluate cmd.Process, but the goroutine running startChild() hasn't set it yet, so app isn't sent the signal (killChild() returns early)
  7. host2 exits, leaving an orphaned app
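
To illustrate the kind of synchronization that closes this window, here is a minimal sketch. It is not consul's code and not necessarily what the referenced commits do: `supervisor`, `process`, `fakeProcess`, and `errShuttingDown` are hypothetical names, and the child is faked so the example runs without spawning anything. The idea is that starting the child and recording its handle happen under one mutex, and a shutdown flag refuses any start that arrives after the kill.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// process is a hypothetical stand-in for *os.Process.
type process interface {
	Kill() error
}

// fakeProcess records whether Kill was called.
type fakeProcess struct{ killed bool }

func (p *fakeProcess) Kill() error { p.killed = true; return nil }

var errShuttingDown = errors.New("shutdown in progress, not starting child")

// supervisor is a hypothetical analogue of the lock command's child handling.
type supervisor struct {
	mu       sync.Mutex
	child    process
	shutdown bool
}

// startChild launches the child and records its handle under one lock, so
// killChild can never observe "started but not yet recorded".
func (s *supervisor) startChild(launch func() (process, error)) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.shutdown {
		return errShuttingDown
	}
	p, err := launch()
	if err != nil {
		return err
	}
	s.child = p
	return nil
}

// killChild marks shutdown first, then kills the recorded child, if any.
func (s *supervisor) killChild() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.shutdown = true
	if s.child == nil {
		return nil
	}
	err := s.child.Kill()
	s.child = nil
	return err
}

func main() {
	s := &supervisor{}
	launch := func() (process, error) { return &fakeProcess{}, nil }

	// Race a start against a kill, as in steps 3 to 6 above.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		_ = s.startChild(launch)
	}()
	_ = s.killChild()
	wg.Wait()

	// Whichever side wins the lock, no child is left behind.
	fmt.Println("child left running:", s.child != nil)
}
```

If startChild wins the lock, killChild sees the recorded handle and kills it; if killChild wins, the later startChild is refused. In both orderings nothing is orphaned.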

mfischer-zd added a commit to mfischer-zd/consul that referenced this issue Aug 5, 2015
Fix a race condition between startChild() and killChild() that could
lead to an orphaned managed process.

Fixes hashicorp#1155
mfischer-zd added a commit to zendesk/consul that referenced this issue Aug 25, 2015
Fix a race condition between startChild() and killChild() that could
lead to an orphaned managed process.

Fixes hashicorp#1155