Don't use port open check to determine if reboot completed. Fixes #856. #857
Changed output with this patch:
|
Alternative output that can be created by this:
So the output is not as pretty as before due to the errors, but at least it works without a race condition.
a3f6b8a to 6bc3e33
I pushed a small fix ( |
6bc3e33 to 37bc5af
# and show an 'x' as progress indicator in that case.
self.log_continue("x")
if last_reboot_output is not None and last_reboot_output != pre_reboot_last_reboot_output:
    break
Shouldn't this call ssh.reset() eventually?
Based on my current understanding, this is not needed, because the only thing reset() does is call shutdown(), which exits the control master process, and I think that gets cleaned up automatically when the connection dies due to the machine rebooting.
Hm, in theory you're right, but if the machine didn't manage to properly close the TCP socket (for example due to a hard reboot), the control master process is still alive.
Great point, you're right. I've changed this back to always reset at reboot.
I've also re-tested that change with the EC2 and Hetzner backends (though only Hetzner can do --hard reboots, where this is relevant).
All good? This PR is the required base of my new PR #948, so it would be cool if we could finish this one.
nixops/backends/__init__.py
Outdated
# manner, we compare the output of `last reboot` before and after
# the reboot. Once the output has changed, the reboot is done.
def get_last_reboot_output():
    return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True).rstrip()
Both the Hetzner rescue system and the actual NixOS system are using systemd, so maybe it's a better idea to use systemd-analyze because it fails whenever bootup is not finished. At least that would avoid the current vs. last string comparison.
I don't think that can work because when NixOS is waiting for nixops keys to be uploaded (so, right here), you'll get
# systemd-analyze
Bootup is not yet finished. Please try again later.
Oh, right... you're correct, and after all we don't care about a truly finished reboot but only want to make sure the machine has rebooted, so last reboot should work.
37bc5af to 3cbb613
Updated the PR, I noticed I have to not only catch |
3cbb613 to 2fff961
Made another small improvement so that |
@nh2: Approved, but see my last comment.
nixops/backends/__init__.py
Outdated
# command invocation changes.
# We use timeout=10 so that the user gets some sense
# of progress, as reboots can take a long time.
return self.run_command('last reboot --time-format iso | head -n1', capture_stdout=True, timeout=10).rstrip()
This pipe could be broken.
head -n1 -> head -n 1
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/head.html
Fixed
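As an aside on the two comments above, one way to sidestep both the pipe and the `head` option spelling is to take the first line on the Python side. This is only a sketch of that alternative, not what the PR does (the PR keeps the pipe and switches to `head -n 1`). The helper names are hypothetical, and the command runs locally here purely for illustration, whereas the PR runs it on the remote machine.

```python
import subprocess


def first_line(text: str) -> str:
    """Return the first line of text without its line ending, or '' if empty."""
    lines = text.splitlines()
    return lines[0].rstrip() if lines else ""


def last_reboot_marker(timeout: float = 10.0) -> str:
    # Runs `last` locally for illustration; the PR runs it over SSH instead.
    result = subprocess.run(
        ["last", "reboot", "--time-format", "iso"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return first_line(result.stdout)
```

The trade-off is that the full `last reboot` output is transferred before being truncated, which is usually small but not bounded.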
2fff961 to f83e244
OK, I've updated the commits to address the remaining comments.
…OS#856. The old approach, waiting for the machine to no longer have an open port and then waiting for the port to be open again, was insufficient because of a race condition: the machine could reboot so quickly that the port was immediately open again without nixops noticing that it went down. I experienced this on a Hetzner cloud server. The new approach checks whether the output of `last reboot` on the remote side has changed, which is not racy.
f83e244 to 8f94a85
Hello! Thank you for this PR. In the past several months, some major changes have taken place in
This is all accumulating in to what I hope will be a NixOps 2.0
My hope is that by adding types and more thorough automated testing,
However, because of the major changes, it has become likely that this
If you would like to see this merge, please bring it up to date with
Thank you again for the work you've done here, I am sorry to be
Graham
See #856
The old approach, waiting for the machine to no longer have an open
port and then waiting for the port to be open again, was insufficient
because of a race condition: the machine could reboot so quickly
that the port was immediately open again without nixops noticing
that it went down. I experienced this on a Hetzner cloud server.
The new approach checks whether the output of `last reboot` on the
remote side has changed, which is not racy.
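The idea described above can be sketched as a small polling loop. This is an illustration under stated assumptions rather than the code in the PR: `wait_for_reboot`, `get_marker`, `poll_interval` and `max_attempts` are hypothetical names, and `get_marker` stands in for whatever runs `last reboot` on the remote machine and returns its first line (or None while the machine is unreachable).

```python
import time
from typing import Callable, Optional


def wait_for_reboot(
    get_marker: Callable[[], Optional[str]],
    pre_reboot_marker: str,
    poll_interval: float = 1.0,
    max_attempts: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the boot marker differs from the value seen before reboot."""
    for _ in range(max_attempts):
        try:
            marker = get_marker()
        except OSError:
            # Connection refused or timed out: the machine is probably down.
            marker = None
        # A changed marker proves a new boot happened, no matter how briefly
        # (or whether) the SSH port was observed to be closed.
        if marker is not None and marker != pre_reboot_marker:
            return marker
        sleep(poll_interval)
    raise TimeoutError("machine did not report a new boot in time")
```

Because the check compares state recorded on the machine itself, it does not matter whether the poller ever catches the window in which the port is closed, which is exactly the race the port-based check could lose.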