Use poweroff + polling instead of qvm-kill for forced shutdowns #534

Closed
eloquence wants to merge 6 commits into master

Conversation

eloquence
Member

@eloquence eloquence commented Apr 14, 2020

Status

Towards #498

Test plan

Preparatory steps

  1. Check out this branch in your dev VM.
  2. (Recommended to stay sane, but not required) Apply this patch to your Updater.py (patch Updater.py instant_karma.patch) in order to skip most of the update run, since this PR only impacts the shut-down-and-reboot sequence.
  3. Apply the changes in this PR to the launcher versions in /opt/securedrop/launcher and /srv/salt/launcher (if only the /opt copy is overwritten, the updater itself will replace it on the next run).
  4. tail -f ~/.securedrop_launcher/launcher.log to follow along as you test.

Testing

  1. Run /opt/securedrop/launcher/sdw-launcher.py --skip-delta 0. This forces an updater run.
    • Observe that system VMs (sys-usb, sys-firewall, sys-net) are powered off and back on.
    • Observe that SecureDrop VMs and system VMs are up and running at the end of the run.
    • Observe that the logs correctly report the forced shutdown and restart of system VMs.
  2. Rinse and repeat a few times.

Checklist

  • No packaging implications
  • make test in dom0 not run yet (but should not be impacted; launcher has its own test suite)

@eloquence
Member Author

In my own testing, I was not able to reproduce #498 with those changes applied, but I've had that experience before. On the other hand, I was able to trigger #498 quickly (this time with sys-whonix reporting libxenlight pain) by reducing the timeout in _wait_for_is_running to 1 second.

If this approach looks sound and we still see #498 after it, my next best bet would be to use a similar method for starting VMs.

@eloquence eloquence changed the base branch from 531-long-live-sys-usb to master April 15, 2020 00:57
Contributor

@emkll emkll left a comment

Thanks @eloquence, I took a first pass through this; the changes look sound in principle, and I've left some comments inline.

As you already mentioned earlier, coverage is not yet at 100% for Updater.py:

sdw_updater_gui/Updater.py          310     19    94%   22, 481-508, 533-537

    qubes.domains[vm].run("poweroff", user="root")
except subprocess.CalledProcessError as e:
    # Exit codes 1 and 143 may occur with successful shutdown; log others
    if e.returncode != 1 and e.returncode != 143:
Contributor


How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?

Member Author

In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.

Member Author

Testing qvm-run -u root sys-net poweroff a few times, I mostly got exit code 143 and occasionally exit code 1, without any clear pattern; in all cases the command completed. poweroff itself triggers SIGTERM, which is what exit code 143 signifies; I'm guessing that its parent shell process may sometimes terminate prematurely. I don't know why the command sometimes returns exit code 1, but I never get exit code 0.

CC'ing @marmarek in case he can shed light on the exit code behavior.

@marmarek

I think the poweroff command doesn't return until the system is really off. The SIGTERM is probably systemd terminating all user processes during shutdown. And exit code 1 is the qrexec connection being terminated (because of the domain shutdown) without sending any exit code first.
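For reference, 143 is consistent with the usual shell convention for signal-terminated processes: 128 plus the signal number, so SIGTERM (signal 15) yields 143. A quick illustrative check:

import signal

# Exit code convention for signal-terminated processes: 128 + signal number.
# SIGTERM is signal 15, so a SIGTERM'd process reports exit code 143.
assert 128 + int(signal.SIGTERM) == 143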

Member Author

That explanation makes sense to me, thanks for jumping in. :)

return _wait_for_is_running(vm, False)


def _wait_for_is_running(vm, expected, timeout=60, interval=0.2):
Contributor

Do we expect to reuse this polling method with other VM states? If so, we should consider using the get_power_state() function of a domain in the Qubes API, which would give us more standard Qubes terminology for power states.

qubes.domains["sys-net"].get_power_state()
Running

Member Author

I'm going to try making the _wait_for function generic, with this signature:

_wait_for(vm, condition, timeout=60, interval=0.2)

That way we can pass in any lambda we want. I personally prefer using a simple Boolean check for is_running but this gives us the flexibility to use different checks for different purposes.
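A minimal sketch of what such a generic helper and its callers could look like (illustrative only; the names, logging calls, and messages are assumptions, not the actual diff):

import logging
import time

sdlog = logging.getLogger(__name__)  # stand-in for the module's existing logger

def _wait_for(vm, condition, timeout=60, interval=0.2):
    """Poll condition(vm) until it returns True, or give up after timeout seconds."""
    start_time = time.time()
    stop_time = start_time + timeout
    while time.time() < stop_time:
        elapsed = time.time() - start_time
        if condition(vm):
            sdlog.info("VM '{}' reached the expected state after {:.2f} seconds".format(vm, elapsed))
            return True
        time.sleep(interval)
    sdlog.error("VM '{}' did not reach the expected state within {} seconds".format(vm, timeout))
    return False

# Hypothetical callers (qubes being the qubesadmin.Qubes() handle used elsewhere):
#   _wait_for(vm, lambda vm: not qubes.domains[vm].is_running())
#   _wait_for(vm, lambda vm: qubes.domains[vm].get_power_state() == "Halted")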

Member Author

This is now done in 111ccb0; I haven't tested it in Qubes yet. 111ccb0#diff-ad9e526e0e4fea47e3196b3e293a4d50R508 shows an example invocation.

    - False if it did not
    """
    start_time = time.time()
    stop_time = start_time + timeout
Contributor

Have you tested several values for interval here?

Member Author

No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.

    while time.time() < stop_time:
        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
Contributor

It seems like using the get_power_state value as described above might make more sense here, though it's functionally equivalent.

Member Author

(See above)


    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
Contributor

We should log something when qubes is None, and probably test these lines.

Member Author

If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch. The issues I see:

  • handling the error, but only if we're on Qubes (platform.linux_distribution is deprecated; platform.uname seems our best bet, though it really only tells us that we're on a host called dom0);
  • ensuring we don't log to disk when the file is imported as part of the test suite;
  • cleanly unit testing any added code without overcomplicating things.
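One possible shape, purely as a sketch (the dom0 hostname check via platform.uname and the helper name are assumptions, not a settled approach):

import logging
import platform

def _running_in_dom0():
    # Heuristic: on a Qubes system, the admin VM's hostname is "dom0".
    return platform.uname().node == "dom0"

try:
    import qubesadmin
    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
    if _running_in_dom0():
        # Only treat the missing module as an error where it matters, so that
        # importing this file from the test suite (outside Qubes) stays quiet.
        logging.getLogger(__name__).error("qubesadmin could not be imported in dom0")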

        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
            sdlog.info(
Contributor

For 5 runs, I get the following timings (in seconds):

  • sys-firewall: [17.94, 27.42, 17.94, 27.45, 17.50]
  • sys-net: [3.38, 3.25, 3.22, 0.00, 0.00]

Do you have any insight into the variance for sys-firewall? Have you seen a value of 0.00 for sys-net before?

Member Author

@eloquence eloquence Apr 15, 2020

If the API run command completes after the is_running state has already reached False, no polling will be required, and the elapsed time will be 0.00 (which is the best case). I've seen fairly inconsistent behavior when using qvm-run; sometimes it exits immediately, sometimes it aligns nicely with the shutdown sequence. I think we're at the mercy of the scheduler here. Fortunately, all poweroff invocations I've tried have been successful.

@eloquence eloquence marked this pull request as ready for review April 16, 2020 00:27
@eloquence
Member Author

Ran this version in Qubes a few times without errors. Like @emkll I'm noticing that polling is often not required after issuing a poweroff (it'll show polling for 0.00 seconds in the log), but it's intended to save our bacon in the cases when it is.

@eloquence
Member Author

> it's intended to save our bacon in the cases when it is.

Except it maybe doesn't, which raises the question of whether the added complexity is worth it. See the new report in #498 (comment). As a good practice, it's probably still best to avoid qvm-kill, but unfortunately it doesn't quite do the trick yet (unless there's something wrong with the logic).

@eloquence
Member Author

Per further investigation of #498, it now seems clear that at least some instances of the problem are not caused by our use of qvm-kill, but by a VM (in our observations, sd-whonix) starting and then crashing without notice to the user, possibly due to memory management issues.

I would still recommend proceeding with the removal of qvm-kill, as killing VMs may exacerbate such shutdown and startup issues if a VM is not properly shut down before it is brought up again. Provided we have no other use for it, we should be able to remove this added polling complexity once Qubes adds a --force parameter to qvm-shutdown, which is already in the master branch for 4.1 per #498 (comment).
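If and when that lands, the polling logic could presumably be replaced with a plain forced shutdown, e.g. something along the lines of qvm-shutdown --wait --force <vm> (the exact invocation is an assumption and will depend on the Qubes release the flag ships in).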

@eloquence
Member Author

Since this does not in fact fully resolve #498, we've agreed it can wait until after the next RPM release (0.3.0). Deferring from current sprint (4/22-5/6) to near-term backlog.

@eloquence
Member Author

eloquence commented May 5, 2020

@conorsch @emkll Judging by the latest reports on #498, it now seems likely that the qmemman-related fixes could resolve the issue. In addition, a --force parameter for qvm-shutdown is expected to ship with Qubes 4.1 per Marek's comment here.

I'm curious how you both feel about the value of this change, with this in mind. I would suggest that we wait for the qmemman fixes to land, re-test, and if we don't observe #498 after that, close this PR and wait for the --force argument to land. Unless you feel that killing VMs is just generally something to avoid ASAP for other reasons, in which case we can add this PR to the next sprint.

@eloquence
Member Author

@marmarek We're eager to stop using qvm-kill, but would also prefer to avoid introducing the additional complexity of this polling logic.

You had mentioned here that you could look into backporting the --force arg for qvm-shutdown to 4.0; do y'all have bandwidth to do so in the next few weeks? If not we can bite the bullet and live with this polling solution for a while.

@marmarek

marmarek commented May 7, 2020

Yes, can do this week.

@eloquence
Member Author

Thanks a lot @marmarek for poking at this.

Leaving a note here that we'll also need to use this --force argument for sys-whonix, in case it is currently in use as a NetVM by the default Tor AppVM that ships with Qubes (anon-whonix).

@eloquence
Member Author

I'm going to close this for now until we determine we need it. Would appreciate keeping the branch on the remote for the time being.

@eloquence eloquence closed this Jun 26, 2020