Use poweroff + polling instead of qvm-kill for forced shutdowns #534

Closed
eloquence wants to merge 6 commits into master

Conversation

eloquence
Member

@eloquence eloquence commented Apr 14, 2020

Status

Towards #498

Test plan

Preparatory steps

  1. Check out this branch in your dev VM.
  2. (Recommended to stay sane, but not required) Apply this patch to your Updater.py (patch Updater.py instant_karma.patch) in order to skip most of the update run, since this PR only impacts the shut-down-and-reboot sequence.
  3. Apply the changes in this PR to the launcher versions in /opt/securedrop/launcher and /srv/salt/launcher (if only the /opt copy is overwritten, the updater itself will replace it on the next run).
  4. tail -f ~/.securedrop_launcher/launcher.log to follow along as you test.

Testing

  1. Run /opt/securedrop/launcher/sdw-launcher.py --skip-delta 0. This forces an updater run.
    • Observe that system VMs (sys-usb, sys-firewall, sys-net) are powered off and back on.
    • Observe that SecureDrop VMs and system VMs are up and running at the end of the run.
    • Observe that the logs correctly report the forced shutdown and restart of system VMs.
  2. Rinse and repeat a few times.

Checklist

  • No packaging implications
  • make test in dom0 not run yet (but should not be impacted; launcher has its own test suite)

@eloquence
Member Author

In my own testing, I was not able to reproduce #498 with those changes applied, but I've had that experience before. On the other hand, I was able to trigger #498 quickly (this time with sys-whonix reporting libxenlight pain) by reducing the timeout in _wait_for_is_running to 1 second.

If this approach looks sound and we still see #498 after it, my next best bet would be to use a similar method for starting VMs.

@eloquence eloquence changed the base branch from 531-long-live-sys-usb to master April 15, 2020 00:57
Contributor

@emkll emkll left a comment

Thanks @eloquence, I took a first pass through this; the changes look sound in principle, and I've left some comments inline.

As you already mentioned earlier, coverage is not yet at 100% for Updater.py:

sdw_updater_gui/Updater.py          310     19    94%   22, 481-508, 533-537

    qubes.domains[vm].run("poweroff", user="root")
except subprocess.CalledProcessError as e:
    # Exit codes 1 and 143 may occur with successful shutdown; log others
    if e.returncode != 1 and e.returncode != 143:
Contributor


How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?

Member Author

In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.

Member Author

Testing qvm-run -u root sys-net poweroff a few times, I mostly got exit code 143 and occasionally exit code 1, without any clear pattern; in all cases the command completed. poweroff itself triggers SIGTERM, which is what exit code 143 signifies; I'm guessing that its parent shell process may sometimes terminate prematurely. I don't know why the command sometimes returns exit code 1, but I never get exit code 0.

CC'ing @marmarek in case he can shed light on the exit code behavior.

@marmarek

I think the poweroff command doesn't return until the system is really off. The SIGTERM is probably systemd terminating all user processes during shutdown. And exit code 1 is the qrexec connection being terminated (because of the domain shutdown) without sending any exit code first.
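For reference, 143 is consistent with the usual shell convention for signal-terminated processes: 128 plus the signal number, so SIGTERM (signal 15) yields 143. A quick illustrative check:

import signal

# Exit code convention for signal-terminated processes: 128 + signal number.
# SIGTERM is signal 15, so a SIGTERM'd process reports exit code 143.
assert 128 + int(signal.SIGTERM) == 143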

Member Author

That explanation makes sense to me, thanks for jumping in. :)

return _wait_for_is_running(vm, False)


def _wait_for_is_running(vm, expected, timeout=60, interval=0.2):
Contributor

Do we expect to reuse this polling method with other VM states? If so, we should consider using the get_power_state() function of a domain in the Qubes API, which would give us more standard Qubes terminology for power states.

qubes.domains["sys-net"].get_power_state()
Running

Member Author

I'm going to try making the _wait_for function generic, with this signature:

_wait_for(vm, condition, timeout=60, interval=0.2)

That way we can pass in any lambda we want. I personally prefer using a simple Boolean check for is_running but this gives us the flexibility to use different checks for different purposes.
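A minimal sketch of what such a generic helper and its callers could look like (illustrative only; the names, logging calls, and messages are assumptions, not the actual diff):

import logging
import time

sdlog = logging.getLogger(__name__)  # stand-in for the module's existing logger

def _wait_for(vm, condition, timeout=60, interval=0.2):
    """Poll condition(vm) until it returns True, or give up after timeout seconds."""
    start_time = time.time()
    stop_time = start_time + timeout
    while time.time() < stop_time:
        elapsed = time.time() - start_time
        if condition(vm):
            sdlog.info("VM '{}' reached the expected state after {:.2f} seconds".format(vm, elapsed))
            return True
        time.sleep(interval)
    sdlog.error("VM '{}' did not reach the expected state within {} seconds".format(vm, timeout))
    return False

# Hypothetical callers (qubes being the qubesadmin.Qubes() handle used elsewhere):
#   _wait_for(vm, lambda vm: not qubes.domains[vm].is_running())
#   _wait_for(vm, lambda vm: qubes.domains[vm].get_power_state() == "Halted")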

Member Author

This is now done in 111ccb0; I haven't tested it in Qubes yet. 111ccb0#diff-ad9e526e0e4fea47e3196b3e293a4d50R508 shows an example invocation.

    - False if it did not
    """
    start_time = time.time()
    stop_time = start_time + timeout
Contributor

Have you tested several values for interval here?

Member Author

No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.

    while time.time() < stop_time:
        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
Contributor

It seems like using the get_power_state value as described above might make more sense here, though it's functionally equivalent.

Member Author

(See above)


    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
Contributor

We should log something when qubes is None, and probably test these lines.

Member Author

If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch. The issues I see:

  • handling the error, but only if we're on Qubes (platform.linux_distribution is deprecated; platform.uname seems our best bet, though it really only tells us that we're on a host called dom0);
  • ensuring we don't log to disk when the file is imported as part of the test suite;
  • cleanly unit testing any added code without overcomplicating things.
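One possible shape, purely as a sketch (the dom0 hostname check via platform.uname and the helper name are assumptions, not a settled approach):

import logging
import platform

def _running_in_dom0():
    # Heuristic: on a Qubes system, the admin VM's hostname is "dom0".
    return platform.uname().node == "dom0"

try:
    import qubesadmin
    qubes = qubesadmin.Qubes()
except ImportError:
    qubes = None
    if _running_in_dom0():
        # Only treat the missing module as an error where it matters, so that
        # importing this file from the test suite (outside Qubes) stays quiet.
        logging.getLogger(__name__).error("qubesadmin could not be imported in dom0")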

        state = qubes.domains[vm].is_running()
        elapsed = time.time() - start_time
        if state == expected:
            sdlog.info(
Contributor

For 5 runs, I get the following timings (in seconds):

  • sys-firewall: [17.94, 27.42, 17.94, 27.45, 17.50]
  • sys-net: [3.38, 3.25, 3.22, 0.00, 0.00]

Do you have any insight into the variance for sys-firewall? Have you seen a value of 0.00 for sys-net before?

Member Author

@eloquence eloquence Apr 15, 2020

If the API run command completes after the is_running state has already reached False, no polling will be required, and the elapsed time will be 0.00 (which is the best case). I've seen fairly inconsistent behavior when using qvm-run; sometimes it exits immediately, sometimes it aligns nicely with the shutdown sequence. I think we're at the mercy of the scheduler here. Fortunately, all poweroff invocations I've tried have been successful.

@eloquence eloquence marked this pull request as ready for review April 16, 2020 00:27
@eloquence
Member Author

Ran this version in Qubes a few times without errors. Like @emkll I'm noticing that polling is often not required after issuing a poweroff (it'll show polling for 0.00 seconds in the log), but it's intended to save our bacon in the cases when it is.

@eloquence
Member Author

> it's intended to save our bacon in the cases when it is.

Except it maybe doesn't, which raises the question of whether the added complexity is worth it. See the new report in #498 (comment). As a good practice, it's probably still best to avoid qvm-kill, but unfortunately it doesn't quite do the trick yet (unless there's something wrong with the logic).

@eloquence
Member Author

Per further investigation of #498, it now seems clear that at least some instances of the problem are not caused by our use of qvm-kill, but by a VM (in our observations, sd-whonix) starting and then crashing without notice to the user, possibly due to memory management issues.

I would still recommend proceeding with the removal of qvm-kill, as killing VMs may exacerbate such shutdown and startup issues if a VM is not properly shut down before it is brought up again. Provided we have no other use for it, we should be able to remove this added polling complexity once Qubes adds a --force parameter to qvm-shutdown, which is already in the master branch for 4.1 per #498 (comment).
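If and when that lands, the polling logic could presumably be replaced with a plain forced shutdown, e.g. something along the lines of qvm-shutdown --wait --force <vm> (the exact invocation is an assumption and will depend on the Qubes release the flag ships in).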

@eloquence
Member Author

Since this does not in fact fully resolve #498, we've agreed it can wait until after the next RPM release (0.3.0). Deferring from current sprint (4/22-5/6) to near-term backlog.

@eloquence
Member Author

eloquence commented May 5, 2020

@conorsch @emkll Judging by the latest reports on #498, it now seems likely that the qmemman-related fixes could resolve the issue. In addition, a --force parameter for qvm-shutdown is expected to ship with Qubes 4.1 per Marek's comment here.

I'm curious how you both feel about the value of this change, with this in mind. I would suggest that we wait for the qmemman fixes to land, re-test, and if we don't observe #498 after that, close this PR and wait for the --force argument to land. Unless you feel that killing VMs is just generally something to avoid ASAP for other reasons, in which case we can add this PR to the next sprint.

@eloquence
Member Author

@marmarek We're eager to stop using qvm-kill, but would also prefer to avoid introducing the additional complexity of this polling logic.

You had mentioned here that you could look into backporting the --force arg for qvm-shutdown to 4.0; do y'all have bandwidth to do so in the next few weeks? If not we can bite the bullet and live with this polling solution for a while.

@marmarek

marmarek commented May 7, 2020

Yes, can do this week.

@eloquence
Member Author

Thanks a lot @marmarek for poking at this.

Leaving a note here that we'll also need to use this --force argument for sys-whonix, in case it is currently in use as a NetVM by the default Tor AppVM that ships with Qubes (anon-whonix).

@eloquence
Member Author

I'm going to close this for now until we determine we need it. Would appreciate keeping the branch on the remote for the time being.

@eloquence eloquence closed this Jun 26, 2020