add "live tests" #6427

davepacheco · 2024-08-23T22:03:43Z

(some of this text is copied from the new README)

This PR adds a new Omicron package called omicron-live-tests to contain automated tests that operate in the context of an already-deployed "real" Oxide system (e.g., a4x2 or our london or madrid test environments). The motivation is to provide a home for automated tests for all kinds of Reconfigurator behavior (e.g., add/expunge of all zones, add/expunge sled, upgrades, etc.), though it could be used for non-Reconfigurator behavior, too.

What makes these tests different from the rest of the test suite is that they require connectivity to the underlay network of the deployed system, they make API calls to various components in that system, and they assume that the system will behave like a real production system. By contrast, the normal tests set up a bunch of components using simulated sled agents and localhost networking. That is great for starting from a predictable state and running tests in parallel, but the simulated sled agents and networking make it impossible to exercise quite a lot of Reconfigurator's functionality.

There are some safeguards so that these tests won't run on production systems: they refuse to run if they find any Oxide-hardware sleds in the system whose serial numbers don't correspond to known test environments.
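
For illustration only, here is a minimal sketch of the kind of safeguard described above. The `Sled` and `SledKind` types, the `check_not_production` name, and the serial strings are all hypothetical; the real check works against the datastore rather than a slice of records.

```rust
/// Hypothetical, simplified classification of a sled for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum SledKind {
    /// Real Oxide hardware.
    OxideHardware,
    /// Anything else (e.g., a virtual machine in a test setup).
    Other,
}

/// Hypothetical, simplified sled record.
#[derive(Debug, Clone)]
struct Sled {
    serial: String,
    kind: SledKind,
}

/// Returns an error naming the first Oxide-hardware sled whose serial is not
/// in `allowed`. Sleds that are not Oxide hardware are ignored.
fn check_not_production(sleds: &[Sled], allowed: &[&str]) -> Result<(), String> {
    for sled in sleds {
        if sled.kind == SledKind::OxideHardware
            && !allowed.contains(&sled.serial.as_str())
        {
            return Err(format!(
                "refusing to run: sled {:?} is not in a known test environment",
                sled.serial
            ));
        }
    }
    Ok(())
}
```

The point of failing closed like this is that a single unrecognized hardware serial is enough to stop the whole run before any API call is made.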

This is similar to end-to-end-tests in a lot of ways but I didn't think it made sense to use the same package because the test environment is pretty different. The end-to-end-tests run on a Helios system on which we have deployed Sled Agent and a bunch of components in zones. But there aren't multiple sleds and there isn't a real underlay network. I don't think it's faithful enough to carry out a lot of the Reconfigurator tests that we'd like to do.

Like the end-to-end-tests, this package is not built or tested by default because the tests generally can't work in a dev environment and there's no way to have cargo build and check them but not run the tests by default.

Eventually I hope we can find a way to run these tests automatically in CI, but that's future work. For now, there are instructions for running these by hand on an Omicron system. Start by running cargo xtask live-tests to build an archive and then follow the instructions:

$ cargo xtask live-tests
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.96s
     Running `target/debug/xtask live-tests`
using temporary directory: /dangerzone/omicron_tmp/.tmp0ItZUD
will create archive file:  /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst
output tarball:            /home/dap/omicron-work/target/live-tests-archive.tgz

running: /home/dap/.rustup/toolchains/1.80.1-x86_64-unknown-illumos/bin/cargo "nextest" "archive" "--package" "omicron-live-tests" "--archive-file" "/dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst"
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.89s
info: experimental features enabled: setup-scripts
   Archiving 1 binary, 1 build script output directory, and 1 linked path to /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst
    Archived 35 files to /dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive/omicron-live-tests.tar.zst in 0.31s
running: bash "-c" "tar cf - Cargo.toml .config/nextest.toml live-tests | tar xf - -C \"/dangerzone/omicron_tmp/.tmp0ItZUD/live-tests-archive\""
running: tar "cf" "/home/dap/omicron-work/target/live-tests-archive.tgz" "-C" "/dangerzone/omicron_tmp/.tmp0ItZUD" "live-tests-archive"
created: /home/dap/omicron-work/target/live-tests-archive.tgz

To use this:

1. Copy the tarball to the switch zone in a deployed Omicron system.

     e.g., scp \
              /home/dap/omicron-work/target/live-tests-archive.tgz \
              root@YOUR_SCRIMLET_GZ_IP:/zone/oxz_switch/root/root

2. Copy the `cargo-nextest` binary to the same place.

     e.g., scp \
              $(which cargo-nextest) \
              root@YOUR_SCRIMLET_GZ_IP:/zone/oxz_switch/root/root

3. On that system, unpack the tarball with:

     tar xzf live-tests-archive.tgz

4. On that system, run tests with:

     TMPDIR=/var/tmp ./cargo-nextest nextest run \
         --archive-file live-tests-archive/omicron-live-tests.tar.zst \
         --workspace-remap live-tests-archive

Follow the instructions, run the tests, and you'll see the usual nextest-style output:

root@oxz_switch:~# TMPDIR=/var/tmp ./cargo-nextest nextest run          --archive-file live-tests-archive/omicron-live-tests.tar.zst          --workspace-remap live-tests-archive
  Extracting 1 binary, 1 build script output directory, and 1 linked path to /var/tmp/nextest-archive-Lqx9VZ
   Extracted 35 files to /var/tmp/nextest-archive-Lqx9VZ in 1.01s
info: experimental features enabled: setup-scripts
    Starting 1 test across 1 binary (run ID: a5fc9163-9dd5-4b23-b89f-55f8f39ebbbc, nextest profile: default)
        SLOW [> 60.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        PASS [  61.975s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
------------
     Summary [  61.983s] 1 test run: 1 passed (1 slow), 0 skipped
root@oxz_switch:~#

jgallagher

Looks great! Just a few questions

jgallagher · 2024-08-26T19:31:58Z

live-tests/macros/src/lib.rs

+        {
+            #input_func
+
+            let ctx = crate::common::LiveTestContext::new(


Just confirming - the use of crate:: here means this macro is only useful within the live-tests crate, right? (Or technically some other crate that defines this same module / type / method, but ignoring that pathology...) I'd maybe note that in the docstring.

Rewrote this comment in c7cd0ed.

jgallagher · 2024-08-26T19:34:43Z

live-tests/macros/src/lib.rs

+///
+/// We use this instead of implementing Drop on LiveTestContext because we want
+/// the teardown to only happen when the test doesn't fail (which causes a panic
+/// and unwind).


How do you feel about an API like

    LiveTestContext::with_context("test_name", |lc: &mut LiveTestContext| {
        // ... my test ...
    })

which gives the same opportunity to clean up only on success without needing a proc macro? It's certainly a little uglier and causes a level of indentation. I'd live with that but maybe I dislike proc macros more than most.
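
To make the suggestion concrete, here is a minimal synchronous sketch of a closure-based API that cleans up only on success. Everything in it is an assumption for illustration: the fields of `LiveTestContext`, the `cleanup_successful` name, and the `CLEANUPS` counter (which exists only so the effect can be observed). The real tests are async, which this sketch ignores.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts cleanups so the behavior is observable; not part of any real API.
static CLEANUPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the real context type.
struct LiveTestContext {
    test_name: String,
}

impl LiveTestContext {
    fn new(test_name: &str) -> Self {
        LiveTestContext { test_name: test_name.to_string() }
    }

    fn test_name(&self) -> &str {
        &self.test_name
    }

    /// Teardown that should only happen when the test body did not panic.
    fn cleanup_successful(self) {
        CLEANUPS.fetch_add(1, Ordering::SeqCst);
    }

    /// Runs `body` with a fresh context. If `body` panics, the panic unwinds
    /// through this function and the cleanup call below is never reached.
    fn with_context<F>(test_name: &str, body: F)
    where
        F: FnOnce(&mut LiveTestContext),
    {
        let mut lc = LiveTestContext::new(test_name);
        body(&mut lc);
        lc.cleanup_successful();
    }
}
```

Because the cleanup is an ordinary call after `body` returns, rather than a `Drop` impl, a failing test leaves its state in place for debugging.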

I don't have a strong feeling either way but I built this to parallel nexus_test and I think there's some benefit to consistency.

Hah, that's fair; a big part of my comment was because I don't love nexus_test because it's hard to find exactly what it's doing (IMO).

I went down this path in a follow-on branch:
https://github.com/oxidecomputer/omicron/compare/dap/experiments/live-tests-no-macro

A few notes:

- I wanted to avoid type parameters in LiveTestContext::run_test because I didn't want it monomorphized for every test that we add. So I had run_test accept a &dyn Fn instead of F: Fn and that closure returns a BoxFuture.
- rustfmt seems to have lost its mind on the code and it looks awful. I have spent no time trying to figure out why.
- It doesn't quite compile due to a lifetime issue. I gave up (not at all insurmountable, I was just spending more time on it than I had hoped to for just an experiment).
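
As a rough sketch of the `&dyn Fn` plus boxed-future shape described in these notes (not the code from the experiment branch): `BoxFuture` is defined locally with the same shape as the alias in the `futures` crate, `block_on` is a toy executor standing in for tokio, and handing the context over as an owned `Arc` is an assumption made here to sidestep borrowing across the returned future.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Local alias with the same shape as `futures::future::BoxFuture`, defined
/// here so the sketch needs only the standard library.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Hypothetical stand-in for the real context type.
struct LiveTestContext {
    test_name: String,
}

/// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy busy-polling executor so the sketch runs without tokio.
fn block_on<T>(mut fut: BoxFuture<'_, T>) -> T {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}

/// Not generic over the closure type, so every test shares this one compiled
/// function instead of getting its own monomorphized copy.
fn run_test(
    test_name: &str,
    body: &dyn Fn(Arc<LiveTestContext>) -> BoxFuture<'static, Result<(), String>>,
) -> Result<(), String> {
    let lc = Arc::new(LiveTestContext { test_name: test_name.to_string() });
    block_on(body(Arc::clone(&lc)))?;
    // Cleanup that should happen only on success would go here.
    Ok(())
}
```

A caller passes something like a reference to a closure that returns `Box::pin(async move { ... })`; the extra boxing and nesting is part of what makes this form noisier than the macro.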

There are a couple of problems:

- the rustfmt problem, but I'll assume that is surmountable
- it's much easier to copy/paste the code and get the wrong test_name in the argument
- #[tokio::test], and therefore #[live_test], allow your test function to return either nothing at all or a Result type. I had this version return a Result. It would be tricky to support both here. (But most of our tests probably just return () and we could probably just have it support that.)

In summary, I don't like it better but I'm still open to it if people really do!
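
One possible way to accept both return types, sketched under assumptions: the `TestOutcome` trait and `run_body` function below are hypothetical names, not anything in the PR. Note that this reintroduces a type parameter, which is in tension with the goal of avoiding monomorphization mentioned above.

```rust
use std::fmt::Display;

/// Hypothetical trait: anything a test body is allowed to return.
trait TestOutcome {
    fn into_result(self) -> Result<(), String>;
}

/// A body that returns nothing succeeded if it returned at all.
impl TestOutcome for () {
    fn into_result(self) -> Result<(), String> {
        Ok(())
    }
}

/// A body that returns a Result reports its error as a string.
impl<E: Display> TestOutcome for Result<(), E> {
    fn into_result(self) -> Result<(), String> {
        self.map_err(|e| e.to_string())
    }
}

/// Runs a test body and normalizes whatever it returned.
fn run_body<T: TestOutcome>(body: impl FnOnce() -> T) -> Result<(), String> {
    body().into_result()
}
```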

Oh thanks! Looking at 3f43f78:

- The rustfmt issue is because of the additional nesting involved + our shorter-than-usual 80 char limit
- I think monomorphizing that is okay? The function looks pretty small to me, and the dependent functions aren't generic.
- Passing in borrowed data to closures is difficult. async closures solve this but aren't stable yet. I ran into this with the update engine, and that has to pass an owned context in for that reason.

jgallagher · 2024-08-26T19:51:05Z

live-tests/tests/common/mod.rs

+    opctx: &OpContext,
+    datastore: &DataStore,
+) -> Result<(), anyhow::Error> {
+    const ALLOWED_GIMLET_SERIALS: &[&str] = &[


Would it make sense to take extra allowed serials as input at runtime, maybe via an env var? It would be annoying to get this all built and transferred over to madrid/london, then fail because one of the gimlets had been swapped out for a new one.

It could. For what it's worth, when I touch this file and rebuild, it takes 13.5s. You do have the extra copying step(s) that are also annoying.
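
A sketch of what runtime input could look like, with loud caveats: the environment variable name `LIVE_TESTS_EXTRA_SERIALS`, the function names, and the placeholder serials are all made up for illustration and are not part of this PR.

```rust
use std::collections::BTreeSet;

/// Hypothetical built-in allowlist; these serials are placeholders.
const BUILTIN_ALLOWED_SERIALS: &[&str] = &["TEST0001", "TEST0002"];

/// Combines the built-in list with a comma-separated list of extras.
/// Whitespace around entries is trimmed and empty entries are skipped.
fn allowed_serials(extra: Option<&str>) -> BTreeSet<String> {
    let mut allowed: BTreeSet<String> =
        BUILTIN_ALLOWED_SERIALS.iter().map(|s| s.to_string()).collect();
    if let Some(extra) = extra {
        allowed.extend(
            extra
                .split(',')
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .map(String::from),
        );
    }
    allowed
}

/// Reads the extras from the (hypothetical) environment variable, if set.
fn allowed_serials_from_env() -> BTreeSet<String> {
    let extra = std::env::var("LIVE_TESTS_EXTRA_SERIALS").ok();
    allowed_serials(extra.as_deref())
}
```

The tradeoff is that an override like this also weakens the safeguard, since anyone running the tests can allow an arbitrary serial.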

jgallagher · 2024-08-26T20:05:40Z

live-tests/tests/test_nexus_add_remove.rs

+//
+// - that after adding:
+//   - the new Nexus appears in external DNS
+//   - we can _use_ the new Nexus from the outside


Can we connect to an external IP from within the switch zone? Agreed on the other points (checking external DNS before and after).

Possibly not today. I suspect figuring this out will be worth the investment. Maybe we can use the techport interface for the external API?

We could, but I'm not sure it gains us much. If we're in the switch zone we can either connect to the TCP proxy wicketd runs or directly to the "Nexus external API exposed on the underlay" that wicketd forwards to. But in either case, it wouldn't really confirm external connectivity, because it's not actually using the external IP. It does confirm the new Nexus would be reachable via the techport, which is something though.

sunshowers

This is so cool, thanks for doing it!

sunshowers · 2024-08-26T20:25:54Z

.config/nextest.toml

+# While most Omicron tests operate with their own simulated control plane, the
+# live-tests operate on a more realistic, shared control plane and test
+# behaviors that conflict with each other.  They need to be run serially.
+live-tests = { max-threads = 1 }


Makes sense!

As mentioned in the DM, you can now filter out omicron-live-tests from the default set. (You'll want to bump the required and recommended versions in this file, as well as the version used in CI, to 0.9.76)

This is cool. I'd prefer we do that in a separate PR.

dev-tools/xtask/src/live_tests.rs

sunshowers · 2024-08-26T20:29:44Z

live-tests/tests/common/mod.rs

+                    "WARNING: temporary directory appears to be under /tmp, \
+                     which is generally tmpfs.  Consider setting \
+                     TMPDIR=/var/tmp to avoid runaway tests using too much \
+                     memory and swap."


Should we check for a minimum amount of free space required for TMPDIR?

I don't think so. Management of tmp space seems outside the scope of all this. I almost didn't bother with this warning at all except that the failure mode of running out of space in /tmp (i.e., swap) is usually pretty bad for the whole zone, and that could be pretty annoying to clean up in the switch zone. That said, I'm barely in favor of even this warning -- I could be convinced to remove it altogether.

Also: there are so many ways I've seen that kind of check be wrong, in part because it's really hard to know how much physical disk space something is going to need before it writes it. Example: false positive errors when using zfs with compression=on because a thing thinks it needs a GiB even though after compression it'll only use 150 MiB of space. Plus other things can use space in the meantime or space can free up in the meantime.
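
For reference, the warn-only check under discussion can be sketched roughly as follows. The function names are hypothetical and the real code may use different path types; the one detail worth showing is that `Path::starts_with` compares whole path components.

```rust
use std::path::Path;

/// Returns true if `dir` is `/tmp` or somewhere beneath it. Because
/// `Path::starts_with` compares whole components, `/tmpfoo` does not count.
fn is_under_tmp(dir: &Path) -> bool {
    dir.starts_with("/tmp")
}

/// Warns (without failing) when the temporary directory Rust would use looks
/// like it is under /tmp. Returns whether the warning was printed.
fn warn_if_tmpdir_is_tmpfs() -> bool {
    let dir = std::env::temp_dir();
    let risky = is_under_tmp(&dir);
    if risky {
        eprintln!(
            "WARNING: temporary directory {} appears to be under /tmp; \
             consider setting TMPDIR=/var/tmp",
            dir.display()
        );
    }
    risky
}
```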

sunshowers · 2024-08-26T20:30:22Z

live-tests/tests/common/mod.rs

+    // We could also just go ahead and use /var/tmp, but it's not clear we can
+    // reliably do that at this point (if Rust or other components have cached
+    // TMPDIR) and it would be hard to override.


Interesting, you could use a setup script to set TMPDIR outside of the test process.

Does the test process inherit the environment of the setup script?

Yes, if you write it out to $NEXTEST_ENV: https://nexte.st/docs/configuration/setup-scripts/#environment-variables

(This is how the crdb-seed setup script works)
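
If the setup script were written in Rust, the mechanism might look like the sketch below. It assumes only what the linked docs describe (nextest names a file in `NEXTEST_ENV` and reads `KEY=VALUE` lines from it); the `append_env` and `export_tmpdir` names are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends one `KEY=VALUE` line to the file at `env_file`, creating the file
/// if it does not exist yet.
fn append_env(env_file: &Path, key: &str, value: &str) -> io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(env_file)?;
    writeln!(file, "{key}={value}")
}

/// What a setup script might do: find the file named by NEXTEST_ENV and
/// record TMPDIR there for the test processes that follow.
fn export_tmpdir() -> io::Result<()> {
    let env_file = std::env::var_os("NEXTEST_ENV").ok_or_else(|| {
        io::Error::new(io::ErrorKind::NotFound, "NEXTEST_ENV is not set")
    })?;
    append_env(Path::new(&env_file), "TMPDIR", "/var/tmp")
}
```

Setting the variable this way happens before any test process starts, so it avoids the concern about components having already cached TMPDIR.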

davepacheco · 2024-08-27T22:42:38Z

I was hoping to land this at a point where the live tests built with this branch passed against a deployment built from the tip of this branch. But I can't keep up with "main" today -- it's moving faster than I can iterate. I have run the live tests built with ab5448b (current tip of this PR) against e85d012 from this branch, which is approximately main at 758818a. Although that's only about 6 hours old, it's 8 commits behind.

davepacheco · 2024-08-27T23:32:45Z

Okay, from the tip of this branch, I merged locally with main @ 3dcf1e766c4d44c56df5ce80f3180fbbea00de4a, deployed a system from the result, and ran the live-tests built from it. That worked, too.

davepacheco added 16 commits August 20, 2024 19:52

first cut

0cd7235

add CI that builds it

ec5fadb

avoid trying to run tests by default (does not seem like a great way)

4544934

bail out faster on non-illumos

8f1cc51

Bundle up working directory instead of using git archive (much smaller, more accurate tarball)

340fc22

check TMPDIR

7c2c50c

these tests need to run serially

210e4f6

commonize some code

365d13e

DataStore::new_failfast

4712076

commonize run_subcmd in xtask

a27d14d

write some docs

c4f9ee7

make it a macro

826b278

fix bugs, clean up

aed5ea2

test that the zone goes away

9d959db

live-test -> live-tests

f051594

add some docs

6dbf1ff

davepacheco requested review from jgallagher, andrewjstone and sunshowers August 23, 2024 22:03

davepacheco self-assigned this Aug 23, 2024

davepacheco marked this pull request as draft August 23, 2024 22:06

davepacheco added 3 commits August 23, 2024 15:12

fix clippy

cf8fb0e

fix workspace dep check

a9d0a32

fix hakari

80922f6

davepacheco marked this pull request as ready for review August 23, 2024 22:20

davepacheco added 3 commits August 23, 2024 15:23

Merge branch 'main' into dap/drafts/reconfigurator-tests

502d229

fix live-tests check (move this into the existing check that already has nextest)

9288eb0

cannot build that on non-illumos

9bbb273

jgallagher approved these changes Aug 26, 2024

sunshowers reviewed Aug 26, 2024

davepacheco added 5 commits August 26, 2024 14:03

review feedback

c7cd0ed

Merge branch 'main' into dap/drafts/reconfigurator-tests

78f99db

Merge branch 'main' into dap/drafts/reconfigurator-tests

e85d012

Merge branch 'main' into dap/drafts/reconfigurator-tests

251e665

fix semantic mismerge

ab5448b

rustfmt

665bcc1

davepacheco enabled auto-merge (squash) August 27, 2024 23:32

davepacheco merged commit 9f4ba06 into main Aug 28, 2024
24 checks passed

davepacheco deleted the dap/drafts/reconfigurator-tests branch August 28, 2024 01:06
