In the ongoing saga to reproduce #1130, I ran into a new failure:
ATTEMPT 1037
running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED
failures:
---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Failed to notify Nexus about new Dataset: Communication Error: error sending request for url (http://127.0.0.1:39312/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db): operation timed out', /home/dap/omicron/sled-agent/src/sim/storage.rs:240:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failures:
integration_tests::disks::test_disk_create_disk_that_already_exists_fails
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 23.85s
We see Nexus receiving the dataset upsert request, then 30 seconds later the client times out and the test fails (presumably because of the client error that panics). I don't have a real stack, but it looks like at the point we panicked, we'd have gotten there via omicron_sled_agent::sim::SledAgent::create_crucible_dataset via nexus_test_utils::resource_helpers::DiskTest::new.
What caused this? I've run into a lot of problems that cause CockroachDB to exit, so in principle it could be any of those. But I don't see any evidence of that in the CockroachDB log files or stdout or stderr files. And we don't have anything else in the test log file. I don't know what caused the timeout, but I imagine we lost most of the evidence when we tore down the test (because we would have torn down Nexus, killed CockroachDB, etc).
What could we do to know more next time? Some thoughts:
If we did dump core on panic rather than unwind and clean up, we'd have the Nexus in-memory state, which would at least tell us whether Nexus was connected to CockroachDB and what else Nexus was doing while handling this request. I think we've run into various issues dumping core on panic though -- it seems to be generally less debuggable for most test failures.
We could have some mechanism that times out very long-running tests. When doing so, it could look at test-related processes and record ptree output, pfiles output, core files, etc. We'd probably have to raise the client timeout so that we hit that kind of test timeout before the client timeout.
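To make the second idea concrete, here is a rough sketch of what such a test watchdog might look like, using only the standard library. Everything named here (run_with_watchdog, WatchdogOutcome, collect_diagnostics) is hypothetical and does not exist in omicron today. It also assumes ptree and pfiles are on the PATH, which is only true on illumos-like systems; elsewhere the failure to run them is simply recorded.

```rust
use std::process::Command;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical result of running a test body under the watchdog.
#[derive(Debug, PartialEq)]
pub enum WatchdogOutcome<T> {
    /// The body returned before the limit expired.
    Finished(T),
    /// The limit expired first; diagnostics were collected.
    TimedOut,
    /// The body panicked before sending a result.
    Panicked,
}

/// Runs `body` on its own thread and waits up to `limit` for it.
/// If the limit expires, `on_timeout` runs while the test's processes
/// are still alive, which is the point: nothing has been torn down yet.
/// Note that the body thread is left running, since Rust cannot kill it.
pub fn run_with_watchdog<T, F, D>(limit: Duration, body: F, on_timeout: D) -> WatchdogOutcome<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
    D: FnOnce(),
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the watchdog already gave up, the send fails; ignore that.
        let _ = tx.send(body());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => WatchdogOutcome::Finished(value),
        Err(RecvTimeoutError::Timeout) => {
            on_timeout();
            WatchdogOutcome::TimedOut
        }
        // The sender was dropped without sending: the body panicked.
        Err(RecvTimeoutError::Disconnected) => WatchdogOutcome::Panicked,
    }
}

/// Best-effort snapshot of a process using ptree and pfiles (assumed
/// to be available). Returns one (tool, report) pair per tool; if a
/// tool cannot be run, the report holds the error instead.
pub fn collect_diagnostics(pid: u32) -> Vec<(String, String)> {
    ["ptree", "pfiles"]
        .iter()
        .map(|tool| {
            let report = match Command::new(*tool).arg(pid.to_string()).output() {
                Ok(out) => String::from_utf8_lossy(&out.stdout).into_owned(),
                Err(err) => format!("could not run {}: {}", tool, err),
            };
            (tool.to_string(), report)
        })
        .collect()
}
```

A test harness could wrap each test body in run_with_watchdog with a limit shorter than the client timeout, and have the on_timeout closure write the collect_diagnostics output for the test process (and any child PIDs it knows about) next to the test log. Generating core files would need something more, such as gcore, which this sketch leaves out.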
In case someone ever wants to dig deeper into this specific failure, I've attached issue-1248.tgz.zip with the contents of the test-related tmp directories.
davepacheco changed the title from "client timeout in tests is undebuggable" to "client timeouts in tests are undebuggable" on Jun 22, 2022
jordanhendricks added the Debugging label (for when you want better data in debugging an issue: log messages, post mortem debugging, and more) on Aug 11, 2023