client timeouts in tests are undebuggable #1248

davepacheco · 2022-06-22T15:44:41Z

In the ongoing saga to reproduce #1130, I ran into a new failure:

ATTEMPT 1037

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Failed to notify Nexus about new Dataset: Communication Error: error sending request for url (http://127.0.0.1:39312/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db): operation timed out', /home/dap/omicron/sled-agent/src/sim/storage.rs:240:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 23.85s

Here's the log:

$ bunyan "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
[2022-06-22T00:38:45.707557405Z]  INFO: test_disk_create_disk_that_already_exists_fails/21965 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD
[2022-06-22T00:38:45.708156227Z]  INFO: test_disk_create_disk_that_already_exists_fails/21965 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD/data)
...
[2022-06-22T00:38:49.887917249Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: Created Simulated Crucible Server (kind=storage, server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, address=[::1]:62754)
[2022-06-22T00:38:49.888239995Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=PUT)
    uri: http://127.0.0.1:39312/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db
[2022-06-22T00:38:49.888568574Z] TRACE: 0615efa3-7af5-4ab9-b4dc-f32a93abf0c6/dropshot_internal/21965 on ivanova: incoming request (req_id=0243f5ac-0b1b-49f4-9106-f67f4c434318, method=PUT, remote_addr=127.0.0.1:61564, local_addr=127.0.0.1:39312)
    uri: /zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db
[2022-06-22T00:38:49.888920681Z]  INFO: 0615efa3-7af5-4ab9-b4dc-f32a93abf0c6/ServerContext/21965 on ivanova: upserting dataset (address=[::1]:62754, dataset_id=c35240aa-3451-41cf-99cb-739bfa0c06db, zpool_id=345f91d9-8131-4318-b16a-330eafae4445)
[2022-06-22T00:39:08.762905887Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(39312), path: "/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db", query: None, fragment: None }, source: TimedOut })

We see Nexus receiving the dataset upsert request, then 30 seconds later the client times out and the test fails (presumably because of the client error that panics. I don't have a real stack, but it looks like at the point we panicked, we'd have gotten there via omicron_sled_agent::sim::SledAgent::create_crucible_dataset via nexus_test_utils::resource_helpers::DiskTest::new.

What caused this? I've run into a lot of problems that cause CockroachDB to exit, so in principle it could be any of those. But I don't see any evidence of that in the CockroachDB log files or stdout or stderr files. And we don't have anything else in the test log file. I don't know what caused the timeout, but I imagine we lost most of the evidence when we tore down the test (because we would have torn down Nexus, killed CockroachDB, etc).

What could we do to know more next time? Some thoughts:

If we did dump core on panic rather than unwind and clean up, we'd have the Nexus in-memory state, which would at least tell us whether Nexus was connected to CockroachDB and what else Nexus was doing while handling this request. I think we've run into various issues dumping core on panic though -- it seems to be generally less debuggable for most test failures.
We could have some mechanism that times out very long-running tests. When doing so, it could look at test-related processes and record ptree output, pfiles output, core files, etc. We'd probably have to tune up the client timeout here so that we hit that kind of test timeout instead of the client timeout here.

The text was updated successfully, but these errors were encountered:

davepacheco · 2022-06-22T15:46:48Z

In case someone ever wants to dig deeper into this specific failure, I've attached issue-1248.tgz.zip with the contents of the test-related tmp directories.

davepacheco changed the title ~~client timeout in tests is undebuggable~~ client timeouts in tests are undebuggable Jun 22, 2022

jordanhendricks added the Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) label Aug 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

client timeouts in tests are undebuggable #1248

client timeouts in tests are undebuggable #1248

davepacheco commented Jun 22, 2022

davepacheco commented Jun 22, 2022

client timeouts in tests are undebuggable #1248

client timeouts in tests are undebuggable #1248

Comments

davepacheco commented Jun 22, 2022

davepacheco commented Jun 22, 2022