Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client timeouts in tests are undebuggable #1248

Open
davepacheco opened this issue Jun 22, 2022 · 1 comment
Open

client timeouts in tests are undebuggable #1248

davepacheco opened this issue Jun 22, 2022 · 1 comment
Labels
Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more)

Comments

@davepacheco
Copy link
Collaborator

In the ongoing saga to reproduce #1130, I ran into a new failure:

ATTEMPT 1037

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'Failed to notify Nexus about new Dataset: Communication Error: error sending request for url (http://127.0.0.1:39312/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db): operation timed out', /home/dap/omicron/sled-agent/src/sim/storage.rs:240:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 23.85s

Here's the log:

$ bunyan "/dangerzone/omicron_tmp/try_repro.27391/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.21965.0.log"
[2022-06-22T00:38:45.707557405Z]  INFO: test_disk_create_disk_that_already_exists_fails/21965 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD
[2022-06-22T00:38:45.708156227Z]  INFO: test_disk_create_disk_that_already_exists_fails/21965 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.27391/.tmp6sZvjD/data)
...
[2022-06-22T00:38:49.887917249Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: Created Simulated Crucible Server (kind=storage, server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, address=[::1]:62754)
[2022-06-22T00:38:49.888239995Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=PUT)
    uri: http://127.0.0.1:39312/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db
[2022-06-22T00:38:49.888568574Z] TRACE: 0615efa3-7af5-4ab9-b4dc-f32a93abf0c6/dropshot_internal/21965 on ivanova: incoming request (req_id=0243f5ac-0b1b-49f4-9106-f67f4c434318, method=PUT, remote_addr=127.0.0.1:61564, local_addr=127.0.0.1:39312)
    uri: /zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db
[2022-06-22T00:38:49.888920681Z]  INFO: 0615efa3-7af5-4ab9-b4dc-f32a93abf0c6/ServerContext/21965 on ivanova: upserting dataset (address=[::1]:62754, dataset_id=c35240aa-3451-41cf-99cb-739bfa0c06db, zpool_id=345f91d9-8131-4318-b16a-330eafae4445)
[2022-06-22T00:39:08.762905887Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/21965 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(39312), path: "/zpools/345f91d9-8131-4318-b16a-330eafae4445/dataset/c35240aa-3451-41cf-99cb-739bfa0c06db", query: None, fragment: None }, source: TimedOut })

We see Nexus receiving the dataset upsert request, then 30 seconds later the client times out and the test fails (presumably because of the client error that panics. I don't have a real stack, but it looks like at the point we panicked, we'd have gotten there via omicron_sled_agent::sim::SledAgent::create_crucible_dataset via nexus_test_utils::resource_helpers::DiskTest::new.

What caused this? I've run into a lot of problems that cause CockroachDB to exit, so in principle it could be any of those. But I don't see any evidence of that in the CockroachDB log files or stdout or stderr files. And we don't have anything else in the test log file. I don't know what caused the timeout, but I imagine we lost most of the evidence when we tore down the test (because we would have torn down Nexus, killed CockroachDB, etc).

What could we do to know more next time? Some thoughts:

  • If we did dump core on panic rather than unwind and clean up, we'd have the Nexus in-memory state, which would at least tell us whether Nexus was connected to CockroachDB and what else Nexus was doing while handling this request. I think we've run into various issues dumping core on panic though -- it seems to be generally less debuggable for most test failures.
  • We could have some mechanism that times out very long-running tests. When doing so, it could look at test-related processes and record ptree output, pfiles output, core files, etc. We'd probably have to tune up the client timeout here so that we hit that kind of test timeout instead of the client timeout here.
@davepacheco
Copy link
Collaborator Author

In case someone ever wants to dig deeper into this specific failure, I've attached issue-1248.tgz.zip with the contents of the test-related tmp directories.

@davepacheco davepacheco changed the title client timeout in tests is undebuggable client timeouts in tests are undebuggable Jun 22, 2022
@jordanhendricks jordanhendricks added the Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more) label Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Debugging For when you want better data in debugging an issue (log messages, post mortem debugging, and more)
Projects
None yet
Development

No branches or pull requests

2 participants