
Make server shutdown wait on any Detached handler futures #702

Merged: jgallagher merged 1 commit into main from detached-graceful-shutdown on Jun 15, 2023

Conversation

jgallagher (Contributor)

I ended up using a WaitGroup to track outstanding handler futures instead of trying to accumulate them into something like a FuturesUnordered or JoinSet, because we don't actually care about their results: we just want to know when they're all done.
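For reference, the basic pattern looks roughly like this (a minimal sketch against the waitgroup crate's worker()/wait() API as used in this PR; the surrounding code is illustrative, not dropshot's):

```rust
use waitgroup::WaitGroup;

#[tokio::main]
async fn main() {
    let handler_waitgroup = WaitGroup::new();

    for i in 0..3 {
        // Each in-flight handler future holds a Worker; the Worker is
        // dropped when the future completes (or is dropped).
        let worker = handler_waitgroup.worker();
        tokio::spawn(async move {
            let _worker = worker;
            // ... handler body would run here ...
            println!("handler {i} finished");
        });
    }

    // Graceful shutdown: we don't care about the handlers' results, only
    // that every Worker has been dropped.
    handler_waitgroup.wait().await;
}
```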

Testing this was a little weird: we have to force a situation where a detached handler future is still running after server shutdown has been requested. It uses a similar pattern to the tests added in #701, where the handler waits for a message on a channel from the test driver before returning, allowing us to construct a client request and then cancel it, leaving the detached handler still running.
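Roughly, the shape of that test (a self-contained stand-in, not the actual test code: a spawned task holding a waitgroup Worker plays the detached handler, and waiting on the waitgroup plays server shutdown; names like release_endpoint_tx mirror the snippets below):

```rust
use std::time::Duration;
use waitgroup::WaitGroup;

#[tokio::main]
async fn main() {
    // The stand-in handler parks on this channel until the test releases it.
    let (release_endpoint_tx, mut release_endpoint_rx) =
        tokio::sync::mpsc::channel::<()>(1);

    let wg = WaitGroup::new();
    let worker = wg.worker();

    // Stand-in for the detached handler: still running after the client
    // request has been cancelled.
    tokio::spawn(async move {
        let _worker = worker;
        release_endpoint_rx.recv().await;
    });

    // Stand-in for server shutdown: it must not finish while the handler runs.
    let mut teardown_fut = std::pin::pin!(wg.wait());
    if tokio::time::timeout(Duration::from_millis(100), &mut teardown_fut)
        .await
        .is_ok()
    {
        panic!("server shutdown returned while handler running");
    }

    // Release the handler; now we can finish waiting for shutdown.
    release_endpoint_tx.send(()).await.unwrap();
    teardown_fut.await;
}
```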

.await
.is_ok()
{
panic!("server shutdown returned while handler running");
jgallagher (Author)

Without the addition of

handler_waitgroup.wait().await;

when constructing join_handle, the test fails here.

release_endpoint_tx.send(()).await.unwrap();

// Now we can finish waiting for server shutdown.
teardown_fut.await;
jgallagher (Author)

Without the explicit

mem::drop(self.app_state);

in close(), this test hangs forever here, because the wait group is still waiting for the last worker (the one held in DropshotState) to be dropped.
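A tiny illustration of that hang (AppState is a stand-in for DropshotState owning a Worker; this is not the actual close() code):

```rust
use waitgroup::{WaitGroup, Worker};

// Stand-in for DropshotState: long-lived server state that owns a Worker.
struct AppState {
    _worker: Worker,
}

#[tokio::main]
async fn main() {
    let wg = WaitGroup::new();
    let app_state = AppState { _worker: wg.worker() };

    // Without this explicit drop, the wait below never resolves, because
    // app_state still owns a live Worker.
    std::mem::drop(app_state);

    wg.wait().await;
}
```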

smklein self-assigned this Jun 14, 2023
smklein (Contributor) left a comment

Thanks for the solid test - always quirky to test these ordering things, but I appreciate that you put in the effort to make this more robust.

@@ -12,6 +12,7 @@ use super::router::HttpRouter;
use super::ProbeRegistration;

use async_stream::stream;
+use debug_ignore::DebugIgnore;
smklein (Contributor)

TIL, this crate seems useful!

smklein removed their assignment Jun 14, 2023
davepacheco (Collaborator) left a comment

Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

@@ -37,6 +38,7 @@ slog-json = "2.6.1"
slog-term = "2.9.0"
tokio-rustls = "0.24.0"
toml = "0.7.4"
+waitgroup = "0.1.2"
davepacheco (Collaborator)

This is fine. I see this pulls in two deps. It's a shame if there's nothing in tokio or futures that can already do this. 🤷

@@ -147,11 +157,13 @@ impl<C: ServerContext> HttpServerStarter<C> {
api,
private,
log,
+handler_waitgroup.worker(),
davepacheco (Collaborator)

This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on? I imagine eventually we'll want to elevate this to an API but that's probably way down the road.

Just to show what I mean, in the past I built something like this:
https://github.com/TritonDataCenter/node-vasync#barrier-coordinate-multiple-concurrent-operations
In that thing, each outstanding operation has a distinct name. You can take the whole object and expose that over a debug HTTP API to ask the server "what are the operations you're waiting on". This proved incredibly useful. But that was more for stuff like our own RSS, where you've got complicated long-running things that could get stuck somewhere. This particular case seems less likely to be a problem. It would just be neat to have a thing like waitgroup but where it was easy to ask it what it was waiting on.
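For illustration only, a rough Rust sketch of that idea (not an existing crate; NamedBarrier, start(), and pending() are made-up names, and it assumes a recent tokio with watch::Receiver::wait_for):

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::watch;

/// Each outstanding operation registers a name; pending() reports what a
/// stuck shutdown is still waiting on.
#[derive(Clone)]
struct NamedBarrier {
    ops: Arc<watch::Sender<HashSet<String>>>,
}

struct OperationGuard {
    ops: Arc<watch::Sender<HashSet<String>>>,
    name: String,
}

impl NamedBarrier {
    fn new() -> Self {
        Self { ops: Arc::new(watch::channel(HashSet::new()).0) }
    }

    /// Register a named operation; it counts as finished when the guard drops.
    fn start(&self, name: &str) -> OperationGuard {
        self.ops.send_modify(|set| {
            set.insert(name.to_string());
        });
        OperationGuard { ops: Arc::clone(&self.ops), name: name.to_string() }
    }

    /// Debug hook: what is still outstanding?
    fn pending(&self) -> Vec<String> {
        self.ops.subscribe().borrow().iter().cloned().collect()
    }

    /// Resolves once every named operation has finished.
    async fn wait(&self) {
        let mut rx = self.ops.subscribe();
        let _ = rx.wait_for(|set| set.is_empty()).await;
    }
}

impl Drop for OperationGuard {
    fn drop(&mut self) {
        self.ops.send_modify(|set| {
            set.remove(&self.name);
        });
    }
}
```

A debug HTTP endpoint could then call pending() to report exactly which named operations a stuck shutdown is still waiting on.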

jgallagher (Author)

> This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on?

Probably not! Internally the waitgroup is just a wrapper around an AtomicWaker, which itself is just a glorified AtomicUsize; I think you could pretty quickly find the count, but assuming it's something like 1, I don't know how you'd find the guilty party.

jgallagher (Author)

> Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.
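As a sketch of that scenario (a hypothetical handler shape, not dropshot's actual websocket API):

```rust
use tokio::net::TcpStream;

// Hypothetical handler shape, not dropshot's actual websocket API.
async fn websocket_handler(upgraded: TcpStream) {
    // Move the connection off to a background task (e.g. streaming a serial
    // console). This task outlives the handler future.
    tokio::spawn(async move {
        let _conn = upgraded;
        // ... pump the websocket until the client goes away ...
    });
    // Returning here completes the handler future, so the waitgroup Worker it
    // held is dropped; graceful shutdown cannot see or wait on the spawned task.
}
```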

davepacheco (Collaborator)

> > Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

> I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

> If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.

The thing I'm (low-level) worried about is that someone starts up a WebSocket for something like a serial console (just as an example of something that's a best-effort background thing) and now the server cannot be shut down. It seems like the consumer would probably want to cancel that rather than wait for the client to determine when they can shut down. The same applies to ordinary HTTP requests but those are usually short-lived and should usually have aggressive timers on them so they seem less likely to cause a problem.

If we previously blocked on WebSockets closing, there's no behavior change here, and this is definitely fine.

Base automatically changed from option-to-spawn-handlers to main June 15, 2023 20:46
jgallagher merged commit fc0ffd5 into main Jun 15, 2023
jgallagher deleted the detached-graceful-shutdown branch June 15, 2023 21:31