
Make server shutdown wait on any Detached handler futures #702

Merged: jgallagher merged 1 commit into main from detached-graceful-shutdown on Jun 15, 2023

Conversation

jgallagher (Contributor)

I ended up using a WaitGroup to track outstanding handler futures instead of trying to accumulate them into something like a FuturesUnordered or JoinSet, because we don't actually care about their results: we just want to know when they're all done.
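For reference, the basic pattern looks roughly like this (a minimal sketch against the waitgroup crate's worker()/wait() API as used in this PR; the surrounding code is illustrative, not dropshot's):

```rust
use waitgroup::WaitGroup;

#[tokio::main]
async fn main() {
    let handler_waitgroup = WaitGroup::new();

    for i in 0..3 {
        // Each in-flight handler future holds a Worker; the Worker is
        // dropped when the future completes (or is dropped).
        let worker = handler_waitgroup.worker();
        tokio::spawn(async move {
            let _worker = worker;
            // ... handler body would run here ...
            println!("handler {i} finished");
        });
    }

    // Graceful shutdown: we don't care about the handlers' results, only
    // that every Worker has been dropped.
    handler_waitgroup.wait().await;
}
```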

Testing this was a little weird: we have to force a situation where a detached handler future is still running after server shutdown has been requested. It uses a similar pattern to the tests added in #701, where the handler waits for a message on a channel from the test driver before returning, allowing us to construct a client request and then cancel it, leaving the detached handler still running.
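Roughly, the shape of that test (a self-contained stand-in, not the actual test code: a spawned task holding a waitgroup Worker plays the detached handler, and waiting on the waitgroup plays server shutdown; names like release_endpoint_tx mirror the snippets below):

```rust
use std::time::Duration;
use waitgroup::WaitGroup;

#[tokio::main]
async fn main() {
    // The stand-in handler parks on this channel until the test releases it.
    let (release_endpoint_tx, mut release_endpoint_rx) =
        tokio::sync::mpsc::channel::<()>(1);

    let wg = WaitGroup::new();
    let worker = wg.worker();

    // Stand-in for the detached handler: still running after the client
    // request has been cancelled.
    tokio::spawn(async move {
        let _worker = worker;
        release_endpoint_rx.recv().await;
    });

    // Stand-in for server shutdown: it must not finish while the handler runs.
    let mut teardown_fut = std::pin::pin!(wg.wait());
    if tokio::time::timeout(Duration::from_millis(100), &mut teardown_fut)
        .await
        .is_ok()
    {
        panic!("server shutdown returned while handler running");
    }

    // Release the handler; now we can finish waiting for shutdown.
    release_endpoint_tx.send(()).await.unwrap();
    teardown_fut.await;
}
```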

.await
.is_ok()
{
panic!("server shutdown returned while handler running");
jgallagher (Author)

Without the addition of

handler_waitgroup.wait().await;

when constructing join_handle, the test fails here.

release_endpoint_tx.send(()).await.unwrap();

// Now we can finish waiting for server shutdown.
teardown_fut.await;
jgallagher (Author)

Without the explicit

mem::drop(self.app_state);

in close(), this test hangs forever here, because the wait group is still waiting for the last worker (the one held in DropshotState) to be dropped.
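A tiny illustration of that hang (AppState is a stand-in for DropshotState owning a Worker; this is not the actual close() code):

```rust
use waitgroup::{WaitGroup, Worker};

// Stand-in for DropshotState: long-lived server state that owns a Worker.
struct AppState {
    _worker: Worker,
}

#[tokio::main]
async fn main() {
    let wg = WaitGroup::new();
    let app_state = AppState { _worker: wg.worker() };

    // Without this explicit drop, the wait below never resolves, because
    // app_state still owns a live Worker.
    std::mem::drop(app_state);

    wg.wait().await;
}
```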

smklein self-assigned this Jun 14, 2023
smklein (Contributor) left a comment

Thanks for the solid test - always quirky to test these ordering things, but I appreciate that you put in the effort to make this more robust.

@@ -12,6 +12,7 @@ use super::router::HttpRouter;
use super::ProbeRegistration;

use async_stream::stream;
+use debug_ignore::DebugIgnore;
smklein (Contributor)

TIL, this crate seems useful!

smklein removed their assignment Jun 14, 2023
davepacheco (Collaborator) left a comment

Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

@@ -37,6 +38,7 @@ slog-json = "2.6.1"
slog-term = "2.9.0"
tokio-rustls = "0.24.0"
toml = "0.7.4"
+waitgroup = "0.1.2"
davepacheco (Collaborator)

This is fine. I see this pulls in two deps. It's a shame if there's nothing in tokio or futures that can already do this. 🤷

@@ -147,11 +157,13 @@ impl<C: ServerContext> HttpServerStarter<C> {
api,
private,
log,
+handler_waitgroup.worker(),
davepacheco (Collaborator)

This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on? I imagine eventually we'll want to elevate this to an API but that's probably way down the road.

Just to show what I mean, in the past I built something like this:
https://github.com/TritonDataCenter/node-vasync#barrier-coordinate-multiple-concurrent-operations
In that thing, each outstanding operation has a distinct name. You can take the whole object and expose that over a debug HTTP API to ask the server "what are the operations you're waiting on". This proved incredibly useful. But that was more for stuff like our own RSS, where you've got complicated long-running things that could get stuck somewhere. This particular case seems less likely to be a problem. It would just be neat to have a thing like waitgroup but where it was easy to ask it what it was waiting on.
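For illustration only, a rough Rust sketch of that idea (not an existing crate; NamedBarrier, start(), and pending() are made-up names, and it assumes a recent tokio with watch::Receiver::wait_for):

```rust
use std::collections::HashSet;
use std::sync::Arc;
use tokio::sync::watch;

/// Each outstanding operation registers a name; pending() reports what a
/// stuck shutdown is still waiting on.
#[derive(Clone)]
struct NamedBarrier {
    ops: Arc<watch::Sender<HashSet<String>>>,
}

struct OperationGuard {
    ops: Arc<watch::Sender<HashSet<String>>>,
    name: String,
}

impl NamedBarrier {
    fn new() -> Self {
        Self { ops: Arc::new(watch::channel(HashSet::new()).0) }
    }

    /// Register a named operation; it counts as finished when the guard drops.
    fn start(&self, name: &str) -> OperationGuard {
        self.ops.send_modify(|set| {
            set.insert(name.to_string());
        });
        OperationGuard { ops: Arc::clone(&self.ops), name: name.to_string() }
    }

    /// Debug hook: what is still outstanding?
    fn pending(&self) -> Vec<String> {
        self.ops.subscribe().borrow().iter().cloned().collect()
    }

    /// Resolves once every named operation has finished.
    async fn wait(&self) {
        let mut rx = self.ops.subscribe();
        let _ = rx.wait_for(|set| set.is_empty()).await;
    }
}

impl Drop for OperationGuard {
    fn drop(&mut self) {
        self.ops.send_modify(|set| {
            set.remove(&self.name);
        });
    }
}
```

A debug HTTP endpoint could then call pending() to report exactly which named operations a stuck shutdown is still waiting on.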

jgallagher (Author)

> This is probably not something to worry about right now, but: I wonder how debuggable this is, either in situ or post mortem. Like if you walk up to a server that's shutting down and stuck waiting on one of these, would you have any way to know which request it was waiting on?

Probably not! Internally the waitgroup is just a wrapper around an AtomicWaker, which itself is just a glorified AtomicUsize; I think you could pretty quickly find the count, but assuming it's something like 1, I don't know how you'd find the guilty party.

jgallagher (Author)

> Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.
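As a sketch of that scenario (a hypothetical handler shape, not dropshot's actual websocket API):

```rust
use tokio::net::TcpStream;

// Hypothetical handler shape, not dropshot's actual websocket API.
async fn websocket_handler(upgraded: TcpStream) {
    // Move the connection off to a background task (e.g. streaming a serial
    // console). This task outlives the handler future.
    tokio::spawn(async move {
        let _conn = upgraded;
        // ... pump the websocket until the client goes away ...
    });
    // Returning here completes the handler future, so the waitgroup Worker it
    // held is dropped; graceful shutdown cannot see or wait on the spawned task.
}
```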

davepacheco (Collaborator)

> > Ugh, now the websocket case makes me wonder whether it's a problem that we don't cancel these things now? Do you happen to know what used to happen before #701 if you shut down a server with request handlers running? What if there was a websocket stream attached?

> I think we did the right thing, at least as far as we can - the web socket upgrade handler happens within the normal handler's future, right? And graceful shutdown waits for all handler futures to complete already.

> If in practice most websocket handlers do their own tokio::spawn() and move the upgraded websocket off to a background task, we wouldn't (and still can't) do anything to wait on that.

The thing I'm (low-level) worried about is that someone starts up a WebSocket for something like a serial console (just as an example of something that's a best-effort background thing) and now the server cannot be shut down. It seems like the consumer would probably want to cancel that rather than wait for the client to determine when they can shut down. The same applies to ordinary HTTP requests but those are usually short-lived and should usually have aggressive timers on them so they seem less likely to cause a problem.

If we previously blocked on WebSockets closing, there's no behavior change here, and this is definitely fine.

Base automatically changed from option-to-spawn-handlers to main June 15, 2023 20:46
jgallagher merged commit fc0ffd5 into main Jun 15, 2023
jgallagher deleted the detached-graceful-shutdown branch June 15, 2023 21:31