
reverseproxy: do not close streams when explicitly configured #5637

Closed
wants to merge 1 commit

Conversation

WeidiDeng
Member

WeidiDeng commented Jul 18, 2023

A stream_close_delay of -1 now means streams will never be closed by Caddy itself.

@francislavoie
Member

🤔 I'm not sure why we would want to allow this, honestly. Streams never closing means the reference to the old config is never dropped. If anything, it should be set to a big number to allow connections to close naturally, not to -1 so they never close.

@WeidiDeng
Member Author

But a large timeout means the timer and the associated config stay referenced for the whole duration of the timeout.

Besides, users would have to configure it themselves.

@francislavoie
Member

francislavoie commented Jul 18, 2023

But with a timer you have a guarantee that memory will be cleaned up at some point. With -1 you have no guarantee and you're essentially asking for memory leaks. Definite foot-gun IMO.

Do you have an actual usecase for this? What problem would this solve? I don't think we should add this until we have a good reason to allow it.

@mholt
Member

mholt commented Jul 18, 2023

I mean, the more I think about it, the more I wonder whether we need that timeout. Especially since we're now allowing streams to survive the config reloads.

Since it's just a stream proxy, either the client or the backend is really responsible for terminating the stream.

@francislavoie
Member

francislavoie commented Jul 18, 2023

Huh? We absolutely do need the close delay. That was the whole point of #5567. It's to provide a way to keep the connections alive for a certain amount of time past config reloads to allow connections to close naturally rather than being forcefully closed, if possible, but still guarantee they're all cleaned up and memory is properly garbage collected.

@WeidiDeng
Member Author

🤔 It seems the existing code already handles the case where all connections close before the timer fires, so a large timeout won't cause the timer to linger longer than necessary. Closing this for now, unless a situation comes up that a big timeout can't solve.
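To illustrate the drain-or-force-close behavior being described, here is a minimal, self-contained Go sketch (illustrative only, not Caddy's actual code; the names streams, closeDelay, and forceCloseAll are made up for the example). The point is that the delay timer is only a backstop: if every surviving stream closes on its own first, the timer is abandoned and the old config can be released right away.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var streams sync.WaitGroup

	// Simulate two hijacked streams that survived a config reload and
	// close on their own after a short while.
	for i := 0; i < 2; i++ {
		streams.Add(1)
		go func(lifetime time.Duration) {
			defer streams.Done()
			time.Sleep(lifetime) // stand-in for the stream's natural lifetime
		}(time.Duration(100*(i+1)) * time.Millisecond)
	}

	closeDelay := 5 * time.Second // stand-in for stream_close_delay
	forceCloseAll := func() { fmt.Println("force-closing remaining streams") }

	drained := make(chan struct{})
	go func() {
		streams.Wait() // returns as soon as every stream has closed
		close(drained)
	}()

	select {
	case <-drained:
		// All streams closed naturally before the delay elapsed, so the
		// timer never lingers longer than necessary.
	case <-time.After(closeDelay):
		forceCloseAll() // backstop: clean up once the configured delay passes
	}
	fmt.Println("old config can now be released")
}
```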

WeidiDeng closed this Jul 18, 2023
francislavoie deleted the reverse-proxy-keep-alive branch July 18, 2023 02:56
@mholt
Member

mholt commented Jul 18, 2023

Sorry; to clarify, let me summarize:

  • Originally we were closing the streams on a config reload to ensure there were no connections using old configs that were left alive, forgotten about, and thus leaked.
  • We're now allowing stream (hijacked) connections to survive config reloads -- if configured -- so that they don't interrupt the client experience. But we still time them out so we can clean up the resources later.
  • Thinking on this more, it may not be our job to time out the connections at all. We are only a proxy. The backend and frontend should be closing those connections for us. Sure, it may be a "leak" if no timer is set, but the backend's job is to close those connections. If there's a leak, it's the backend's fault.

I actually kind of wish there was a way for the proxy, the backend, and the frontend to communicate, like, for the proxy to say: "Hey, I need to close this. If you still need this, can you re-establish a new connection without throwing a fit?" And the frontend could be like "Yeah bro, I'm cool wit dat," and the server could be like, "Sure. Whatever."

Anyway, that's probably wishful thinking.

So, maybe in 2.7, the situation should be as follows (see the sketch after this list):

  • By default, close hijacked connections at reload. This still offers the best resource management and prevents old connections from running old configs; basically it ensures the new config is being run. (Maybe we can change this default later.)
  • A stream close timeout may be set to allow streams to survive config reloads, until the timer fires.
  • A stream close timeout may be set to -1 to allow streams to survive config reloads indefinitely. Yes, this may sort of be considered a leak, but so might the previous scenario where the connection survives a reload for a certain amount of time. In practice, backends shouldn't allow connections indefinitely anyway, especially idle connections. If they do, that's not our problem 🙃
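A rough sketch of how those three settings could map onto behavior at reload (the helper name reloadPolicy and the value handling are illustrative assumptions, not Caddy's actual API):

```go
package main

import (
	"fmt"
	"time"
)

// reloadPolicy is a hypothetical helper describing what happens to
// hijacked streams at config reload for a given stream close delay.
func reloadPolicy(streamCloseDelay time.Duration) string {
	switch {
	case streamCloseDelay == 0:
		return "close hijacked connections immediately at reload (default)"
	case streamCloseDelay > 0:
		return fmt.Sprintf("let streams survive up to %s after a reload", streamCloseDelay)
	default: // any negative value, e.g. -1
		return "never force-close; wait for the client or backend to hang up"
	}
}

func main() {
	for _, d := range []time.Duration{0, 5 * time.Minute, -1} {
		fmt.Printf("%v: %s\n", d, reloadPolicy(d))
	}
}
```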

Does that make sense?

@francislavoie
Member

francislavoie commented Jul 18, 2023

If there's a leak, it's the backend's fault.

I totally disagree with this. The frontend and backend have no incentive to close the connection to make any proxy happy. They'll keep it open as long as reasonably possible; they're greedy. For that reason, as a proxy, we can't trust that either of them will ever close the connection.

Imagine a (not-quite-worst-case) scenario where Caddy is reloaded say every five minutes, but websocket connections can stay open for say 24 hours at worst. If every time the config is reloaded, a new websocket connection comes in and stays open 24 hours, then that's 12 per hour -> 288 per day. So at worst, Caddy could be forced to keep 288 copies of its config in memory. That's obviously ridiculous and could cause OOM etc depending on what else is in the config.
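Spelling out the worst-case arithmetic from that scenario (the numbers are taken from the example above):

```go
package main

import "fmt"

func main() {
	const (
		reloadIntervalMinutes = 5       // a config reload every five minutes
		maxStreamMinutes      = 24 * 60 // a websocket that stays open 24 hours
	)
	// If each reload leaves one such long-lived stream pinned to the old
	// config, this many old configs could be alive at the same time:
	fmt.Println(maxStreamMinutes / reloadIntervalMinutes) // 288
}
```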

(Also, yes a websocket connection could be open 24hrs+, for example if a user opens a browser tab then just leaves it sitting in the background, it'll never disconnect unless forced to, and will probably just be sending heartbeats back and forth forever as long as the connection stays healthy. I see it all the time in my apps.)

"Hey, I need to close this. If you still need this, can you re-establish a new connection without throwing a fit?"

That is what we're doing though. That's why we implemented the websocket close frame. Clients are expected to reconnect when that happens. But that's still disruptive to user experience and can cause thundering herds if all the reconnects are at the exact same time.
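For context, that "go away" signal amounts to writing a WebSocket close frame onto the hijacked connection. Here is a minimal standalone sketch of such a frame (illustrative, not Caddy's exact implementation; close status 1001 "Going Away" is chosen as an example):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeCloseFrame writes an unmasked RFC 6455 close frame carrying a
// two-byte status code. Frames sent by a server are not masked.
func writeCloseFrame(w io.Writer, statusCode uint16) error {
	frame := make([]byte, 4)
	frame[0] = 0x88 // FIN bit set, opcode 0x8 (connection close)
	frame[1] = 0x02 // payload length: just the status code
	binary.BigEndian.PutUint16(frame[2:], statusCode)
	_, err := w.Write(frame)
	return err
}

func main() {
	var buf bytes.Buffer
	if err := writeCloseFrame(&buf, 1001); err != nil { // 1001 = Going Away
		panic(err)
	}
	fmt.Printf("% x\n", buf.Bytes()) // prints: 88 02 03 e9
}
```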

There's no other viable option at our disposal. There's no standard that exists for allowing a proxy to inject information into websocket streams. The traffic can be raw binary (protobuf or w/e), it can be JSON text, etc. We can't reasonably try to invent something either; that's not our place. Who would actually want to implement a Caddy-specific websocket traffic thing in their frontends and backends? I certainly don't, as someone who uses Caddy for this particular thing.

By default, close hijacked connections at reload.

Yep, still the case.

A stream close timeout may be set to allow streams to survive config reloads, until the timer fires.

Yep, that's what we have. I called it a "stream close delay", because we also added a "stream timeout", which is an absolute timeout measured from when the stream is opened, rather than from when the server wanted to try to close it.

A stream close timeout may be set to -1 to allow streams to survive config reloads indefinitely.

But again as I asked earlier, why? What's the usecase for this? Who would need this? What problem would it solve? I'm not seeing it at all.

I don't think it makes sense to provide a config option that obviously and certainly will cause memory leaks. That seems counterproductive.

@mholt
Member

mholt commented Jul 18, 2023

@francislavoie Ok -- yeah, I see what you're saying. Let me try to explain a little more why this might not be a problem (always):

The frontend and backend have no incentive to close the connection to make any proxy happy. They'll keep it open as long as reasonably possible, they're greedy.

I don't think this is true of the backend. If it leaves all connections open forever, even if idle, it'll run out of resources too. It's accepting lots of client connections. So clients, sure -- they generally only have ~1 connection each. But servers have to scale resources to serve their clients.

Imagine a (not-quite-worst-case) scenario where Caddy is reloaded say every five minutes, but websocket connections can stay open for say 24 hours at worst. If every time the config is reloaded, a new websocket connection comes in and stays open 24 hours, then that's 12 per hour -> 288 per day. So at worst, Caddy could be forced to keep 288 copies of its config in memory. That's obviously ridiculous and could cause OOM etc depending on what else is in the config.

I'm not actually sure that the entire config is in memory just because the connection is open. Something or a part of it is definitely in memory (like the reverse proxy part), but I don't know if the whole thing is. I'd have to do some GC analysis, but I'm not quite sure how to do that. 🤔
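(One rough way to do that kind of GC analysis, sketched below and not something Caddy currently does: attach a finalizer to the old config value and see whether it ever runs while a stream still references it. A heap profile from net/http/pprof would answer the same question more rigorously.)

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

type config struct{ name string }

func main() {
	cfg := &config{name: "old config"}
	runtime.SetFinalizer(cfg, func(c *config) {
		fmt.Println(c.name, "was garbage collected")
	})

	cfg = nil // drop our last reference, as a reload would
	for i := 0; i < 3; i++ {
		runtime.GC()                      // finalizer runs once cfg is unreachable
		time.Sleep(10 * time.Millisecond) // give the finalizer goroutine time to print
	}
}
```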

But again as I asked earlier, why? What's the usecase for this? Who would need this? What problem would it solve? I'm not seeing it at all.

Ok yes, valid questions. I think it's less about a specific use case or market, and more about properly aligning the role of the proxy in relation to the backend.

I personally have mixed feelings about what the "right" answer is, so I think it should be configurable. We should probably be able to trust the backend, at least enough to trust that it will close connections when they're no longer needed, to avoid DoS'ing itself. IMO the proxy's only job is to proxy: relay the connection from the frontend to the backend. I still think that in the default case a config reload should probably clean up old connections so they aren't running the old config anymore, but we know that's not suitable for everyone. Ok, so we let them configure some timeouts which can cause connections to survive the reloads. I believe that if the site owner trusts their backend to handle connections properly, there is good reason not to want the proxy to arbitrarily drop them.

Honestly, if I was proxying websockets, I would probably use the -1 (no timeout) setting.

@francislavoie
Member

I don't think this is true of the backend. If it leaves all connections open forever, even if idle, it'll run out of resources too. It's accepting lots of client connections.

Yeah, I agree a properly implemented backend would have an absolute lifetime for its sockets as well and close way-too-old or left-hanging websockets, but there's nothing that dictates that it must. We can't assume that every app does.

Honestly, if I was proxying websockets, I would probably use the -1 (no timeout) setting.

Using -1 would mean basically returning to the status-quo before #4895 (which was opened a year ago yesterday! 😅). I don't think that's desirable at all.

so I think it should be configurable

That's the thing, it is configurable. Just set a reasonably long delay. That prevents connections from being closed immediately, and guarantees that Caddy's memory usage will be kept in check. Best of both. Using -1 completely removes that memory usage guarantee for no real benefit.

@mholt
Member

mholt commented Jul 18, 2023

We can't make the assumption that every app does.

You're right, I agree -- which is why this won't be the default (for now).

Using -1 would mean basically returning to the status-quo before #4895 (which was opened a year ago yesterday! 😅). I don't think that's desirable at all.

I thought the problem before was that if one side closed the connection we didn't also close the other side? 🤔 Maybe I am remembering that wrong. But if I'm right, then that old behavior was definitely a leak, and the -1 value is still different, since we'll still close the other side when one side closes.

That's the thing, it is configurable. Just set a reasonably long delay. Prevents it from being immediately closed, and guarantees that Caddy's memory usage will be kept in check. Best of both. Using -1 completely removes that memory usage guarantee for no real benefit.

I can't argue that memory management is a good thing, but I will say that timeouts -- though perhaps better than nothing -- are kind of a hack in this case. Time doesn't correlate to memory use. You can't know how long it will take to fill up memory. So if you set too low of a timeout you'll bump clients needlessly. Too high and you'll run out of memory -- or your backend will first. The thing is, if Caddy holds onto lots of connections, so will the backend app. I feel like if you configure a -1, you want to optimize for client reliability because you know your backend will manage connections properly. Or maybe your clients are all authenticated, so you can trust them. I dunno.

@francislavoie
Member

I thought the problem before was that if one side closed the connection we didn't also close the other side? 🤔

No, we did close both sides when either closed. We just didn't do anything when config was reloaded.

I can't argue that memory management is a good thing, but I will say that timeouts -- though perhaps better than nothing -- are kind of a hack in this case. Time doesn't correlate to memory use.

Yeah I agree, it's not perfect. But it is still better than nothing, and -1 is nothing.

I think we should do as we usually do, and wait for a user with an actual valid usecase to ask for this if they actually need it, otherwise IMO we just add risky config for no reason.

@mholt
Member

mholt commented Jul 19, 2023

This is difficult logic to argue with.

I'll cede and we can wait for a feature request :)
