
Memory leaks due to un-GCable event handlers #194

Closed
alkoclick opened this issue Apr 14, 2021 · 14 comments

@alkoclick
Collaborator

alkoclick commented Apr 14, 2021

Hey folks! Long time no see!

I come bearing some good and bad news. Let's start with the bad ones: Memory leak.

Describe the bug
Here's our memory usage in Kweb, over a day:
[screenshot: memory usage graph over one day]

It's honestly not great. The 04:00 drop on the left is our nightly container restart, which we put in place because GC alone wasn't keeping memory under control.

To Reproduce
Run any Kweb app for more than 24 hours. You'll start noticing leftovers from client sessions in memory.

Expected behavior
Kweb allows the GC to clean up disconnected clients faster and does not keep references to their callbacks after they disconnect.

Summary of our current understanding

Here's what the GC chain looks like:
[screenshot: GC reference chain]

You'll notice that the reference that is kept is actually a listener.

Suggested paths forward

Remove listeners from the client state properly

I honestly don't have too much insight into this. I really doubt these handlers should stay registered after a client disconnects, so I suspect one of the cleanups is not running properly.

Allow configuring client cleanup timeout

So this is the good news, I guess. While I might misunderstand the functionality here, I believe that enabling faster client cleanups can partially alleviate this problem, and the timeout is a good value to have configurable in the long term. I've implemented this in PR #193.

@frnknglrt

Hello alkoclick,

Nice to hear from you. ;-) I always saw your name when looking at some kweb tickets / contributions and wondered what you are using kweb for.

Just out of curiosity: What's the average request rate your service receives? And what will be the actual value of CLIENT_STATE_TIMEOUT you use in production? Is it the 48h which is set as the default value?

As I have used kweb extensively over the last half a year, I already have some 5 deployments of kweb applications running, some of them in production, but currently only on very light traffic. However, I expect the traffic to grow over time, so I'm really interested in having a solution for this issue.

Best Regards,

Frank

@alkoclick
Collaborator Author

Hey folks, (and especially Frank)

More good news! Here's our memory consumption today:

[screenshot: memory usage graph]

You'll notice that it generally follows a similar pattern to the one above... but at only 30% of the memory footprint, capping out at ~600 MB compared to the previous 2.1 GB.

What are we running?

A modified version of Kweb 0.8.5 (we're stuck there due to #186 and #189), with the patch created in #193 on top of it.

For some background, this runs in a Kubernetes environment where the host allocates 6 GB of memory to K8s. The Kweb container has served 14203 requests today (though, I'll disappoint you, 9367 of those were healthchecks).

We are using KWEB_CLIENT_STATE_TIMEOUT: PT1H in production, so that's one hour. It's served us wonderfully, and we are seeing a massive reduction in memory usage, as demonstrated above. I've been careful to check whether the low timeout causes increased websocket failures or HTTP 500s for clients (I was able to trigger both by setting it to 30s and testing locally). Metrics show both of those failure scenarios decreased today, despite increased client traffic.
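
For reference, the timeout value is an ISO-8601 duration. Here is a minimal Kotlin sketch of parsing it; only the KWEB_CLIENT_STATE_TIMEOUT variable name and the 48-hour default come from this thread, the property name is illustrative:

```kotlin
import java.time.Duration

// Illustrative only: read the client state timeout from the environment,
// falling back to the old 48-hour default when the variable is unset.
val clientStateTimeout: Duration =
    Duration.parse(System.getenv("KWEB_CLIENT_STATE_TIMEOUT") ?: "PT48H")
// e.g. KWEB_CLIENT_STATE_TIMEOUT=PT1H yields a one-hour Duration
```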

From my current viewpoint, the correct client state timeout value for most applications is 1, 2, 4, or 12 hours, depending on usage patterns. The default should definitely not be 48 hours; that's very high. 4 hours is probably a good balance if your clients have long sessions, and 1-2 hours are pretty effective if you have a lot of clients with shorter sessions.

@sanity
Member

sanity commented Apr 16, 2021

Yeah, agreed; I'm not sure why I set the client state timeout so high initially, but 4 hours seems much more reasonable.

If there are still memory leaks it might be that some of the client state should be stored in a Map with weak keys or values, but that would depend on what is leaking.
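
For reference, a weak-valued map can be built with Guava's MapMaker; the following is just a sketch of the idea with placeholder key/value types, not Kweb's actual data structure:

```kotlin
import com.google.common.collect.MapMaker

// Sketch only: a concurrent map whose values are weakly referenced, so an
// entry becomes collectable once nothing else holds the client state object.
val clientStates: MutableMap<String, Any> =
    MapMaker().weakValues().makeMap()
```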

@sanity sanity added the bug label Apr 16, 2021
sanity added a commit that referenced this issue Apr 16, 2021
… state timeout to 20 minutes, but the timeout is reset every time the client state is accessed.
@frnknglrt

Good evening,

IMHO the actual config value depends on the specific use case of kweb. E.g. consider some "monitor" page showing the status of a system or similar. There you won't have any user interaction and the session will run pretty much forever. In those situations I'd say that the 48-hour timeout is a good setting.

Such a system could be implemented quite nicely with kweb. On the server side you watch / listen for state changes and, when they occur, you simply update the UI in the session. Without kweb, all the client-side update infrastructure / polling / making sure that the client is alive / ... is quite some work to get right.

Back to the topic:

Maybe, if I have an hour or half an hour, I'll write some kind of load test to see whether there is a need for the map with weak keys you mentioned, @sanity. I'd expect a Selenium test to be the most effective; a simple HTTP call without a corresponding websocket + session will most likely not trigger any problems.

@sanity
Member

sanity commented Apr 16, 2021

Thanks for the comment. You make a good point about very long-running pages like monitor pages; we need to support that.

I recently committed a change where it uses a Guava cache which will expire RemoteClientStates (RCSs) 4 hours after they were last read. I've also made it remove the client states when the WebSocket terminates, although that will mean that a page refresh is required if the websocket is broken (whereas previously the client could just reconnect and continue).
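
For anyone following along, the shape of that change is roughly as follows. This is a hedged sketch of the Guava API being described, not the actual Kweb code, and the value type and function names are placeholders:

```kotlin
import com.google.common.cache.Cache
import com.google.common.cache.CacheBuilder
import java.util.concurrent.TimeUnit

// Rough sketch of the idea: entries expire 4 hours after they were last
// read, and a websocket-close handler can evict a client explicitly.
val remoteClientStates: Cache<String, Any> = CacheBuilder.newBuilder()
    .expireAfterAccess(4, TimeUnit.HOURS)
    .build()

fun onWebSocketClosed(clientId: String) {
    // explicit removal when the socket terminates
    remoteClientStates.invalidate(clientId)
}
```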

Perhaps the best approach would be:

  • RCSs aren't deleted until a long time after their last read (my recent change to make this 4 hours may be sufficient, but it could be longer)
  • If a websocket disconnects, the corresponding RCS isn't removed immediately; the client is given some time to reconnect - perhaps 5 minutes
  • Verify that the JavaScript behaves intelligently in the event of a websocket disconnect, ideally with some kind of exponential backoff algorithm (see the sketch below)

Thoughts?
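
The reconnect logic itself would live in Kweb's client-side JavaScript, but as a sketch of the backoff algorithm's shape (written in Kotlin for consistency; the names and default values here are mine, not Kweb's):

```kotlin
import kotlinx.coroutines.delay

// Illustrative exponential-backoff loop: retry `connect` with doubling
// delays, capped at maxDelayMs, until a connection attempt succeeds.
suspend fun reconnectWithBackoff(
    initialDelayMs: Long = 1_000,
    maxDelayMs: Long = 60_000,
    connect: suspend () -> Boolean
) {
    var delayMs = initialDelayMs
    while (!connect()) {
        delay(delayMs)
        delayMs = (delayMs * 2).coerceAtMost(maxDelayMs)
    }
}
```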

@alkoclick
Collaborator Author

I've also made it remove the client states when the WebSocket terminates, although that will mean that a page refresh is required if the websocket is broken (whereas previously the client could just reconnect and continue).

I'm a bit cautious about this tbh, it sounds good on paper, but not sure if it'll have any unintended consequences. We'll find out I guess :D

I like the proposal you've made, but I feel like it requires some technical work, and there should be something simpler. So my question is: is there any heartbeat in place? Because if we add a heartbeat for active clients, say once per minute, then we can use it to detect whether a session is still alive. It would probably use the existing ws channels, which means it would keep the "last read" time recent enough that cleanup would not remove these long-running clients (sketched after the list below).

So in practice:

  • The 4h deletion remains, and could even drop lower (1h or 30m), since we'd know those cleanups only hit inactive sessions
  • If a websocket disconnects, take no action; the cleanup will get to it
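
As a hedged illustration of that heartbeat idea (the function names and the Guava-style expire-after-access cache are assumptions, not Kweb's actual API):

```kotlin
import com.google.common.cache.Cache
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

// Illustrative sketch of a per-client heartbeat loop. `sendPing` and the
// cache are placeholders; the point is that touching the entry once a minute
// keeps an expire-after-access timeout from evicting a live session.
fun CoroutineScope.launchHeartbeat(
    clientId: String,
    clientStates: Cache<String, Any>,
    sendPing: suspend () -> Unit
) = launch {
    while (isActive) {
        delay(60_000)                       // one heartbeat per minute
        sendPing()                          // goes over the existing ws channel
        clientStates.getIfPresent(clientId) // a read resets the access timer
    }
}
```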

P.S. Note that we've largely addressed the crux of the issue here (#194), so we should consider moving this discussion to a new issue.

@sanity
Member

sanity commented Apr 17, 2021

I'm a bit cautious about this tbh, it sounds good on paper, but not sure if it'll have any unintended consequences. We'll find out I guess :D

I'm not married to the change; it's just in master, and I won't do a release until we're happy with the solution (I should probably have done it in a branch). I agree that there may be unintended consequences.

Yeah, I did think about a heartbeat, we don't currently have one at the "kweb" level, but there is one at the Ktor websocket level (which I assume will ensure that a dropped websocket connection will be detected).
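
For context, the Ktor-level ping is configured when the WebSockets feature is installed. A minimal sketch against the Ktor 1.x server API that was current at the time; the values are illustrative, not necessarily what Kweb sets:

```kotlin
import io.ktor.application.Application
import io.ktor.application.install
import io.ktor.websocket.WebSockets
import java.time.Duration

// Illustrative values only: Ktor sends pings every pingPeriod and closes the
// connection if no pong arrives within timeout, so dead sockets get detected.
fun Application.configureWebSockets() {
    install(WebSockets) {
        pingPeriod = Duration.ofMinutes(1)
        timeout = Duration.ofSeconds(15)
    }
}
```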

I like your proposal; I think it's simpler and shouldn't drop connections prematurely (in the event of a temporary websocket disconnection).

The only thing is that a heartbeat message will probably require a new message type, which may conflict with @Derek52's work in #190 (although not necessarily too much of a headache).

@alkoclick
Collaborator Author

Yeah, I did think about a heartbeat, we don't currently have one at the "kweb" level, but there is one at the Ktor websocket level

That looks super useful; I think we can base a solution on that.

@sanity
Member

sanity commented May 14, 2021

@alkoclick I believe that the 0.9.0 release should fix the issues you mentioned you were stuck on.

@sanity
Member

sanity commented May 17, 2021

@alkoclick Just wanted to check in and see whether this was still an issue?

@alkoclick
Collaborator Author

Hey @sanity, I'm aware of the update and will post back here once I can confirm this is resolved, but the multitude of breaking changes in recent versions of Kweb makes upgrading significantly more effort.

@sanity
Member

sanity commented May 17, 2021

@alkoclick Yeah, the breaking changes are unfortunate, but I'm trying to get the API stable before the 1.0.0 release; after that we will be a lot more conservative. I think we're past the worst of it now, and the remaining pre-1.0.0 items shouldn't require breaking changes.

sanity added a commit that referenced this issue May 23, 2021
@sanity
Member

sanity commented May 23, 2021

0.9.3 should improve things here: the ClientState will now only be dropped after 4 hours, or when the client notifies the server that the page has been unloaded (i.e. someone leaves the page).

Going to close this for now, please reopen if there are still issues.

@sanity sanity closed this as completed May 23, 2021
@sanity
Member

sanity commented May 24, 2021

Just a note that 0.9.6 should significantly improve server memory usage by reducing the client state cache timeout to 5 minutes, and adding a keepalive that fires every minute.
