
Memory leaks due to un-GCable event handlers #194

Closed
alkoclick opened this issue Apr 14, 2021 · 14 comments

@alkoclick
Collaborator

alkoclick commented Apr 14, 2021

Hey folks! Long time no see!

I come bearing some good and bad news. Let's start with the bad ones: Memory leak.

Describe the bug
Here's our memory usage in Kweb, over a day:
[screenshot: memory usage graph over one day]

It's honestly not great. The 04:00 drop on the left is our nightly container restart, which we put in place because GC alone wasn't keeping memory under control.

To Reproduce
Run any Kweb app for more than 24 hours. You'll start noticing leftovers from client sessions in memory.

Expected behavior
Kweb allows the GC to clean up disconnected clients faster and does not keep references to their callbacks after they disconnect.

Summary of our current understanding

Here's what the GC chain looks like:
[screenshot: GC reference chain]

You'll notice that the reference that is kept is actually a listener.

Suggested paths forward

Remove listeners from the client state properly

I honestly don't have too much insight into this. I really doubt these handlers should stay registered after a client disconnects, so I suspect one of the cleanups is not running properly.

Allow configuring client cleanup timeout

So this is the good news, I guess. While I might misunderstand the functionality here, I believe that enabling faster client cleanups can partially alleviate this problem, and the timeout is a good value to have configurable in the long term. I've implemented this in PR #193.

@frnknglrt

Hello alkoclick,

Nice to hear from you. ;-) I always saw your name when looking at some kweb tickets / contributions and wondered what you are using kweb for.

Just out of curiosity: What's the average request rate your service receives? And what will be the actual value of CLIENT_STATE_TIMEOUT you use in production? Is it the 48h which is set as the default value?

As I have used kweb extensively over the last half a year, I already have some 5 deployments of kweb applications running, some of them in production, but currently only on very light traffic. However, I expect the traffic to grow over time, so I'm really interested in having a solution for this issue.

Best Regards,

Frank

@alkoclick
Collaborator Author

Hey folks, (and especially Frank)

More good news! Here's our memory consumption today:

[screenshot: memory usage graph]

You'll notice that it generally follows a similar pattern to the one above... but at only 30% of the memory footprint, capping out at ~600 MB compared to the previous 2.1 GB.

What are we running?

A modified version of Kweb 0.8.5 (we're stuck there due to #186 and #189), with the patch created in #193 on top of it.

For some background, this runs in a Kubernetes environment where the host allocates 6 GB of memory to K8s. The Kweb container has served 14203 requests today (though, I'll disappoint you, 9367 of those were healthchecks).

We are using KWEB_CLIENT_STATE_TIMEOUT: PT1H in production, so that's one hour. It's served us wonderfully, and we are seeing a massive reduction in memory usage, as demonstrated above. I've been careful to check whether the low timeout causes increased websocket failures or HTTP 500s for clients (I was able to trigger both by setting it to 30s and testing locally). Metrics show both of those failure scenarios decreased today, despite increased client traffic.
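
For reference, the timeout value is an ISO-8601 duration. Here is a minimal Kotlin sketch of parsing it; only the KWEB_CLIENT_STATE_TIMEOUT variable name and the 48-hour default come from this thread, the property name is illustrative:

```kotlin
import java.time.Duration

// Illustrative only: read the client state timeout from the environment,
// falling back to the old 48-hour default when the variable is unset.
val clientStateTimeout: Duration =
    Duration.parse(System.getenv("KWEB_CLIENT_STATE_TIMEOUT") ?: "PT48H")
// e.g. KWEB_CLIENT_STATE_TIMEOUT=PT1H yields a one-hour Duration
```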

From my current viewpoint, the correct client state timeout value for most applications is 1, 2, 4, or 12 hours, depending on usage patterns. The default should definitely not be 48 hours; that's very high. 4 hours is probably a good balance if your clients have long sessions, and 1-2 hours are pretty effective if you have a lot of clients with shorter sessions.

@sanity
Member

sanity commented Apr 16, 2021

Yeah, agreed; I'm not sure why I set the client state timeout so high initially, but 4 hours seems much more reasonable.

If there are still memory leaks it might be that some of the client state should be stored in a Map with weak keys or values, but that would depend on what is leaking.
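
For reference, a weak-valued map can be built with Guava's MapMaker; the following is just a sketch of the idea with placeholder key/value types, not Kweb's actual data structure:

```kotlin
import com.google.common.collect.MapMaker

// Sketch only: a concurrent map whose values are weakly referenced, so an
// entry becomes collectable once nothing else holds the client state object.
val clientStates: MutableMap<String, Any> =
    MapMaker().weakValues().makeMap()
```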

@sanity sanity added the bug label Apr 16, 2021
sanity added a commit that referenced this issue Apr 16, 2021
… state timeout to 20 minutes, but the timeout is reset every time the client state is accessed.
@frnknglrt

Good evening,

IMHO the actual config value depends on the specific use case of kweb. E.g. consider some "monitor" page showing the status of a system or similar. There you won't have any user interaction and the session will run pretty much forever. In those situations I'd say that the 48-hour timeout is a good setting.

Such a system could be implemented quite nicely with kweb. On the server side you watch / listen for state changes and, when they occur, you simply update the UI in the session. Without kweb, all the client-side update infrastructure / polling / making sure that the client is alive / ... is quite some work to get right.

Back to the topic:

Maybe, if I have an hour or half an hour, I'll write some kind of load test to see whether there is a need for the map with weak keys you mentioned, @sanity. I'd expect a Selenium test to be the most effective; a simple HTTP call without a corresponding websocket + session will most likely not trigger any problems.

@sanity
Member

sanity commented Apr 16, 2021

Thanks for the comment. You make a good point about very long-running pages like monitor pages; we need to support that.

I recently committed a change where it uses a Guava cache which will expire RemoteClientStates (RCSs) 4 hours after they were last read. I've also made it remove the client states when the WebSocket terminates, although that will mean that a page refresh is required if the websocket is broken (whereas previously the client could just reconnect and continue).
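
For anyone following along, the shape of that change is roughly as follows. This is a hedged sketch of the Guava API being described, not the actual Kweb code, and the value type and function names are placeholders:

```kotlin
import com.google.common.cache.Cache
import com.google.common.cache.CacheBuilder
import java.util.concurrent.TimeUnit

// Rough sketch of the idea: entries expire 4 hours after they were last
// read, and a websocket-close handler can evict a client explicitly.
val remoteClientStates: Cache<String, Any> = CacheBuilder.newBuilder()
    .expireAfterAccess(4, TimeUnit.HOURS)
    .build()

fun onWebSocketClosed(clientId: String) {
    // explicit removal when the socket terminates
    remoteClientStates.invalidate(clientId)
}
```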

Perhaps the best approach would be:

  • RCSs aren't deleted until a long time after their last read (my recent change to make this 4 hours may be sufficient, but it could be longer)
  • If a websocket disconnects, the corresponding RCS isn't removed immediately; the client is given some time to reconnect - perhaps 5 minutes
  • Verify that the JavaScript behaves intelligently in the event of a websocket disconnect, ideally with some kind of exponential backoff algorithm (see the sketch below)

Thoughts?
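
The reconnect logic itself would live in Kweb's client-side JavaScript, but as a sketch of the backoff algorithm's shape (written in Kotlin for consistency; the names and default values here are mine, not Kweb's):

```kotlin
import kotlinx.coroutines.delay

// Illustrative exponential-backoff loop: retry `connect` with doubling
// delays, capped at maxDelayMs, until a connection attempt succeeds.
suspend fun reconnectWithBackoff(
    initialDelayMs: Long = 1_000,
    maxDelayMs: Long = 60_000,
    connect: suspend () -> Boolean
) {
    var delayMs = initialDelayMs
    while (!connect()) {
        delay(delayMs)
        delayMs = (delayMs * 2).coerceAtMost(maxDelayMs)
    }
}
```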

@alkoclick
Collaborator Author

I've also made it remove the client states when the WebSocket terminates, although that will mean that a page refresh is required if the websocket is broken (whereas previously the client could just reconnect and continue).

I'm a bit cautious about this tbh, it sounds good on paper, but not sure if it'll have any unintended consequences. We'll find out I guess :D

I like the proposal you've made, but I feel like it requires some technical work, and there should be something simpler. So my question is: is there any heartbeat in place? Because if we add a heartbeat for active clients, say once per minute, then we can use it to detect whether a session is still alive. It would probably use the existing ws channels, which means it would keep the "last read" time recent enough that cleanup would not remove these long-running clients (sketched after the list below).

So in practice:

  • The 4h deletion remains, and could even drop lower (1h or 30m), since we'd know those cleanups only hit inactive sessions
  • If a websocket disconnects, take no action; the cleanup will get to it
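
As a hedged illustration of that heartbeat idea (the function names and the Guava-style expire-after-access cache are assumptions, not Kweb's actual API):

```kotlin
import com.google.common.cache.Cache
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

// Illustrative sketch of a per-client heartbeat loop. `sendPing` and the
// cache are placeholders; the point is that touching the entry once a minute
// keeps an expire-after-access timeout from evicting a live session.
fun CoroutineScope.launchHeartbeat(
    clientId: String,
    clientStates: Cache<String, Any>,
    sendPing: suspend () -> Unit
) = launch {
    while (isActive) {
        delay(60_000)                       // one heartbeat per minute
        sendPing()                          // goes over the existing ws channel
        clientStates.getIfPresent(clientId) // a read resets the access timer
    }
}
```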

P.S. Note that we've largely addressed the crux of the issue here (#194), so we should consider moving this discussion to a new issue.

@sanity
Member

sanity commented Apr 17, 2021

I'm a bit cautious about this tbh, it sounds good on paper, but not sure if it'll have any unintended consequences. We'll find out I guess :D

I'm not married to the change; it's just in master, and I won't do a release until we're happy with the solution (I should probably have done it in a branch). I agree that there may be unintended consequences.

Yeah, I did think about a heartbeat, we don't currently have one at the "kweb" level, but there is one at the Ktor websocket level (which I assume will ensure that a dropped websocket connection will be detected).
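
For context, the Ktor-level ping is configured when the WebSockets feature is installed. A minimal sketch against the Ktor 1.x server API that was current at the time; the values are illustrative, not necessarily what Kweb sets:

```kotlin
import io.ktor.application.Application
import io.ktor.application.install
import io.ktor.websocket.WebSockets
import java.time.Duration

// Illustrative values only: Ktor sends pings every pingPeriod and closes the
// connection if no pong arrives within timeout, so dead sockets get detected.
fun Application.configureWebSockets() {
    install(WebSockets) {
        pingPeriod = Duration.ofMinutes(1)
        timeout = Duration.ofSeconds(15)
    }
}
```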

I like your proposal; I think it's simpler and shouldn't drop connections prematurely (in the event of a temporary websocket disconnection).

The only thing is that a heartbeat message will probably require a new message type, which may conflict with @Derek52's work in #190 (although not necessarily too much of a headache).

@alkoclick
Collaborator Author

Yeah, I did think about a heartbeat, we don't currently have one at the "kweb" level, but there is one at the Ktor websocket level

That looks super useful; I think we can base a solution on that.

@sanity
Member

sanity commented May 14, 2021

@alkoclick I believe that the 0.9.0 release should fix the issues you mentioned you were stuck on.

@sanity
Member

sanity commented May 17, 2021

@alkoclick Just wanted to check in and see whether this was still an issue?

@alkoclick
Collaborator Author

Hey @sanity, I'm aware of the update and will post back here once I can confirm this is resolved, but the multitude of breaking changes in recent versions of Kweb makes upgrading significantly more effort.

@sanity
Member

sanity commented May 17, 2021

@alkoclick Yeah, the breaking changes are unfortunate, but I'm trying to get the API stable before the 1.0.0 release; after that we will be a lot more conservative. I think we're past the worst of it now, and the remaining pre-1.0.0 items shouldn't require breaking changes.

sanity added a commit that referenced this issue May 23, 2021
@sanity
Member

sanity commented May 23, 2021

0.9.3 should improve things here: the ClientState will now only be dropped after 4 hours, or when the client notifies the server that the page has been unloaded (i.e. someone leaves the page).

Going to close this for now, please reopen if there are still issues.

@sanity sanity closed this as completed May 23, 2021
@sanity
Member

sanity commented May 24, 2021

Just a note that 0.9.6 should significantly improve server memory usage by reducing the client state cache timeout to 5 minutes, and adding a keepalive that fires every minute.
