Fix resource leak caused by not cleaning up unused threads #3734

freimair · 2019-12-03T12:31:36Z

We had a small memory leak in the code base. Namely, some thread pools have not been shutdown when they have been no longer needed. Result was that the threads and the parent threads have been kept alive which lead to hundreds of stale threads over the course of several days.

We had a small memory leak in the code base. Namely, there have been some threadpools in use but not shutdown when they have been no longer needed. Result was that the threads and the parent threads have been kept alive which lead to hundreds of stale threads over the course of several days.

ripcurlx · 2019-12-03T12:55:33Z

@freimair Could you extend the title a little bit, so it is more clear what kind of memory leak is fixed if someone is just looking through the PR titles (e.g. during release note creation)

ripcurlx

utACK

julianknutsen · 2019-12-03T17:34:23Z

What was the memory footprint change after running with your patch? One leaked executor per connection does seem bad, but it would be good to understand the full impact.

Any additional information on which test cases may need to be re-run to verify everything works as expected if this goes into v1.2.4?

Any screenshots or tools we can use to help verify the new behavior?

julianknutsen · 2019-12-03T17:38:10Z

I think this type of bug is way too easy to make right now. Does anyone have an RAII solution to these single thread pool use cases that could be added to guarantee that shutdownNow() is called for each of these resources so they can be GCed?

Long term, it may make more sense to attach the per-connection executors to the manager object instead. Right now the threads grow with the number of connections which depends on the manager logic to decide how many connections to keep around.

This is much harder to understand from a performance point of view than a parent thread pool with a fixed size regardless of the number of connections.

julianknutsen

BUNDLE_OF_ENVELOPES looks deprecated so nothing should be scheduled on the executor, right?. What is the supported backwards-compatibility here and Is there a point to even keeping this code path around?

I tested v1.2.4 without this patch and don't see the threads increasing with simple disconnect/reconnect. Is there something I need to do to exercise the code path?

freimair · 2019-12-03T23:54:42Z

What was the memory footprint change after running with your patch?

Just as I attempted to get numbers, jProfiler crashed after a 48h test run and with it my bisq-under-test and the whole gnome shell. I did, however, see the fix doing its job. All in all, I decided to not waste another couple of days to rerun anything but put the PR up for merging as this is a real bug that needs fixing.

One leaked executor per connection does seem bad, but it would be good to understand the full impact.

due to the executors not being shut down, these threads stayed alive and with them, the parent threads per connection as well. Over a couple of days this behavior lead to hundreds of (unused) threads. But only if there is high network load and the Bundle_of_envelopes feature triggered.
Overall, the memory-leak part hasn't been that bad as it seems (there isn't that much change in a clients connection pool) but I have seen increased CPU usage as well - although I have no hard evidence that this is related.

A JVM profiler showed me that there has been a huge number of threads all in waiting-state. So I tracked down their source, found that there is indeed a bug causing a "thread-leak" and fixed it. My guess is that with the recently forced update, the bundle-of-envelopes stuff finally went really active as all clients made use of this.

I honestly do not know if that bug is responsible for some resource-related issues reported recently. There has been incorrect behavior and it is fixed now. I would leave it for time to tell if it has been that great of an impact in the first place.

Any additional information on which test cases may need to be re-run to verify everything works as expected if this goes into v1.2.4?

As always with our P2P part, there is no exhaustive testing possible right now. Since the change only cleanes up stuff, there should be no need for any testplan testcases to be redone. However, I suggest to maybe redo the trading tests just to be sure.

Any screenshots or tools we can use to help verify the new behavior?

Well take a JVM profiler, start up 1.2.3 and this fix and compare the number of threads after 2 days of runtime. With the fix, there should be substantially less threads. Please note that I am not entirely confident that there isn't another "thread-leak" somewhere, still. But that is for another day.

One leaked executor per connection does seem bad, but it would be good to understand the full impact.

Agreed. Generally, thread handling and synchronization is a mess right now. Needs fixing sometime.

freimair · 2019-12-04T00:02:43Z

BUNDLE_OF_ENVELOPES looks deprecated so nothing should be scheduled on the executor, right?. What is the supported backwards-compatibility here and Is there a point to even keeping this code path around?

yes, the capability "BUNDLE_OF_ENVELOPES" has been deprecated as we made it default with the forced update. So the feature is still there and active with current clients, but the capability is not required anymore.

I tested v1.2.4 without this patch and don't see the threads increasing with simple disconnect/reconnect. Is there something I need to do to exercise the code path?

The code path only gets triggered if the throttle limits get hit (defined in

bisq/p2p/src/main/java/bisq/network/p2p/network/Connection.java

Line 193 in 037f167

    
           connectionConfig = new ConnectionConfig(MSG_THROTTLE_PER_SEC, MSG_THROTTLE_PER_10_SEC, SEND_MSG_THROTTLE_TRIGGER, SEND_MSG_THROTTLE_SLEEP);

)

as said above, this stuff is hard to test. After initial tests I had either version run in production for days with a profiler attached...

julianknutsen

ACK

Luckily verifying this didn't take the two days you suggested. I spent some time learning more about jprofiler views and updating the executor to set the thread name so it could be tracked. Here is the comparison before and after your patch to verify the threads no longer leak:

Before: (bundleSender > # connected peers) & no dead thread

After: (bundleSender == # connected peers) & 1 dead thread

From the heap inspection, it looks like each executor is about 300-600 bytes each so that is the magnitude of the leak on each disconnect that isn't cleaned up.

Testing

I ran this through a few localnet smoke tests that take nodes up and down cases to verify no egregious stack traces. As suggested, it is probably a good idea to rerun trade tests that take nodes up and down if this is cherry-picked to the branch.

I audited the other executor shutdown paths. It follows the same pattern as the other executor in the Connection object so any bug in the shutdown path would likely have triggered there already.

freimair requested a review from ripcurlx as a code owner December 3, 2019 12:31

ripcurlx approved these changes Dec 3, 2019

View reviewed changes

freimair changed the title ~~Fix memory leak~~ Fix memory leak caused by not cleaning up unused threads Dec 3, 2019

freimair mentioned this pull request Dec 3, 2019

For Cycle 8 bisq-network/compensation#411

Closed

julianknutsen reviewed Dec 3, 2019

View reviewed changes

freimair changed the title ~~Fix memory leak caused by not cleaning up unused threads~~ Fix resource leak caused by not cleaning up unused threads Dec 3, 2019

julianknutsen approved these changes Dec 4, 2019

View reviewed changes

julianknutsen mentioned this pull request Dec 4, 2019

For Cycle 8 bisq-network/compensation#413

Closed

ripcurlx merged commit d0bed93 into bisq-network:master Dec 4, 2019

freimair deleted the fix_thread_leak branch December 4, 2019 14:30

freimair restored the fix_thread_leak branch December 6, 2019 19:39

freimair deleted the fix_thread_leak branch December 6, 2019 19:46

ripcurlx mentioned this pull request Dec 13, 2019

For Cycle 8 bisq-network/compensation#439

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix resource leak caused by not cleaning up unused threads #3734

Fix resource leak caused by not cleaning up unused threads #3734

freimair commented Dec 3, 2019

ripcurlx commented Dec 3, 2019

ripcurlx left a comment

julianknutsen commented Dec 3, 2019

julianknutsen commented Dec 3, 2019

julianknutsen left a comment

freimair commented Dec 3, 2019

freimair commented Dec 4, 2019

julianknutsen left a comment •

edited

Loading

Fix resource leak caused by not cleaning up unused threads #3734

Fix resource leak caused by not cleaning up unused threads #3734

Conversation

freimair commented Dec 3, 2019

ripcurlx commented Dec 3, 2019

ripcurlx left a comment

Choose a reason for hiding this comment

julianknutsen commented Dec 3, 2019

julianknutsen commented Dec 3, 2019

julianknutsen left a comment

Choose a reason for hiding this comment

freimair commented Dec 3, 2019

freimair commented Dec 4, 2019

julianknutsen left a comment • edited Loading

Choose a reason for hiding this comment

julianknutsen left a comment •

edited

Loading