[bug] aquatic-ws - Memory Leak #169

Closed
SilentBot1 opened this issue Jan 5, 2024 · 9 comments

@SilentBot1

After leaving an instance of aquatic-ws @ e2a3211 running on an Ubuntu 22.04.3 LTS machine with this configuration, there appears to be a memory leak in the aquatic_ws process: memory usage increases over the span of multiple days until the process crashes after exhausting all available system memory (in my case ~2.4 GB free, out of 4 GB total).

An example of peer counts, message throughput and usage can be seen in the following image:
[image: graph of peer counts, message throughput and memory usage]

The leak appears to be related to peer count in some way: memory usage grows faster under high load and more slowly under low load, but nonetheless increases over time.

I plan to disable metrics to verify whether they could be the cause and will report back if that alleviates the issue, but I thought it best to open the issue first and provide updates as I troubleshoot.
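
For anyone who wants to quantify the growth rate themselves, a minimal standalone sketch along these lines (not part of aquatic; the program name and the 60-second polling interval are arbitrary) can log the tracker process's resident memory from /proc on Linux:

```rust
// rss-logger: periodically print the resident set size (VmRSS) of a process.
use std::{env, fs, thread, time::Duration};

fn rss_kib(pid: u32) -> Option<u64> {
    // /proc/<pid>/status contains a "VmRSS:   <n> kB" line on Linux.
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    let pid: u32 = env::args()
        .nth(1)
        .expect("usage: rss-logger <pid>")
        .parse()
        .expect("pid must be a number");

    loop {
        match rss_kib(pid) {
            Some(kib) => println!("VmRSS: {kib} KiB"),
            None => println!("process {pid} not found or not readable"),
        }
        thread::sleep(Duration::from_secs(60));
    }
}
```

Running it as `rss-logger $(pidof aquatic_ws)` and plotting the output alongside peer counts makes it easy to compare growth rates between runs.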

@greatest-ape
Owner

greatest-ape commented Jan 6, 2024

Thanks. I’ve opened an issue in the repository of glommio, the async runtime that I’m using. But it would still be great to see the results without metrics. I suspect that https://docs.rs/metrics-exporter-prometheus/latest/metrics_exporter_prometheus/ doesn’t free memory (for peer clients and peer id prefixes), and I would be interested in seeing whether that is indeed the case, and if so, how much of the aquatic leak comes from metrics.
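
For illustration, the pattern I have in mind looks roughly like this (a minimal sketch with hypothetical metric and label names, assuming the metrics 0.22-style macro API, not aquatic's actual code):

```rust
use metrics::gauge;
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Install the Prometheus recorder; with default features this also
    // spawns an HTTP listener serving the scrape endpoint.
    PrometheusBuilder::new()
        .install()
        .expect("failed to install Prometheus recorder");

    // Every distinct label value registers a new time series in the
    // recorder's registry. If label values track churning peers (client
    // strings, peer id prefixes), the suspicion is that those series are
    // never dropped when the corresponding peers disconnect, so the
    // registry only ever grows.
    for client in ["BitComet 2.01", "WebTorrent 1.9.7", "libtorrent 2.0"] {
        gauge!("peers_by_client", "client" => client).set(1.0);
    }
}
```

If the recorder keeps one series per label value it has ever seen, the memory held by the exporter grows with the number of distinct clients and peer id prefixes rather than with the number of currently connected peers.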

@SilentBot1
Author

Thanks for looking into this and raising an issue with glommio. It seems like it may be a difficult one to solve. It's not too critical for me at the moment, as a restart once every 5-7 days isn't too bad, but if/as usage increases, I imagine restarts will only become more frequent.

Here is what memory usage looks like for the first 5 hours after a restart, once with statistics enabled and once with them disabled:

Statistics Enabled:

[image: memory usage graph, statistics enabled]

Statistics Disabled:

[image: memory usage graph, statistics disabled]

During the statistics-enabled timeframe, 137k connections were made to the tracker; during the statistics-disabled timeframe, 175k connections were made. This possibly explains the difference between the two at the end of the 5 hours, as total usage ended up higher even with statistics disabled.

It looks like any memory leaking from metrics_exporter_prometheus for peer clients/prefixes is entirely masked by the glommio issue at the moment.

@greatest-ape
Owner

Great, thanks. Yes, it might unfortunately take a while to fix this.

@greatest-ape
Owner

Actually, I came up with an idea to possibly circumvent the issue. In local testing, it seems to fix the leak. Could you please try out the latest commit on master?

@SilentBot1
Author

Thank you, I have just updated and restarted - will keep you posted.

@SilentBot1
Author

Just to provide another update at the 5-hour mark: things are looking a whole lot better:
[image: memory usage graph after updating to the fix]

I will note that the metrics now appear to be undulating after the restart:
[image: undulating peer count metrics after the restart]

After looking into this further, it appears to affect only BitComet peers (which don't actually support WebTorrent and only pull stats from WebSocket trackers), as they don't appear to keep a single socket open continuously:

[image: peer counts broken down by client, showing BitComet fluctuation]

@greatest-ape
Owner

greatest-ape commented Jan 8, 2024

Excellent!

The undulating BitComet counts are somewhat strange, but from your description this seems to be caused by the clients acting in a nonstandard way rather than by the tracker.

@SilentBot1
Author

Just to provide a further update, it seems like things have continued to work as expected with the provided fix from the 7th onwards:

[image: memory usage remaining stable from the 7th onwards]

I'll close the issue, as the fix you've implemented has resolved the leak, though if you would like to keep it open to track the underlying glommio issue, feel free to re-open it.

Thanks again for your help.

@greatest-ape
Owner

Great! Thanks for the detailed reports and for trying out the fix.
