Large Memory Consumption Tracking Issue (OOM) #4918

Closed
AgeManning opened this issue Nov 9, 2023 · 9 comments

@AgeManning (Member) commented Nov 9, 2023

Description

We are aware of an issue on the mainnet network which is causing Lighthouse to consume more memory than it should. This is leading to Out of Memory (OOM) process terminations on some machines.

We believe the root cause is messages being queued in gossipsub to be sent out: a combination of published messages, forwarded messages, and gossipsub control messages. The queues fill up and the memory they hold is not freed. This appears to occur only on mainnet, which we assume is due in part to the size of the network and the number of messages being transmitted.

A number of solutions are being put in place and tested. This is mainly a tracking issue, so that users can follow along with development updates as we correct the problem.

Primarily, the end solution will consist of: more efficient memory management (avoiding duplication of message memory when sending), which should reduce allocations; prioritisation of messages, so that published, forwarded and control messages can each be prioritised individually; and finally a dropping mechanism that allows us to drop messages when the queues grow too large. A rough sketch of these ideas is included at the end of this comment.

Memory Allocations:

Message Prioritisation:
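
To illustrate the above, here is a minimal sketch of the general idea (the type names and queue layout are illustrative assumptions, not the actual rust-libp2p gossipsub code): message bytes are shared behind an `Arc` so that queueing the same message for many peers does not duplicate the allocation, publishes and control messages take priority over forwarded traffic, and the forward queue is bounded so a slow peer cannot grow it without limit.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

// Illustrative only: the real gossipsub types and queue layout differ.
enum OutboundMessage {
    // Payload bytes are reference-counted, so queueing the same message
    // for many peers does not duplicate the allocation.
    Publish(Arc<Vec<u8>>),
    Forward(Arc<Vec<u8>>),
    Control(Arc<Vec<u8>>),
}

struct PeerSendQueue {
    // Own publishes and control frames are kept separate from forwarded
    // traffic so they are always sent first.
    priority: VecDeque<OutboundMessage>,
    forward: VecDeque<OutboundMessage>,
    max_forward: usize,
}

impl PeerSendQueue {
    fn push(&mut self, msg: OutboundMessage) {
        match msg {
            OutboundMessage::Publish(_) | OutboundMessage::Control(_) => {
                self.priority.push_back(msg);
            }
            OutboundMessage::Forward(_) => {
                // Bound the low-priority queue: a slow peer can no longer
                // accumulate an unbounded backlog of forwarded messages.
                if self.forward.len() >= self.max_forward {
                    self.forward.pop_front();
                }
                self.forward.push_back(msg);
            }
        }
    }

    fn next_to_send(&mut self) -> Option<OutboundMessage> {
        self.priority
            .pop_front()
            .or_else(|| self.forward.pop_front())
    }
}
```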

@thomaseizinger commented:

Memory Allocations:

@tobidae-cb commented:

Following - this may be related to an issue we're also seeing - #4953

@AgeManning (Member, Author) commented Nov 30, 2023

To keep people updated on the progress.

We need to update lighthouse to the latest libp2p:
#4935

We have message priority sorted:
libp2p/rust-libp2p#4914

And we have a form of time-bound message dropping:
sigp/rust-libp2p#555

We are going to combine these into a rust-libp2p fork and start testing on live networks.

@AgeManning (Member, Author) commented Dec 6, 2023

Further updates.

We have a rust-libp2p fork which we are now testing. The new features we have added are:

We are adding these to Lighthouse and thoroughly testing ahead of a new lighthouse release.

@thomaseizinger commented:

Great! Looking forward to seeing these in rust-libp2p!

@paulhauner (Member) commented:

Are you happy to close this @AgeManning? If not, perhaps we can at least drop the v4.6.0 tag.

@AgeManning (Member, Author) commented:

This has been resolved in 4.6.0 and the RC.

@diegomrsantos commented:

> We are adding these to Lighthouse and thoroughly testing ahead of a new lighthouse release.

@AgeManning could you please share what the test results were?

@AgeManning (Member, Author) commented Feb 27, 2024

Sure. Sorry for the late reply. @diegomrsantos

I think I've lost all our pretty graphs during all the analysis, but perhaps one of the others can chime in if they saved them or want to look back to previous data.

Fundamentally, Lighthouse beacon nodes (depending on their peers) would OOM. The memory profile grew to the order of 16GB before "randomly" dropping back to 2-4GB. The graphs we were looking at would show memory slowly growing up to these numbers, then occasionally dropping back down.

The drops were due to peers being disconnected and their massive send queues being freed. The growth occurred because slow peers would accumulate a huge queue of messages waiting to be sent to them, bloating Lighthouse's memory footprint.

After our changes, the memory profile stays steady at 2-4GB on all the nodes that ran the patch.

We implemented a fancier queuing system inside gossipsub that prioritises important messages, and if others wait too long to be sent we simply drop them rather than sending them. So if a peer has a backlog, we remove older messages to make way for newer ones.
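
A minimal sketch of the time-bound dropping idea (the names, types and timeout here are illustrative assumptions, not the code from the actual patch): each queued message records when it was enqueued, and anything that has waited past a deadline is freed instead of being sent.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Illustrative sketch only; not the actual gossipsub queue implementation.
struct QueuedMessage {
    queued_at: Instant,
    bytes: Vec<u8>,
}

struct TimedSendQueue {
    inner: VecDeque<QueuedMessage>,
    // Hypothetical deadline; the timeout used in the real patch may differ.
    max_wait: Duration,
}

impl TimedSendQueue {
    fn push(&mut self, bytes: Vec<u8>) {
        self.inner.push_back(QueuedMessage {
            queued_at: Instant::now(),
            bytes,
        });
    }

    /// Returns the next message still worth sending, silently dropping
    /// anything that has waited longer than `max_wait`.
    fn pop_fresh(&mut self) -> Option<Vec<u8>> {
        while let Some(msg) = self.inner.pop_front() {
            if msg.queued_at.elapsed() <= self.max_wait {
                return Some(msg.bytes);
            }
            // Stale: the peer already missed the window for this message,
            // so free the memory instead of sending it.
        }
        None
    }
}
```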

Here is what the queues look like on a normal node with poor bandwidth, currently running on mainnet:

[Graph: per-peer gossipsub send-queue lengths on a mainnet node with poor bandwidth]

The queues are now bounded, and as you can see they never really fill up to their bounds.

I could also show you a memory profile of a current Lighthouse node, but it's fairly bland (which is a good thing), with no wild memory spikes.
