Large Memory Consumption Tracking Issue (OOM) #4918

Closed
AgeManning opened this issue Nov 9, 2023 · 9 comments

@AgeManning (Member) commented Nov 9, 2023

Description

We are aware of an issue on the mainnet network which is causing Lighthouse to consume more memory than it should. This is leading to Out of Memory (OOM) process terminations on some machines.

We believe the root cause is messages being queued in gossipsub to be sent out: a combination of published messages, forwarded messages, and gossipsub control messages. The queues fill up and the memory they hold is not freed. This appears to occur only on mainnet, which we assume is due in part to the size of the network and the number of messages being transmitted.

A number of solutions are being put in place and tested. This is mainly a tracking issue, so that users can follow along with development updates as we correct the problem.

Primarily, the end solution will consist of: more efficient memory management (avoiding duplication of message memory when sending), which should reduce allocations; prioritisation of messages, so that published, forwarded and control messages can each be prioritised individually; and finally a dropping mechanism that allows us to drop messages when the queues grow too large. A rough sketch of these ideas is included at the end of this comment.

Memory Allocations:

Message Prioritisation:
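
To illustrate the above, here is a minimal sketch of the general idea (the type names and queue layout are illustrative assumptions, not the actual rust-libp2p gossipsub code): message bytes are shared behind an `Arc` so that queueing the same message for many peers does not duplicate the allocation, publishes and control messages take priority over forwarded traffic, and the forward queue is bounded so a slow peer cannot grow it without limit.

```rust
use std::collections::VecDeque;
use std::sync::Arc;

// Illustrative only: the real gossipsub types and queue layout differ.
enum OutboundMessage {
    // Payload bytes are reference-counted, so queueing the same message
    // for many peers does not duplicate the allocation.
    Publish(Arc<Vec<u8>>),
    Forward(Arc<Vec<u8>>),
    Control(Arc<Vec<u8>>),
}

struct PeerSendQueue {
    // Own publishes and control frames are kept separate from forwarded
    // traffic so they are always sent first.
    priority: VecDeque<OutboundMessage>,
    forward: VecDeque<OutboundMessage>,
    max_forward: usize,
}

impl PeerSendQueue {
    fn push(&mut self, msg: OutboundMessage) {
        match msg {
            OutboundMessage::Publish(_) | OutboundMessage::Control(_) => {
                self.priority.push_back(msg);
            }
            OutboundMessage::Forward(_) => {
                // Bound the low-priority queue: a slow peer can no longer
                // accumulate an unbounded backlog of forwarded messages.
                if self.forward.len() >= self.max_forward {
                    self.forward.pop_front();
                }
                self.forward.push_back(msg);
            }
        }
    }

    fn next_to_send(&mut self) -> Option<OutboundMessage> {
        self.priority
            .pop_front()
            .or_else(|| self.forward.pop_front())
    }
}
```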

@thomaseizinger commented:

Memory Allocations:

@tobidae-cb commented:

Following - this may be related to an issue we're also seeing - #4953

@AgeManning (Member, Author) commented Nov 30, 2023

To keep people updated on the progress.

We need to update lighthouse to the latest libp2p:
#4935

We have message priority sorted:
libp2p/rust-libp2p#4914

And we have a form of time-bound message dropping:
sigp/rust-libp2p#555

We are going to combine these into a rust-libp2p fork and start testing on live networks.

@AgeManning (Member, Author) commented Dec 6, 2023

Further updates.

We have a rust-libp2p fork which we are now testing. The new features we have added are:

We are adding these to Lighthouse and thoroughly testing ahead of a new lighthouse release.

@thomaseizinger commented:

Great! Looking forward to seeing these in rust-libp2p!

@paulhauner (Member) commented:

Are you happy to close this @AgeManning? If not, perhaps we can at least drop the v4.6.0 tag.

@AgeManning (Member, Author) commented:

This has been resolved in 4.6.0 and the RC.

@diegomrsantos commented:

> We are adding these to Lighthouse and thoroughly testing ahead of a new lighthouse release.

@AgeManning could you please share what the test results were?

@AgeManning (Member, Author) commented Feb 27, 2024

Sure. Sorry for the late reply. @diegomrsantos

I think I've lost all our pretty graphs during all the analysis, but perhaps one of the others can chime in if they saved them or want to look back to previous data.

Fundamentally, Lighthouse beacon nodes (depending on their peers) would OOM. The memory profile grew to the order of 16GB before "randomly" dropping back to 2-4GB. The graphs we were looking at would show memory slowly growing up to these numbers, then occasionally dropping back down.

The drops were due to peers being disconnected and their massive send queues being freed. The growth occurred because slow peers would accumulate a huge queue of messages waiting to be sent to them, bloating Lighthouse's memory footprint.

After our changes, the memory profile stays steady at 2-4GB on all the nodes that ran the patch.

We implemented a fancier queuing system inside gossipsub that prioritises important messages, and if others wait too long to be sent we simply drop them rather than sending them. So if a peer has a backlog, we remove older messages to make way for newer ones.
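
A minimal sketch of the time-bound dropping idea (the names, types and timeout here are illustrative assumptions, not the code from the actual patch): each queued message records when it was enqueued, and anything that has waited past a deadline is freed instead of being sent.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// Illustrative sketch only; not the actual gossipsub queue implementation.
struct QueuedMessage {
    queued_at: Instant,
    bytes: Vec<u8>,
}

struct TimedSendQueue {
    inner: VecDeque<QueuedMessage>,
    // Hypothetical deadline; the timeout used in the real patch may differ.
    max_wait: Duration,
}

impl TimedSendQueue {
    fn push(&mut self, bytes: Vec<u8>) {
        self.inner.push_back(QueuedMessage {
            queued_at: Instant::now(),
            bytes,
        });
    }

    /// Returns the next message still worth sending, silently dropping
    /// anything that has waited longer than `max_wait`.
    fn pop_fresh(&mut self) -> Option<Vec<u8>> {
        while let Some(msg) = self.inner.pop_front() {
            if msg.queued_at.elapsed() <= self.max_wait {
                return Some(msg.bytes);
            }
            // Stale: the peer already missed the window for this message,
            // so free the memory instead of sending it.
        }
        None
    }
}
```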

Here is what the queues look like on a normal node with poor bandwidth, currently running on mainnet:

[Graph: per-peer gossipsub send-queue lengths on a mainnet node with poor bandwidth]

The queues are now bounded, and as you can see they never really fill up to their bounds.

I could also show you a memory profile of a current Lighthouse node, but it's fairly bland (which is a good thing), with no wild memory spikes.
