Memory leak after node finishes syncing with relay chain #11604
Comments
Is this reproducible every time, regardless of the machine configuration, etc.?
Yes, it is happening on all our machines, and they have different CPUs; we tried machines with 8-32 GB of RAM, using either an SSD or an external volume for storage. @riusricardo tried to reproduce it on his laptop, but I'm not sure about the results.
We are experiencing the same issue at Composable, on all machines regardless of configuration. It has only been happening since v0.9.22, though.
I have seen a similar situation after the most recent Substrate upgrade of our chain on one of several nodes on my machine (it consumed over 65 GB of RAM before I killed it; a restart showed that memory was growing again, so I stopped it permanently), but the rest of the nodes on the same machine and on others are somehow fine.
I'm trying to reproduce the issue but so far no luck. If anyone can reproduce this, I'd appreciate it if you could do the following:
Replace the
I recommend doing this on a machine with a lot of memory and a few spare CPU cores, since the profiling also has extra overhead.
Thanks @koute, I started it yesterday and am now waiting for the node to run out of memory. Can you contact me on Element (nzt:matrix.org) so I can provide you with the download link for the files?
@NZT48 Thank you for the data! I have done a preliminary investigation of the problem; there is indeed a memory leak here. The leak is in an unbounded channel for the network service events: the other side of the channel stopped receiving these events (either due to a deadlock, or because it was leaked), so they just keep being pushed into the channel until the system runs out of memory. From what I can see, the problem is the stream returned from here:

```rust
/// Returns a stream containing the events that happen on the network.
///
/// If this method is called multiple times, the events are duplicated.
///
/// The stream never ends (unless the `NetworkWorker` gets shut down).
///
/// The name passed is used to identify the channel in the Prometheus metrics. Note that the
/// parameter is a `&'static str`, and not a `String`, in order to avoid accidentally having
/// an unbounded set of Prometheus metrics, which would be quite bad in terms of memory.
pub fn event_stream(&self, name: &'static str) -> impl Stream<Item = Event> {
    let (tx, rx) = out_events::channel(name);
    let _ = self.to_worker.unbounded_send(ServiceToWorkerMsg::EventStream(tx));
    rx
}
```

In this case over 10 million events were pushed into this channel before the node crashed. We should probably put a hard cap on this channel, or at least spam an error if it gets too big. A high enough cap of, say, 1 million should still make it effectively unbounded in normal cases (based on this data it would have to leak ~2.5 GB of data to reach that many items), but it would make investigating such problems easier in the future. Anyway, I'll continue investigating this.
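Just to make the cap suggestion concrete, below is a minimal, self-contained sketch of the idea: the sender tracks the backlog of unreceived items and complains loudly once it passes a threshold. `CappedSender`, `CappedReceiver`, and `MAX_QUEUED_EVENTS` are hypothetical names used only for illustration; this is not Substrate's actual `out_events` API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};

/// Hypothetical cap: high enough to stay "effectively unbounded" in normal operation.
const MAX_QUEUED_EVENTS: usize = 1_000_000;

/// Sender half that tracks how many items are queued but not yet received.
struct CappedSender<T> {
    inner: mpsc::Sender<T>,
    queued: Arc<AtomicUsize>,
}

/// Receiver half that decrements the backlog counter as items are drained.
struct CappedReceiver<T> {
    inner: mpsc::Receiver<T>,
    queued: Arc<AtomicUsize>,
}

fn capped_channel<T>() -> (CappedSender<T>, CappedReceiver<T>) {
    let (tx, rx) = mpsc::channel();
    let queued = Arc::new(AtomicUsize::new(0));
    (
        CappedSender { inner: tx, queued: queued.clone() },
        CappedReceiver { inner: rx, queued },
    )
}

impl<T> CappedSender<T> {
    fn send(&self, item: T) -> Result<(), mpsc::SendError<T>> {
        let backlog = self.queued.fetch_add(1, Ordering::Relaxed) + 1;
        if backlog > MAX_QUEUED_EVENTS {
            // In a real node this would be an error log / Prometheus metric:
            // the receiver has most likely stopped polling.
            eprintln!("event channel backlog is {} items; receiver stuck?", backlog);
        }
        self.inner.send(item)
    }
}

impl<T> CappedReceiver<T> {
    fn recv(&self) -> Result<T, mpsc::RecvError> {
        let item = self.inner.recv()?;
        self.queued.fetch_sub(1, Ordering::Relaxed);
        Ok(item)
    }
}

fn main() {
    let (tx, rx) = capped_channel::<u32>();
    tx.send(1).unwrap();
    assert_eq!(rx.recv().unwrap(), 1);
}
```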
Here's why it's leaking. The in-process relay chain node is built with BEEFY unconditionally enabled:

```rust
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    true, // This is `enable_beefy`.
    // ...
)?;
```

so the BEEFY gadget gets started:

```rust
if enable_beefy {
    // ...
    let gadget = beefy_gadget::start_beefy_gadget::<_, _, _, _, _>(beefy_params);
    // ...
    task_manager.spawn_handle().spawn_blocking("beefy-gadget", None, gadget);
}
```

The gadget creates a `GossipEngine`, which subscribes to the network event stream:

```rust
let gossip_engine = sc_network_gossip::GossipEngine::new(
    network,
    protocol_name,
    gossip_validator.clone(),
    None,
);
let network_event_stream = network.event_stream();
```

and it only polls that event stream when the `GossipEngine` itself is polled, which the BEEFY worker only does after waiting for the runtime pallet:

```rust
pub(crate) async fn run(mut self) {
    info!(target: "beefy", "🥩 run BEEFY worker, best grandpa: #{:?}.", self.best_grandpa_block_header.number());
    self.wait_for_runtime_pallet().await;
    // ...
    // <-- The `GossipEngine` is polled here.
```

...so it waits for the BEEFY runtime pallet, but since that pallet doesn't actually exist on this chain it ends up waiting forever, while the events keep piling up in the unbounded channel.
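As an illustration of the general shape of a fix (this is only a sketch of the `select!` pattern, not the actual change in #11694), the worker could keep draining the event stream while it waits for the pallet, so nothing accumulates in the channel behind the stream. All names below are stand-ins, not the real BEEFY worker internals.

```rust
use futures::{future, FutureExt, StreamExt};

/// Stand-in: in the buggy scenario this future never completes, because the
/// BEEFY pallet is not part of the relay chain runtime.
async fn wait_for_runtime_pallet() {
    future::pending::<()>().await
}

/// Sketch of a worker loop that keeps draining the event stream while it
/// waits, so events cannot pile up behind the stream.
async fn run_worker<S>(events: S)
where
    S: futures::Stream<Item = u32> + Unpin,
{
    let mut events = events.fuse();
    let mut pallet_ready = Box::pin(wait_for_runtime_pallet().fuse());

    loop {
        futures::select! {
            event = events.next() => match event {
                Some(_event) => { /* handle (or discard) the network event */ }
                None => break, // stream ended, nothing left to drain
            },
            _ = pallet_ready => break, // pallet detected, start the real work
        }
    }
}

fn main() {
    // Drive the sketch with a finite stream so the example terminates.
    let events = futures::stream::iter(vec![1u32, 2, 3]);
    futures::executor::block_on(run_worker(events));
}
```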
Fix available here: #11694
@skunert please fix this. Beefy should always be disabled (at least for now) for the in-process relay chain node.
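In terms of the Cumulus snippet quoted earlier, the immediate mitigation amounts to not enabling BEEFY when building the in-process relay chain node, i.e. passing `false` for that flag (sketch only; the rest of the argument list is elided just as in the quote above):

```rust
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    false, // `enable_beefy`: keep BEEFY disabled for the in-process relay chain node.
    // ...
)?;
```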
Hey @bkchr - this ended up being a pretty gnarly issue for us (Turing Network on Kusama) when we upgraded to 0.9.23: https://github.com/OAK-Foundation/OAK-blockchain/wiki/2022-07-21:-Node-Memory-Leak-Issue Was there any way we could have been notified of the implications of this version upgrade? Perhaps I'm just missing a notification channel or tag to track.
@irsal sadly there is no good channel for this at the moment.
* Update ethy-gadget with some upstream changes
* If the node is (major) syncing, don't participate in ethy
* Memory leak after node finishes syncing with relay chain (paritytech/substrate#11604)
* Proof threshold is now set by the runtime (before it was hard-coded)
Is there an existing issue?
These are potentially related issues, since they are recent and also involve memory problems, but this is not the same: no transactions were executed, and the memory leak starts after syncing is finished:
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
After the node finishes syncing with Polkadot, the parachain node's memory usage starts growing constantly until it reaches the maximum, and then the relay chain part of the node stops working (only parachain logs are visible, no relay chain logs, and no connection to the relay chain).
The image below shows how memory usage starts growing after syncing finishes; the node consumes around 32 GB of memory in 24 hours:
Server specs:
Additional information:
When starting the node, error logs from an issue on the Cumulus repo are logged, but I guess this is not related to this issue.
Steps to reproduce
Compile the code from this branch.
The chainspec is available here.
But it would be good to get in contact about the details of connecting, because this is a parachain connected to Polkadot.