Memory leak after node finishes syncing with relay chain #11604
Comments
Is this reproducible every time, regardless of the machine configuration, etc.?
Yes, it is happening on all our machines, and they have different CPUs; we tried machines with 8-32 GB of RAM, using either an SSD or an external volume for storage. @riusricardo tried to reproduce it on his laptop, but I'm not sure about the results.
We are experiencing the same issue at Composable, on all machines regardless of configuration. It has only been happening since v0.9.22, though.
I have seen a similar situation after the most recent Substrate upgrade of our chain on one of several nodes on my machine (it consumed over 65 GB of RAM before I killed it; a restart showed that memory was growing again, so I stopped it permanently), but the rest of the nodes on the same machine and on others are somehow fine.
I'm trying to reproduce the issue but so far no luck. If anyone can reproduce this, I'd appreciate it if you could do the following:
Replace the
I recommend doing this on a machine with a lot of memory and a few spare CPU cores, since the profiling also has extra overhead.
Thanks @koute, I started it yesterday and am now waiting for the node to run out of memory. Can you contact me on Element (nzt:matrix.org) so I can provide you with the download link for the files?
@NZT48 Thank you for the data! I have done a preliminary investigation of the problem; there is indeed a memory leak here. The leak is in an unbounded channel for the network service events: the other side of the channel stopped receiving these events (either due to a deadlock, or because it was leaked), so they just keep being pushed into the channel until the system runs out of memory. From what I can see, the problem is the stream returned from here:

```rust
/// Returns a stream containing the events that happen on the network.
///
/// If this method is called multiple times, the events are duplicated.
///
/// The stream never ends (unless the `NetworkWorker` gets shut down).
///
/// The name passed is used to identify the channel in the Prometheus metrics. Note that the
/// parameter is a `&'static str`, and not a `String`, in order to avoid accidentally having
/// an unbounded set of Prometheus metrics, which would be quite bad in terms of memory.
pub fn event_stream(&self, name: &'static str) -> impl Stream<Item = Event> {
    let (tx, rx) = out_events::channel(name);
    let _ = self.to_worker.unbounded_send(ServiceToWorkerMsg::EventStream(tx));
    rx
}
```

In this case over 10 million events were pushed into this channel before the node crashed. We should probably put a hard cap on this channel, or at least spam an error if it gets too big. A high enough cap of, say, 1 million should still make it effectively unbounded in normal cases (based on this data it would have to leak ~2.5 GB of data to reach that many items), but it would make investigating such problems easier in the future. Anyway, I'll continue investigating this.
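Just to make the cap suggestion concrete, below is a minimal, self-contained sketch of the idea: the sender tracks the backlog of unreceived items and complains loudly once it passes a threshold. `CappedSender`, `CappedReceiver`, and `MAX_QUEUED_EVENTS` are hypothetical names used only for illustration; this is not Substrate's actual `out_events` API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc};

/// Hypothetical cap: high enough to stay "effectively unbounded" in normal operation.
const MAX_QUEUED_EVENTS: usize = 1_000_000;

/// Sender half that tracks how many items are queued but not yet received.
struct CappedSender<T> {
    inner: mpsc::Sender<T>,
    queued: Arc<AtomicUsize>,
}

/// Receiver half that decrements the backlog counter as items are drained.
struct CappedReceiver<T> {
    inner: mpsc::Receiver<T>,
    queued: Arc<AtomicUsize>,
}

fn capped_channel<T>() -> (CappedSender<T>, CappedReceiver<T>) {
    let (tx, rx) = mpsc::channel();
    let queued = Arc::new(AtomicUsize::new(0));
    (
        CappedSender { inner: tx, queued: queued.clone() },
        CappedReceiver { inner: rx, queued },
    )
}

impl<T> CappedSender<T> {
    fn send(&self, item: T) -> Result<(), mpsc::SendError<T>> {
        let backlog = self.queued.fetch_add(1, Ordering::Relaxed) + 1;
        if backlog > MAX_QUEUED_EVENTS {
            // In a real node this would be an error log / Prometheus metric:
            // the receiver has most likely stopped polling.
            eprintln!("event channel backlog is {} items; receiver stuck?", backlog);
        }
        self.inner.send(item)
    }
}

impl<T> CappedReceiver<T> {
    fn recv(&self) -> Result<T, mpsc::RecvError> {
        let item = self.inner.recv()?;
        self.queued.fetch_sub(1, Ordering::Relaxed);
        Ok(item)
    }
}

fn main() {
    let (tx, rx) = capped_channel::<u32>();
    tx.send(1).unwrap();
    assert_eq!(rx.recv().unwrap(), 1);
}
```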
Here's why it's leaking. The in-process relay chain node is built with BEEFY unconditionally enabled:

```rust
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    true, // This is `enable_beefy`.
    // ...
)?;
```

so the BEEFY gadget gets started:

```rust
if enable_beefy {
    // ...
    let gadget = beefy_gadget::start_beefy_gadget::<_, _, _, _, _>(beefy_params);
    // ...
    task_manager.spawn_handle().spawn_blocking("beefy-gadget", None, gadget);
}
```

The gadget creates a `GossipEngine`, which subscribes to the network event stream:

```rust
let gossip_engine = sc_network_gossip::GossipEngine::new(
    network,
    protocol_name,
    gossip_validator.clone(),
    None,
);
let network_event_stream = network.event_stream();
```

and it only polls that event stream when the `GossipEngine` itself is polled, which the BEEFY worker only does after waiting for the runtime pallet:

```rust
pub(crate) async fn run(mut self) {
    info!(target: "beefy", "🥩 run BEEFY worker, best grandpa: #{:?}.", self.best_grandpa_block_header.number());
    self.wait_for_runtime_pallet().await;
    // ...
    // <-- The `GossipEngine` is polled here.
```

...so it waits for the BEEFY runtime pallet, but since that pallet doesn't actually exist on this chain it ends up waiting forever, while the events keep piling up in the unbounded channel.
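As an illustration of the general shape of a fix (this is only a sketch of the `select!` pattern, not the actual change in #11694), the worker could keep draining the event stream while it waits for the pallet, so nothing accumulates in the channel behind the stream. All names below are stand-ins, not the real BEEFY worker internals.

```rust
use futures::{future, FutureExt, StreamExt};

/// Stand-in: in the buggy scenario this future never completes, because the
/// BEEFY pallet is not part of the relay chain runtime.
async fn wait_for_runtime_pallet() {
    future::pending::<()>().await
}

/// Sketch of a worker loop that keeps draining the event stream while it
/// waits, so events cannot pile up behind the stream.
async fn run_worker<S>(events: S)
where
    S: futures::Stream<Item = u32> + Unpin,
{
    let mut events = events.fuse();
    let mut pallet_ready = Box::pin(wait_for_runtime_pallet().fuse());

    loop {
        futures::select! {
            event = events.next() => match event {
                Some(_event) => { /* handle (or discard) the network event */ }
                None => break, // stream ended, nothing left to drain
            },
            _ = pallet_ready => break, // pallet detected, start the real work
        }
    }
}

fn main() {
    // Drive the sketch with a finite stream so the example terminates.
    let events = futures::stream::iter(vec![1u32, 2, 3]);
    futures::executor::block_on(run_worker(events));
}
```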
Fix available here: #11694
@skunert please fix this. Beefy should always be disabled (at least for now) for the in-process relay chain node.
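In terms of the Cumulus snippet quoted earlier, the immediate mitigation amounts to not enabling BEEFY when building the in-process relay chain node, i.e. passing `false` for that flag (sketch only; the rest of the argument list is elided just as in the quote above):

```rust
let relay_chain_full_node = polkadot_service::build_full(
    // ...
    false, // `enable_beefy`: keep BEEFY disabled for the in-process relay chain node.
    // ...
)?;
```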
Hey @bkchr - this ended up being a pretty gnarly issue for us (Turing Network on Kusama) when we upgraded to 0.9.23: https://github.com/OAK-Foundation/OAK-blockchain/wiki/2022-07-21:-Node-Memory-Leak-Issue Was there any way we could have been notified of the implications of this version upgrade? Perhaps I'm just missing a notification channel or tag to track.
@irsal sadly there is no good channel for this at the moment.
* Update ethy-gadget with some upstream changes
* If the node is (major) syncing, don't participate in ethy
* Memory leak after node finishes syncing with relay chain (paritytech/substrate#11604)
* Proof threshold is now set by the runtime (before it was hard-coded)
Is there an existing issue?
These are potentially related issues, since they are recent and also involve memory problems, but this is not the same: no transactions were executed, and the memory leak starts after syncing is finished:
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
After the node finishes syncing with Polkadot, the parachain node's memory usage starts growing constantly until it reaches the maximum, and then the relay chain part of the node stops working (only parachain logs are visible, no relay chain logs, and no connection to the relay chain).
The image below shows how memory usage starts growing after syncing finishes; the node consumes around 32 GB of memory in 24 hours:
Server specs:
Additional information:
When starting the node, error logs from an issue on the Cumulus repo are logged, but I guess this is not related to this issue.
Steps to reproduce
Compile the code from this branch.
The chainspec is available here.
But it would be good to get in contact about the details of connecting, because this is a parachain connected to Polkadot.