
Root Cause Analysis Template for Incidents


[RCA] 2022-07-21: Node Memory Leak Issue

Summary

The OAK team observed significant RAM growth on our nodes (collators and boot nodes). We believe the cause is the upgrade to Polkadot client 0.9.23, specifically https://github.com/paritytech/substrate/issues/11604.

Timeline

(All times PDT, UTC-7)
0900: Begin upgrading nodes from the 1.4.0 binary to 1.5.0
1015: Runtime upgrade begins
1118: Runtime upgrade takes effect on the Turing Network
1120: Node upgrade from the 1.4.0 binary to 1.5.0 complete
1236: First alert regarding RAM
...alerts continue for other nodes
1400: OAK devs huddle begins; the issue is flagged as P0 (highest-priority bug)
1420: OAK devs start rolling nodes back to 1.4.0 after discussing several mitigation options
1426: Collators are informed to keep the 1.4.0 node version/binary
1445: Rollback of OAK nodes to 1.4.0 complete

RCA

We did not observe the same issue in Turing Staging (which runs against Rococo). We noticed that Rococo was on polkadot client version 0.9.26, while Turing Staging and Turing had both recently upgraded to polkadot client 0.9.23. We looked into the delta between the two versions and found this issue: https://github.com/paritytech/substrate/issues/11604

As described in that issue: after the node finishes syncing with Polkadot, the parachain node's memory usage grows steadily until it reaches the maximum, at which point the relay chain part of the node stops working (only parachain logs are visible, with no relay chain logs and no connection to the relay chain).
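This failure mode is detectable from logs alone: a Cumulus-based parachain node normally interleaves [Relaychain]- and [Parachain]-prefixed lines, so a long silence from [Relaychain] lines signals the stall. A minimal detection sketch, assuming the node runs as a hypothetical turing-node systemd service logging to journald:

```python
import subprocess
import sys

# Hypothetical systemd unit name for the node; adjust for your deployment.
SERVICE = "turing-node"

# Pull the last 10 minutes of logs from journald.
logs = subprocess.run(
    ["journalctl", "-u", SERVICE, "--since", "-10m", "--no-pager"],
    capture_output=True, text=True, check=True,
).stdout

# Cumulus nodes prefix embedded relay chain log lines with "[Relaychain]".
relay_lines = [line for line in logs.splitlines() if "[Relaychain]" in line]

if not relay_lines:
    # No relay chain output for 10 minutes: matches the reported failure mode
    # (parachain logs continue, relay chain side goes silent).
    print("WARN: no [Relaychain] log lines in the last 10 minutes")
    sys.exit(1)

print(f"OK: {len(relay_lines)} [Relaychain] lines in the last 10 minutes")
```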

The symptom matches what we observed in our node dashboards, and the metrics align with the timeline above: RAM stops growing and flatlines after the downgrade to 1.4.0.

[Screenshot: node RAM dashboard, 2022-07-21 3:33 PM]
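Substrate-based nodes expose a Prometheus metrics endpoint (port 9615 by default), which is presumably what feeds these dashboards. For a quick spot-check outside the dashboard, something like the following works, assuming the standard process_resident_memory_bytes gauge is exposed (metric availability can vary by node build):

```python
import urllib.request

# Default Substrate Prometheus port; adjust host/port for your node.
METRICS_URL = "http://127.0.0.1:9615/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
    body = resp.read().decode("utf-8")

# Look for the standard process RSS gauge. Metric availability depends on
# the node build, so treat a miss as "unknown" rather than an error.
for line in body.splitlines():
    if line.startswith("process_resident_memory_bytes"):
        value = float(line.split()[-1])
        print(f"Resident memory: {value / 1024**3:.2f} GiB")
        break
else:
    print("process_resident_memory_bytes not found in metrics output")
```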

We also noticed that the Cumulus release for 0.9.24 disabled BEEFY: https://github.com/paritytech/cumulus/pull/1382/files

Possible Solutions

  1. Roll nodes back to 1.4.0 and observe whether the memory leak persists.
  2. Upgrade OAK-blockchain to 0.9.26. The problem with this is our dependency on Moonbeam's parachain-staking pallet; specifically, we are waiting on this PR: https://github.com/PureStake/moonbeam/pull/1699
  3. Cherry-pick the changes related to the BEEFY disable. We found the dependency graph too vast for this to be practical.

Short-term Mitigation

Roll nodes back to 1.4.0; collators must remain on 1.4.0. This appears to mitigate the leak, and we are not seeing significant RAM growth, but the OAK team will continue to monitor.
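Since collators are run by third parties, it helps to verify the rollback rather than assume it took effect. The sketch below polls each node's JSON-RPC endpoint with the standard system_version method and compares the reported version string; the endpoint URLs are placeholders:

```python
import json
import urllib.request

# Placeholder RPC endpoints; substitute your collator/boot node URLs.
NODES = [
    "http://collator-1.example.com:9933",
    "http://collator-2.example.com:9933",
]
EXPECTED = "1.4.0"

for url in NODES:
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1, "method": "system_version", "params": []
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        version = json.load(resp)["result"]
    # Substrate version strings usually embed the crate version, e.g. "1.4.0-<hash>".
    status = "OK" if version.startswith(EXPECTED) else "MISMATCH"
    print(f"{status}: {url} reports {version}")
```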

Long-term Mitigation

  1. Kusama's upgrade to the 0.9.26 client might mitigate this issue: https://kusama.polkassembly.io/referendum/216. Per governance, this will happen on Tuesday, July 26. Once the proposal has passed, we will test the 1.5.0 Turing client to evaluate whether the memory leak is still an issue. It is unclear whether this will fix the problem; however, the issue was not seen on Rococo, which is already on the 0.9.26 client.

  2. We will upgrade the Turing client to use polkadot-v0.9.26. This is harder to expedite given dependencies on other projects, specifically Moonbeam's parachain-staking.

  3. Figure out ways to decouple dependencies so we can ship hotfixes for issues like this.

  4. Figure out how to test and deploy while observing for issues in Turing Staging first (see the soak-test sketch after this list).
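To make that staging observation repeatable, a soak check can gate promotion: after the node finishes syncing, sample its memory gauge for several hours and fail if usage keeps climbing, since the leak signature here was steady post-sync growth. A rough sketch, assuming the same Prometheus endpoint and metric name as above (thresholds are illustrative):

```python
import time
import urllib.request

METRICS_URL = "http://127.0.0.1:9615/metrics"  # default Substrate metrics port
SAMPLES = 12          # number of samples to take
INTERVAL_S = 30 * 60  # 30 minutes between samples
MAX_GROWTH = 1.10     # fail if memory grows >10% over the soak window

def read_rss_bytes() -> float:
    """Read the process RSS gauge from the node's Prometheus endpoint."""
    with urllib.request.urlopen(METRICS_URL, timeout=10) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            if line.startswith("process_resident_memory_bytes"):
                return float(line.split()[-1])
    raise RuntimeError("memory gauge not found in metrics output")

readings = []
for i in range(SAMPLES):
    if i:
        time.sleep(INTERVAL_S)
    readings.append(read_rss_bytes())
    print(f"sample {len(readings)}: {readings[-1] / 1024**3:.2f} GiB")

# A leak shows up as steady growth; compare the last sample to the first.
if readings[-1] > readings[0] * MAX_GROWTH:
    raise SystemExit("FAIL: resident memory grew steadily during the soak window")
print("PASS: no sustained memory growth observed")
```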
