[Execution State] Avoid creating separate MTrie state during checkpoint creation for about -200GB peak RAM use and -32 minutes duration #2286

Closed
2 tasks done
fxamacker opened this issue Apr 9, 2022 · 2 comments · Fixed by #2792
Assignees
Labels
Execution Cadence Execution Team Performance

Comments

@fxamacker
Member

fxamacker commented Apr 9, 2022

EDIT: When deployed on August 24, 2022, the PR reduced peak RAM use by over 200GB (out of over 300GB total reduction). The initial estimate of -150GB was based on an older checkpoint file; by August the checkpoint file had grown substantially, so the memory savings were larger. Checkpoint duration is about 16 minutes today (Sep 7). It was 46-58 minutes in mid-August and 11-17 hours in Dec 2021, depending on system load.

Problem

A recent increase in transactions is causing WAL files to be created more frequently, which makes checkpoints happen more frequently. It is also increasing the checkpoint file size and the size of the ledger state in memory. As a result, checkpointing consumes too much RAM and takes more than 2x longer than earlier this year.

Date           File Size   Checkpoint Frequency
Early 2022     53 GB       0-2 times per day
July 8, 2022   126 GB      every 2 hours

Without PR #1944, checkpointing would currently be:

  • taking well over 20-30 hours each time, making it impossible to complete every 2 hours
  • requiring more operational RAM, making OOM crashes very frequent
  • creating billions more allocations and more GC pressure, consuming CPU cycles and slowing down the EN
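
As a rough check of the figures above, the arithmetic can be sketched as follows. The variable names are illustrative only, and the 20-hour value is simply the lower bound of the "well over 20-30 hours" estimate.

```python
# Figures quoted in this issue; the variable names are illustrative only.
early_2022_size_gb = 53
july_2022_size_gb = 126
checkpoint_interval_hours = 2        # "every 2 hours" as of July 8, 2022
duration_without_pr_hours = 20       # assumed lower bound of "well over 20-30 hours"

# Growth of the checkpoint file between the two measurements.
growth_factor = july_2022_size_gb / early_2022_size_gb

# Number of 2-hour checkpoint intervals that elapse during one such run.
intervals_elapsed = duration_without_pr_hours // checkpoint_interval_hours

print(f"checkpoint file grew {growth_factor:.2f}x")
print(f"checkpoint intervals elapsed during one run: {intervals_elapsed}")
```

Even at the lower bound, one checkpoint run would span many checkpoint intervals, which is why finishing every 2 hours would not be possible.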

PR #1944 reduced the MTrie flattening and serialization phase to under 5 minutes (it sometimes took 17 hours on mainnet16). As a result, creating a separate MTrie state now accounts for most of the duration and memory used by checkpointing. This opens up new possibilities, such as reusing the ledger state to significantly reduce the duration and operational RAM of checkpointing again.

Updates epic #1744

The Proposed Solution

We can avoid creating a separate MTrie state during checkpoint creation. This can reduce peak RAM use by very roughly 150GB and checkpoint duration by about 24 minutes (estimates based on a snapshot from July 8, 2022). Memory savings will increase over time.

Determine if it's feasible to avoid creating a separate MTrie state during checkpoint creation. If the proof-of-concept doesn't reveal showstoppers, then proceed with a new PR.
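
To illustrate why reusing the ledger's tries saves memory, here is a minimal sketch in Python. It is not flow-go code: `Node`, `build`, `copy_trie`, and `unique_nodes` are hypothetical names, and copying a trie node by node is only a stand-in for whatever currently builds the separate MTrie state.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Node:
    """Immutable toy trie node (hypothetical, not the flow-go node type)."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(depth: int, value: int = 0) -> Node:
    """Build a small full binary trie of the given depth."""
    if depth == 0:
        return Node(value)
    return Node(value, build(depth - 1, 2 * value + 1), build(depth - 1, 2 * value + 2))


def copy_trie(node: Optional[Node]) -> Optional[Node]:
    """Stand-in for building a separate state: every node is allocated again."""
    if node is None:
        return None
    return Node(node.value, copy_trie(node.left), copy_trie(node.right))


def unique_nodes(roots: Iterable[Optional[Node]]) -> int:
    """Count distinct node objects reachable from the given roots."""
    seen: Set[int] = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node is None or id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend((node.left, node.right))
    return len(seen)


ledger_tries = [build(3)]                              # tries already held in memory
separate_state = [copy_trie(t) for t in ledger_tries]  # second copy for checkpointing
reused_state = list(ledger_tries)                      # references only, no new nodes

print(unique_nodes(ledger_tries + separate_state))  # nodes alive with a separate state
print(unique_nodes(ledger_tries + reused_state))    # nodes alive when reusing the ledger's tries
```

In this toy model the separate state doubles the number of live nodes, while the reused state adds none. The real savings depend on how the tries are actually stored and shared.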

@fxamacker fxamacker self-assigned this Apr 9, 2022
@m4ksio
Contributor

m4ksio commented Apr 12, 2022

Currently, WAL/checkpoints are disconnected from the mtrie progression: a checkpoint is essentially the state after a given complete WAL segment, and segment creation is an implementation detail of mForest and WAL.
If, however, we were able to signal the moment a new WAL segment is created, we could stop evicting mtries (or keep a separate index of them) and use that to create the new checkpoint without reallocating memory.
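
The idea can be sketched roughly as follows. This is a toy model in Python, not the flow-go API: `Trie`, `Forest`, `Checkpointer`, `finish_segment`, and the callback are all hypothetical names, and a bounded deque stands in for mForest's eviction.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class Trie:
    """Toy stand-in for an in-memory mtrie, identified by its root hash."""

    def __init__(self, root_hash: str) -> None:
        self.root_hash = root_hash


class Checkpointer:
    """Remembers the tries handed over at the latest segment boundary."""

    def __init__(self) -> None:
        self.pending: Optional[Tuple[int, Tuple[Trie, ...]]] = None

    def on_segment_finished(self, segment_num: int, tries: Tuple[Trie, ...]) -> None:
        # Holding these references keeps the tries alive after the forest evicts them.
        self.pending = (segment_num, tries)


class Forest:
    """Toy forest that keeps only the most recent `capacity` tries."""

    def __init__(self, capacity: int,
                 notify: Callable[[int, Tuple[Trie, ...]], None]) -> None:
        self.tries: Deque[Trie] = deque(maxlen=capacity)  # oldest trie is evicted first
        self._notify = notify

    def add_trie(self, trie: Trie) -> None:
        self.tries.append(trie)

    def finish_segment(self, segment_num: int) -> None:
        # Signal the segment boundary and pass references (not copies) to the tries.
        self._notify(segment_num, tuple(self.tries))


checkpointer = Checkpointer()
forest = Forest(capacity=2, notify=checkpointer.on_segment_finished)

t1, t2, t3 = Trie("a1"), Trie("b2"), Trie("c3")
forest.add_trie(t1)
forest.add_trie(t2)
forest.finish_segment(7)  # boundary: the checkpointer now references t1 and t2
forest.add_trie(t3)       # t1 leaves the forest but stays reachable for the checkpoint

segment_num, pinned = checkpointer.pending
print(segment_num, [t.root_hash for t in pinned])
```

The checkpointer can then serialize the pinned tries for that segment without allocating a second copy of the state, even though the forest has already moved on.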

@fxamacker fxamacker added the Execution Cadence Execution Team label Jun 16, 2022
@fxamacker fxamacker changed the title [Execution State] Determine if it's feasible to avoid creating separate MTrie state during checkpoint creation [Execution State] Avoid creating separate MTrie state during checkpoint creation to reduce peak RAM use by 152GB and checkpoint duration by 24 minutes Jul 12, 2022
@fxamacker fxamacker changed the title [Execution State] Avoid creating separate MTrie state during checkpoint creation to reduce peak RAM use by 152GB and checkpoint duration by 24 minutes [Execution State] Avoid creating separate MTrie state during checkpoint creation to reduce peak RAM use by ~150GB and checkpoint duration by 24 minutes Aug 12, 2022
@fxamacker
Member Author

Closed by #2792

@fxamacker fxamacker changed the title [Execution State] Avoid creating separate MTrie state during checkpoint creation to reduce peak RAM use by ~150GB and checkpoint duration by 24 minutes [Execution State] Avoid creating separate MTrie state during checkpoint creation for about -330GB peak RAM use and -32 minutes duration Sep 8, 2022
@fxamacker fxamacker changed the title [Execution State] Avoid creating separate MTrie state during checkpoint creation for about -330GB peak RAM use and -32 minutes duration [Execution State] Avoid creating separate MTrie state during checkpoint creation for about -200GB peak RAM use and -32 minutes duration Sep 8, 2022