
Regen refactor > safer resilient strategy #4005

Open
dapplion opened this issue May 10, 2022 · 4 comments
Labels
prio-medium: Resolve this some time soon (tm).
scope-none: Issues that do not fit within any of the other defined scopes.
scope-security: Issues that fix security issues: DOS, key leak, CVEs.

Comments

@dapplion (Contributor) commented May 10, 2022

Background

Consensus clients need to cache some states to fully participate in the network. States are very heavy, so you can't cache every state you may need, and writing every possible state to disk is not practical either. So what do you do?

  • Keep in memory the few states that you need the most
  • Regenerate from memory or disk (i.e. re-process blocks) to access states that are probabilistically less useful.

So the stateCache and checkpointStateCache handle the first point: deciding which states to keep in memory. The regen module handles the second: providing the ability to regenerate any state within some boundary.
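
For concreteness, here's a minimal TypeScript sketch of those two layers. All names here (`CachedState`, `StateCaches`, `getStateByRoot`) are illustrative assumptions, not Lodestar's actual API:

```ts
type RootHex = string;

// Illustrative stand-in for a full cached beacon state.
interface CachedState {
  root: RootHex;
  slot: number;
  // ...full beacon state data omitted for brevity
}

/** Layer 1: bounded in-memory caches decide which states stay hot. */
interface StateCaches {
  stateCache: Map<RootHex, CachedState>;
  checkpointStateCache: Map<RootHex, CachedState>;
}

/** Layer 2: regen can rebuild any state within some boundary. */
interface IRegen {
  /** Return the state from cache, or re-process blocks to rebuild it. */
  getStateByRoot(stateRoot: RootHex): Promise<CachedState>;
}
```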

Current strategy

  • Keep in memory the most recent 96 states regardless of forks
  • Keep in memory checkpoint states for the most recent ~4 epochs
  • Keep in memory the latest finalized state + the head state
  • Write to disk the latest finalized state, delete the previous finalized state (a sketch of this policy follows below)
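
A rough TypeScript sketch of that retention policy, under assumed names and a count-based prune; the checkpoint-state cache would be pruned analogously, by epoch instead of slot:

```ts
const MAX_STATES = 96; // most recent states kept, regardless of forks

interface StateSummary {
  slot: number;
  isHead: boolean;
  isFinalized: boolean;
}

function pruneToPolicy(states: StateSummary[]): StateSummary[] {
  // Keep the 96 most recent states regardless of fork...
  const bySlotDesc = [...states].sort((a, b) => b.slot - a.slot);
  const kept = new Set<StateSummary>(bySlotDesc.slice(0, MAX_STATES));
  // ...and always pin the head state and the latest finalized state.
  for (const s of states) {
    if (s.isHead || s.isFinalized) kept.add(s);
  }
  return Array.from(kept);
}
```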

This approach works well under good network conditions. Thanks to tree structural sharing, the cost of those 96 states in a linear chain is a very low multiple of the cost of a single state (~1.2x).

However, during attacks, bugs, or a highly forked network, our node quickly runs out of memory or can become unable to follow the chain.

  • If those 96 states differ significantly from each other, structural sharing is not useful, so the total memory could approach 96x the cost of a single state.
  • The same applies to checkpoint states; see a past example of this causing fast OOMs: Fast OOM when syncing close to head #3171
  • Structural sharing is only useful for states that are close to each other. In long periods of non-finality we would regen from the latest finalized state, which could be hours old, potentially DOS-ing ourselves.

Relevant issues:

Improvement goals

So, we can do better. Specifically:

  1. Don't let the state cache cause an OOM if states are too expensive
  2. Limit the max regen cost in all cases = reduce DOS risk
  3. Make regen as cheap as possible by regenerating from both in-memory states and disk states

Proposed strategies

1. Regen from memory and disk

On every checkpoint, write the state to a "hot state db" bucket on disk. On finalization, move some of those states to a "cold state db" or "archive db" bucket. Then, on regen, pick a starting state depending on the distance to the closest available state in memory, if any. This would remove the need to keep the finalized state in memory.
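
A minimal sketch of this strategy, assuming hypothetical names (`memCache`, `hotStateDb`) and a simple distance heuristic:

```ts
interface StateSource {
  slot: number;
  load(): Promise<Uint8Array>; // serialized state bytes to deserialize
}

// Find the latest state at or before the target slot in a set of sources.
function closestAtOrBefore(sources: StateSource[], slot: number): StateSource | null {
  let best: StateSource | null = null;
  for (const s of sources) {
    if (s.slot <= slot && (best === null || s.slot > best.slot)) best = s;
  }
  return best;
}

// Pick the best starting point for regen: prefer a close in-memory state,
// otherwise fall back to a per-checkpoint state persisted in the hot state db.
function findRegenStartingPoint(
  targetSlot: number,
  memCache: StateSource[],
  hotStateDb: StateSource[],
  maxMemDistance: number
): StateSource | null {
  const closestMem = closestAtOrBefore(memCache, targetSlot);
  // Use the in-memory state if it's within the allowed replay distance.
  if (closestMem !== null && targetSlot - closestMem.slot <= maxMemDistance) {
    return closestMem;
  }
  // Otherwise prefer the disk state; it may be closer than anything in memory.
  return closestAtOrBefore(hotStateDb, targetSlot) ?? closestMem;
}
```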

2. Bound regen depending on consumer

Depending on the caller, restrict the work triggered by regen.
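
For example, a sketch in which each regen consumer declares a replay budget; the callers and limits below are assumptions for illustration, not Lodestar's actual values:

```ts
enum RegenCaller {
  GossipBlock = "gossipBlock",
  ValidateGossipAttestation = "validateGossipAttestation",
  ProduceBlock = "produceBlock",
}

// Max number of blocks regen may re-process on behalf of each caller.
const MAX_REGEN_BLOCKS: Record<RegenCaller, number> = {
  [RegenCaller.GossipBlock]: 32,
  [RegenCaller.ValidateGossipAttestation]: 8,
  [RegenCaller.ProduceBlock]: 64,
};

// Reject regen requests whose replay cost exceeds the caller's budget.
function assertRegenAllowed(caller: RegenCaller, blocksToReplay: number): void {
  if (blocksToReplay > MAX_REGEN_BLOCKS[caller]) {
    throw new Error(`regen too expensive for ${caller}: ${blocksToReplay} blocks`);
  }
}
```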

3. WeakRef state cache

Allow the GC to drop states when memory is low. Cache only 3 states behind the current head. Do it behind a flag so modes can be extended for the light client.
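
A minimal sketch using ES2021 `WeakRef`: keep strong references only to the 3 most recent states, so the GC may collect older ones under memory pressure. The class and method names are hypothetical:

```ts
class WeakStateCache<T extends object> {
  private strong: T[] = []; // last N states, protected from GC
  private weak = new Map<string, WeakRef<T>>(); // everything else, collectable

  constructor(private readonly maxStrong = 3) {}

  set(key: string, state: T): void {
    this.weak.set(key, new WeakRef(state));
    this.strong.push(state);
    // Demote the oldest state to weak-only once over the strong limit.
    if (this.strong.length > this.maxStrong) this.strong.shift();
  }

  get(key: string): T | undefined {
    const state = this.weak.get(key)?.deref();
    if (state === undefined) this.weak.delete(key); // collected: drop stale ref
    return state;
  }
}
```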

TBD

Closes #3099

@dapplion added the scope-security label May 10, 2022
@wemeetagain (Member) commented
Regarding loading a state from db, iirc this is expensive, like 6+ seconds. Might be good to benchmark this and get this lower.

@dapplion (Contributor, author) commented May 18, 2022

> Regarding loading a state from db, iirc this is expensive, like 6+ seconds. Might be good to benchmark this and get this lower.

In terms of time to result, the tradeoff math is roughly:

  • Load from disk = deserialize (600ms) + process a few blocks (10ms x block) + hashTreeRoot (8000ms)
  • Advance old state = process many blocks (10ms x block) + process many epoch transitions (600ms x epoch) + hashTreeRoot (?? ms)

Also keep in mind that if you advance an old state significantly, the cost of the final hashTreeRoot can be very high, as the whole state is different.

However, there's a memory limit on the number and fork-ness of states you can keep in memory. In bad network conditions you must drop states to prevent OOMs, so regen from disk must always be available.
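
As a back-of-the-envelope illustration of that tradeoff, using the rough figures quoted above; the final hashTreeRoot cost of the advance path is left as an input since it varies widely:

```ts
// Rough figures from this comment, in milliseconds.
const DESERIALIZE_MS = 600;
const BLOCK_MS = 10;
const EPOCH_TRANSITION_MS = 600;
const DISK_HASH_TREE_ROOT_MS = 8000;

// Cost of loading a nearby state from disk and replaying a few blocks.
function loadFromDiskMs(blocksToReplay: number): number {
  return DESERIALIZE_MS + blocksToReplay * BLOCK_MS + DISK_HASH_TREE_ROOT_MS;
}

// Cost of advancing an old in-memory state across blocks and epoch boundaries.
function advanceOldStateMs(blocks: number, epochs: number, hashTreeRootMs: number): number {
  return blocks * BLOCK_MS + epochs * EPOCH_TRANSITION_MS + hashTreeRootMs;
}
```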

stale bot commented Sep 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the meta-stale label Sep 21, 2022
@philknows removed the meta-stale label Sep 23, 2022
@dapplion added the prio-medium label Sep 29, 2022
@philknows (Member) commented

#6008 should resolve this issue
