Clone states only when necessary #4279

Merged
merged 4 commits into from
Jul 18, 2022

@dapplion dapplion commented Jul 11, 2022

Motivation

SSZ v2 introduced a new caching strategy:

  • The cache is not structurally shared; it exists in only a single instance
  • The cache is transferred "forward" on each clone
  • The state that is the source of the clone "loses" its cache

Description

To avoid losing the cache to read-only clones, clone the state only when it is about to be mutated by a state transition run.
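
A minimal sketch of that rule, assuming a toy state type: `ToyState`, `getSlot`, and `advanceSlots` are hypothetical names for illustration, not Lodestar's actual types or functions. Read-only consumers take the state as-is, and the one mutating entry point clones before it writes.

```typescript
// Hypothetical, simplified state; not Lodestar's CachedBeaconStateAllForks.
class ToyState {
  constructor(public slot: number) {}

  clone(): ToyState {
    return new ToyState(this.slot);
  }
}

/** Read-only consumer: uses the state directly, so no clone is needed. */
function getSlot(state: Readonly<ToyState>): number {
  return state.slot;
}

/** Mutating entry point: the only place in this sketch that clones. */
function advanceSlots(state: ToyState, slots: number): ToyState {
  const postState = state.clone(); // clone before mutating
  postState.slot += slots;
  return postState;
}

const head = new ToyState(100);
const slotBefore = getSlot(head); // read without cloning
const next = advanceSlots(head, 3); // head itself is left untouched
```

With this shape, the caller's state is never mutated in place, and the number of clones equals the number of state transition runs rather than the number of readers.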

@wemeetagain @tuyennhv This PR can have serious implications, please review in depth and think of corner cases. This should be tested on nodes too

  • Deployed to contabo-5,contabo-18,contabo-19 Jul 11 2022 19:54:58 GMT+0000

@dapplion dapplion requested a review from a team as a code owner July 11, 2022 19:51
@@ -27,6 +27,7 @@ export function stateTransition(
const block = signedBlock.message;
const blockSlot = block.slot;

// .clone() before mutating state in state transition
let postState = state.clone();
Contributor Author

Here and line 85 are the only two instances where state is cloned


github-actions bot commented Jul 11, 2022

Performance Report

✔️ no performance regression detected

Full benchmark results
| Benchmark suite | Current: 1d5d8c7 | Previous: 6ed5ae0 | Ratio |
|---|---|---|---|
| getPubkeys - index2pubkey - req 1000 vs - 250000 vc | 2.7176 ms/op | 1.9148 ms/op | 1.42 |
| getPubkeys - validatorsArr - req 1000 vs - 250000 vc | 89.200 us/op | 51.946 us/op | 1.72 |
| BLS verify - blst-native | 2.6330 ms/op | 1.5646 ms/op | 1.68 |
| BLS verifyMultipleSignatures 3 - blst-native | 5.1175 ms/op | 3.1981 ms/op | 1.60 |
| BLS verifyMultipleSignatures 8 - blst-native | 10.762 ms/op | 6.9368 ms/op | 1.55 |
| BLS verifyMultipleSignatures 32 - blst-native | 38.100 ms/op | 25.140 ms/op | 1.52 |
| BLS aggregatePubkeys 32 - blst-native | 51.322 us/op | 33.451 us/op | 1.53 |
| BLS aggregatePubkeys 128 - blst-native | 195.32 us/op | 132.09 us/op | 1.48 |
| getAttestationsForBlock | 58.412 ms/op | 34.585 ms/op | 1.69 |
| isKnown best case - 1 super set check | 537.00 ns/op | 353.00 ns/op | 1.52 |
| isKnown normal case - 2 super set checks | 549.00 ns/op | 348.00 ns/op | 1.58 |
| isKnown worse case - 16 super set checks | 523.00 ns/op | 345.00 ns/op | 1.52 |
| CheckpointStateCache - add get delete | 12.171 us/op | 8.9240 us/op | 1.36 |
| validate gossip signedAggregateAndProof - struct | 5.3902 ms/op | 3.6850 ms/op | 1.46 |
| validate gossip attestation - struct | 2.5149 ms/op | 1.6983 ms/op | 1.48 |
| altair verifyImport mainnet_s3766816:31 | 7.9247 s/op | 5.2897 s/op | 1.50 |
| pickEth1Vote - no votes | 2.6706 ms/op | 1.7027 ms/op | 1.57 |
| pickEth1Vote - max votes | 31.148 ms/op | 23.789 ms/op | 1.31 |
| pickEth1Vote - Eth1Data hashTreeRoot value x2048 | 15.129 ms/op | 11.431 ms/op | 1.32 |
| pickEth1Vote - Eth1Data hashTreeRoot tree x2048 | 26.744 ms/op | 16.421 ms/op | 1.63 |
| pickEth1Vote - Eth1Data fastSerialize value x2048 | 1.9331 ms/op | 1.0732 ms/op | 1.80 |
| pickEth1Vote - Eth1Data fastSerialize tree x2048 | 23.142 ms/op | 10.161 ms/op | 2.28 |
| bytes32 toHexString | 1.3410 us/op | 700.00 ns/op | 1.92 |
| bytes32 Buffer.toString(hex) | 849.00 ns/op | 541.00 ns/op | 1.57 |
| bytes32 Buffer.toString(hex) from Uint8Array | 1.0660 us/op | 734.00 ns/op | 1.45 |
| bytes32 Buffer.toString(hex) + 0x | 845.00 ns/op | 543.00 ns/op | 1.56 |
| Object access 1 prop | 0.44300 ns/op | 0.26900 ns/op | 1.65 |
| Map access 1 prop | 0.34900 ns/op | 0.22300 ns/op | 1.57 |
| Object get x1000 | 16.960 ns/op | 8.9900 ns/op | 1.89 |
| Map get x1000 | 1.0260 ns/op | 0.70100 ns/op | 1.46 |
| Object set x1000 | 122.28 ns/op | 51.398 ns/op | 2.38 |
| Map set x1000 | 84.749 ns/op | 35.713 ns/op | 2.37 |
| Return object 10000 times | 0.42910 ns/op | 0.31410 ns/op | 1.37 |
| Throw Error 10000 times | 7.0245 us/op | 4.2823 us/op | 1.64 |
| enrSubnets - fastDeserialize 64 bits | 3.3890 us/op | 2.0570 us/op | 1.65 |
| enrSubnets - ssz BitVector 64 bits | 994.00 ns/op | 645.00 ns/op | 1.54 |
| enrSubnets - fastDeserialize 4 bits | 470.00 ns/op | 298.00 ns/op | 1.58 |
| enrSubnets - ssz BitVector 4 bits | 937.00 ns/op | 618.00 ns/op | 1.52 |
| prioritizePeers score -10:0 att 32-0.1 sync 2-0 | 118.59 us/op | 60.756 us/op | 1.95 |
| prioritizePeers score 0:0 att 32-0.25 sync 2-0.25 | 182.82 us/op | 83.760 us/op | 2.18 |
| prioritizePeers score 0:0 att 32-0.5 sync 2-0.5 | 320.26 us/op | 152.97 us/op | 2.09 |
| prioritizePeers score 0:0 att 64-0.75 sync 4-0.75 | 539.23 us/op | 295.19 us/op | 1.83 |
| prioritizePeers score 0:0 att 64-1 sync 4-1 | 565.96 us/op | 298.00 us/op | 1.90 |
| RateTracker 1000000 limit, 1 obj count per request | 210.15 ns/op | 133.06 ns/op | 1.58 |
| RateTracker 1000000 limit, 2 obj count per request | 157.04 ns/op | 97.998 ns/op | 1.60 |
| RateTracker 1000000 limit, 4 obj count per request | 135.22 ns/op | 81.323 ns/op | 1.66 |
| RateTracker 1000000 limit, 8 obj count per request | 118.82 ns/op | 74.347 ns/op | 1.60 |
| RateTracker with prune | 5.8660 us/op | 3.3450 us/op | 1.75 |
| array of 16000 items push then shift | 5.3823 us/op | 36.990 us/op | 0.15 |
| LinkedList of 16000 items push then shift | 32.707 ns/op | 13.926 ns/op | 2.35 |
| array of 16000 items push then pop | 274.49 ns/op | 149.80 ns/op | 1.83 |
| LinkedList of 16000 items push then pop | 25.995 ns/op | 11.731 ns/op | 2.22 |
| array of 24000 items push then shift | 8.3352 us/op | 55.465 us/op | 0.15 |
| LinkedList of 24000 items push then shift | 32.267 ns/op | 18.465 ns/op | 1.75 |
| array of 24000 items push then pop | 230.05 ns/op | 127.31 ns/op | 1.81 |
| LinkedList of 24000 items push then pop | 25.233 ns/op | 13.103 ns/op | 1.93 |
| intersect bitArray bitLen 8 | 12.420 ns/op | 7.7860 ns/op | 1.60 |
| intersect array and set length 8 | 221.01 ns/op | 110.81 ns/op | 1.99 |
| intersect bitArray bitLen 128 | 75.807 ns/op | 41.275 ns/op | 1.84 |
| intersect array and set length 128 | 2.6709 us/op | 1.3985 us/op | 1.91 |
| pass gossip attestations to forkchoice per slot | 6.0847 ms/op | 2.3413 ms/op | 2.60 |
| computeDeltas | 4.8175 ms/op | 2.8656 ms/op | 1.68 |
| computeProposerBoostScoreFromBalances | 913.41 us/op | 578.28 us/op | 1.58 |
| altair processAttestation - 250000 vs - 7PWei normalcase | 5.5641 ms/op | 3.0191 ms/op | 1.84 |
| altair processAttestation - 250000 vs - 7PWei worstcase | 7.9829 ms/op | 4.4972 ms/op | 1.78 |
| altair processAttestation - setStatus - 1/6 committees join | 275.41 us/op | 134.28 us/op | 2.05 |
| altair processAttestation - setStatus - 1/3 committees join | 528.96 us/op | 257.61 us/op | 2.05 |
| altair processAttestation - setStatus - 1/2 committees join | 737.49 us/op | 363.55 us/op | 2.03 |
| altair processAttestation - setStatus - 2/3 committees join | 992.43 us/op | 473.82 us/op | 2.09 |
| altair processAttestation - setStatus - 4/5 committees join | 1.3245 ms/op | 657.92 us/op | 2.01 |
| altair processAttestation - setStatus - 100% committees join | 1.6533 ms/op | 788.15 us/op | 2.10 |
| altair processBlock - 250000 vs - 7PWei normalcase | 33.169 ms/op | 22.096 ms/op | 1.50 |
| altair processBlock - 250000 vs - 7PWei normalcase hashState | 48.611 ms/op | 28.503 ms/op | 1.71 |
| altair processBlock - 250000 vs - 7PWei worstcase | 106.94 ms/op | 65.923 ms/op | 1.62 |
| altair processBlock - 250000 vs - 7PWei worstcase hashState | 124.05 ms/op | 94.593 ms/op | 1.31 |
| phase0 processBlock - 250000 vs - 7PWei normalcase | 4.5194 ms/op | 2.7384 ms/op | 1.65 |
| phase0 processBlock - 250000 vs - 7PWei worstcase | 54.683 ms/op | 38.419 ms/op | 1.42 |
| altair processEth1Data - 250000 vs - 7PWei normalcase | 1.1336 ms/op | 662.97 us/op | 1.71 |
| Tree 40 250000 create | 1.0739 s/op | 624.04 ms/op | 1.72 |
| Tree 40 250000 get(125000) | 340.54 ns/op | 174.32 ns/op | 1.95 |
| Tree 40 250000 set(125000) | 3.4977 us/op | 1.8270 us/op | 1.91 |
| Tree 40 250000 toArray() | 37.108 ms/op | 22.195 ms/op | 1.67 |
| Tree 40 250000 iterate all - toArray() + loop | 38.048 ms/op | 22.482 ms/op | 1.69 |
| Tree 40 250000 iterate all - get(i) | 141.34 ms/op | 87.130 ms/op | 1.62 |
| MutableVector 250000 create | 17.828 ms/op | 11.542 ms/op | 1.54 |
| MutableVector 250000 get(125000) | 14.318 ns/op | 7.6440 ns/op | 1.87 |
| MutableVector 250000 set(125000) | 877.94 ns/op | 482.68 ns/op | 1.82 |
| MutableVector 250000 toArray() | 6.8348 ms/op | 4.5758 ms/op | 1.49 |
| MutableVector 250000 iterate all - toArray() + loop | 8.2101 ms/op | 5.0312 ms/op | 1.63 |
| MutableVector 250000 iterate all - get(i) | 3.3278 ms/op | 1.9110 ms/op | 1.74 |
| Array 250000 create | 6.3229 ms/op | 4.1198 ms/op | 1.53 |
| Array 250000 clone - spread | 2.4196 ms/op | 1.8577 ms/op | 1.30 |
| Array 250000 get(125000) | 1.2060 ns/op | 0.93200 ns/op | 1.29 |
| Array 250000 set(125000) | 1.1480 ns/op | 0.93400 ns/op | 1.23 |
| Array 250000 iterate all - loop | 149.15 us/op | 108.63 us/op | 1.37 |
| effectiveBalanceIncrements clone Uint8Array 300000 | 81.074 us/op | 123.98 us/op | 0.65 |
| effectiveBalanceIncrements clone MutableVector 300000 | 763.00 ns/op | 514.00 ns/op | 1.48 |
| effectiveBalanceIncrements rw all Uint8Array 300000 | 292.31 us/op | 177.19 us/op | 1.65 |
| effectiveBalanceIncrements rw all MutableVector 300000 | 220.21 ms/op | 114.35 ms/op | 1.93 |
| phase0 afterProcessEpoch - 250000 vs - 7PWei | 199.14 ms/op | 148.63 ms/op | 1.34 |
| phase0 beforeProcessEpoch - 250000 vs - 7PWei | 83.968 ms/op | 46.696 ms/op | 1.80 |
| altair processEpoch - mainnet_e81889 | 687.40 ms/op | 371.04 ms/op | 1.85 |
| mainnet_e81889 - altair beforeProcessEpoch | 169.64 ms/op | 101.87 ms/op | 1.67 |
| mainnet_e81889 - altair processJustificationAndFinalization | 64.531 us/op | 22.473 us/op | 2.87 |
| mainnet_e81889 - altair processInactivityUpdates | 12.348 ms/op | 6.7226 ms/op | 1.84 |
| mainnet_e81889 - altair processRewardsAndPenalties | 208.19 ms/op | 71.741 ms/op | 2.90 |
| mainnet_e81889 - altair processRegistryUpdates | 16.206 us/op | 4.1590 us/op | 3.90 |
| mainnet_e81889 - altair processSlashings | 3.6480 us/op | 827.00 ns/op | 4.41 |
| mainnet_e81889 - altair processEth1DataReset | 4.1450 us/op | 991.00 ns/op | 4.18 |
| mainnet_e81889 - altair processEffectiveBalanceUpdates | 3.2780 ms/op | 1.4293 ms/op | 2.29 |
| mainnet_e81889 - altair processSlashingsReset | 25.728 us/op | 4.9210 us/op | 5.23 |
| mainnet_e81889 - altair processRandaoMixesReset | 23.882 us/op | 6.2280 us/op | 3.83 |
| mainnet_e81889 - altair processHistoricalRootsUpdate | 3.8600 us/op | 931.00 ns/op | 4.15 |
| mainnet_e81889 - altair processParticipationFlagUpdates | 14.405 us/op | 3.8840 us/op | 3.71 |
| mainnet_e81889 - altair processSyncCommitteeUpdates | 3.3100 us/op | 791.00 ns/op | 4.18 |
| mainnet_e81889 - altair afterProcessEpoch | 219.34 ms/op | 165.96 ms/op | 1.32 |
| phase0 processEpoch - mainnet_e58758 | 637.47 ms/op | 391.89 ms/op | 1.63 |
| mainnet_e58758 - phase0 beforeProcessEpoch | 284.37 ms/op | 163.54 ms/op | 1.74 |
| mainnet_e58758 - phase0 processJustificationAndFinalization | 64.040 us/op | 17.582 us/op | 3.64 |
| mainnet_e58758 - phase0 processRewardsAndPenalties | 155.28 ms/op | 63.057 ms/op | 2.46 |
| mainnet_e58758 - phase0 processRegistryUpdates | 34.338 us/op | 7.6600 us/op | 4.48 |
| mainnet_e58758 - phase0 processSlashings | 3.0690 us/op | 841.00 ns/op | 3.65 |
| mainnet_e58758 - phase0 processEth1DataReset | 3.0930 us/op | 872.00 ns/op | 3.55 |
| mainnet_e58758 - phase0 processEffectiveBalanceUpdates | 2.4449 ms/op | 1.1694 ms/op | 2.09 |
| mainnet_e58758 - phase0 processSlashingsReset | 18.000 us/op | 5.2000 us/op | 3.46 |
| mainnet_e58758 - phase0 processRandaoMixesReset | 26.119 us/op | 6.1010 us/op | 4.28 |
| mainnet_e58758 - phase0 processHistoricalRootsUpdate | 4.1170 us/op | 890.00 ns/op | 4.63 |
| mainnet_e58758 - phase0 processParticipationRecordUpdates | 22.273 us/op | 5.5050 us/op | 4.05 |
| mainnet_e58758 - phase0 afterProcessEpoch | 180.49 ms/op | 126.11 ms/op | 1.43 |
| phase0 processEffectiveBalanceUpdates - 250000 normalcase | 2.3583 ms/op | 1.4364 ms/op | 1.64 |
| phase0 processEffectiveBalanceUpdates - 250000 worstcase 0.5 | 2.8842 ms/op | 1.5832 ms/op | 1.82 |
| altair processInactivityUpdates - 250000 normalcase | 53.881 ms/op | 37.881 ms/op | 1.42 |
| altair processInactivityUpdates - 250000 worstcase | 56.703 ms/op | 30.704 ms/op | 1.85 |
| phase0 processRegistryUpdates - 250000 normalcase | 27.601 us/op | 9.6090 us/op | 2.87 |
| phase0 processRegistryUpdates - 250000 badcase_full_deposits | 522.18 us/op | 275.81 us/op | 1.89 |
| phase0 processRegistryUpdates - 250000 worstcase 0.5 | 263.78 ms/op | 143.52 ms/op | 1.84 |
| altair processRewardsAndPenalties - 250000 normalcase | 109.40 ms/op | 69.337 ms/op | 1.58 |
| altair processRewardsAndPenalties - 250000 worstcase | 161.79 ms/op | 105.70 ms/op | 1.53 |
| phase0 getAttestationDeltas - 250000 normalcase | 13.612 ms/op | 9.2822 ms/op | 1.47 |
| phase0 getAttestationDeltas - 250000 worstcase | 13.975 ms/op | 8.6617 ms/op | 1.61 |
| phase0 processSlashings - 250000 worstcase | 6.7385 ms/op | 3.7107 ms/op | 1.82 |
| altair processSyncCommitteeUpdates - 250000 | 342.93 ms/op | 214.77 ms/op | 1.60 |
| BeaconState.hashTreeRoot - No change | 637.00 ns/op | 393.00 ns/op | 1.62 |
| BeaconState.hashTreeRoot - 1 full validator | 82.021 us/op | 48.739 us/op | 1.68 |
| BeaconState.hashTreeRoot - 32 full validator | 851.40 us/op | 466.86 us/op | 1.82 |
| BeaconState.hashTreeRoot - 512 full validator | 8.2395 ms/op | 5.1841 ms/op | 1.59 |
| BeaconState.hashTreeRoot - 1 validator.effectiveBalance | 119.63 us/op | 60.800 us/op | 1.97 |
| BeaconState.hashTreeRoot - 32 validator.effectiveBalance | 1.6421 ms/op | 866.50 us/op | 1.90 |
| BeaconState.hashTreeRoot - 512 validator.effectiveBalance | 21.505 ms/op | 11.122 ms/op | 1.93 |
| BeaconState.hashTreeRoot - 1 balances | 88.048 us/op | 47.245 us/op | 1.86 |
| BeaconState.hashTreeRoot - 32 balances | 802.23 us/op | 426.19 us/op | 1.88 |
| BeaconState.hashTreeRoot - 512 balances | 9.1021 ms/op | 3.8653 ms/op | 2.35 |
| BeaconState.hashTreeRoot - 250000 balances | 116.69 ms/op | 73.236 ms/op | 1.59 |
| aggregationBits - 2048 els - zipIndexesInBitList | 41.912 us/op | 27.366 us/op | 1.53 |
| regular array get 100000 times | 59.039 us/op | 45.132 us/op | 1.31 |
| wrappedArray get 100000 times | 58.967 us/op | 47.205 us/op | 1.25 |
| arrayWithProxy get 100000 times | 36.235 ms/op | 19.976 ms/op | 1.81 |
| ssz.Root.equals | 622.00 ns/op | 418.00 ns/op | 1.49 |
| byteArrayEquals | 601.00 ns/op | 418.00 ns/op | 1.44 |
| shuffle list - 16384 els | 12.776 ms/op | 8.3475 ms/op | 1.53 |
| shuffle list - 250000 els | 186.58 ms/op | 131.78 ms/op | 1.42 |
| processSlot - 1 slots | 18.575 us/op | 9.5210 us/op | 1.95 |
| processSlot - 32 slots | 2.5724 ms/op | 1.3926 ms/op | 1.85 |
| getEffectiveBalanceIncrementsZeroInactive - 250000 vs - 7PWei | 642.91 us/op | 258.78 us/op | 2.48 |
| getCommitteeAssignments - req 1 vs - 250000 vc | 5.7539 ms/op | 3.9117 ms/op | 1.47 |
| getCommitteeAssignments - req 100 vs - 250000 vc | 8.1416 ms/op | 5.7227 ms/op | 1.42 |
| getCommitteeAssignments - req 1000 vs - 250000 vc | 8.4219 ms/op | 6.0893 ms/op | 1.38 |
| computeProposers - vc 250000 | 24.565 ms/op | 12.885 ms/op | 1.91 |
| computeEpochShuffling - vc 250000 | 183.22 ms/op | 124.59 ms/op | 1.47 |
| getNextSyncCommittee - vc 250000 | 348.19 ms/op | 215.22 ms/op | 1.62 |

by benchmarkbot/action


dapplion commented Jul 12, 2022

On all three nodes (contabo-5, contabo-18, contabo-19) there's a noticeable improvement in block processing times (the branch was deployed at the blue line).

Screenshot from 2022-07-12 08-35-56

- export function initiateValidatorExit(state: CachedBeaconStateAllForks, validator: phase0.Validator): void {
+ export function initiateValidatorExit(
+   state: CachedBeaconStateAllForks,
+   validator: CompositeViewDU<typeof ssz.phase0.Validator>
Contributor Author

Using CompositeViewDU to express the need for mutability here

postState = processSlots(postState, nextEpochSlot, metrics);

// Cache state to preserve epoch transition work
- const checkpointState = postState.clone();
+ const checkpointState = postState;
Contributor

@dapplion I think it's safer to do postState.clone(false)?

from your implementation

  clone(dontTransferCache?: boolean): this {
    if (dontTransferCache) {
      return this.type.getViewDU(this.node) as this;
    } else {
      const cache = this.cache;
      this.clearCache();
      return this.type.getViewDU(this.node, cache) as this;
    }
  }
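
To make the two branches of the quoted `clone()` concrete, here is a small self-contained sketch. `ToyView` is a hypothetical stand-in with a plain `Map` as its cache, not the real ssz view class; only the shape of `clone(dontTransferCache?)` follows the snippet above. Reading that signature, `clone()` and `clone(false)` both take the `else` branch and move the cache to the clone, while `clone(true)` leaves the source's cache in place.

```typescript
// Toy stand-in for a view whose cache has a single owner.
// `ToyView` is hypothetical; it is not the @chainsafe/ssz API.
class ToyView {
  private cache: Map<number, number>;

  constructor(
    private readonly node: readonly number[],
    cache?: Map<number, number>
  ) {
    this.cache = cache ?? new Map<number, number>();
  }

  /** Read through the cache, populating it on a miss. */
  get(i: number): number {
    const cached = this.cache.get(i);
    if (cached !== undefined) return cached;
    const value = this.node[i];
    if (value === undefined) throw new RangeError(`index ${i} out of bounds`);
    this.cache.set(i, value);
    return value;
  }

  cacheSize(): number {
    return this.cache.size;
  }

  clone(dontTransferCache?: boolean): ToyView {
    if (dontTransferCache) {
      // Source keeps its cache; the clone starts cold.
      return new ToyView(this.node);
    } else {
      // Cache moves "forward" to the clone; the source starts cold again.
      const cache = this.cache;
      this.cache = new Map<number, number>();
      return new ToyView(this.node, cache);
    }
  }
}

const view = new ToyView([10, 20, 30]);
view.get(0);
view.get(1);

const cold = view.clone(true);
const afterColdClone = [view.cacheSize(), cold.cacheSize()];

const warm = view.clone(); // same branch as clone(false)
const afterWarmClone = [view.cacheSize(), warm.cacheSize()];
```

In this toy model either mode leaves exactly one of the two instances without a warm cache, which is the cost that skipping the clone for read-only use avoids.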

Contributor Author

Sorry, I don't follow. Why would it be safer to do a clone here?

Contributor Author

@tuyennhv The point of this PR is to think critically about when our state can actually be mutated. My opinion is that we clone in too many places where a state can't be mutated, so cloning is an unnecessary expense that slows down Lodestar for no gain.

Contributor

@dapplion In the past we had an issue caused by a missing clone() call, and it took a long time to debug. @wemeetagain may have experienced it.

So what I mean is: instead of dropping the clone() call, can we use clone(false), which has less performance impact?

If both clone() and clone(false) cause performance issues, then do we need to benchmark to improve performance?

Contributor

My confusion: I thought the cache was used only for mutation, but it also contains the index of nodes so that we only have to traverse to each node once.

If we use clone(false), the consumer does not get a chance to use the cached nodes of the tree, so it's bad to go with either clone() or clone(false).

with ssz v2, we should not ever have a state mutation unless in state transition, thanks @dapplion for your explanation 👍


// Clone first to account for metrics below
const itemCloned = item.clone();

this.metrics?.stateClonedCount.observe(item.clonedCount);
Contributor

Since we don't call clone() anymore, should the metric still observe clonedCount here?


@twoeths twoeths left a comment

I inspected all the state.clone() calls, and I think the change is safe.


@wemeetagain wemeetagain left a comment

LGTM but might be worth deploying to a node to test first

@wemeetagain
Member

Oh, just noticed the metrics above

@wemeetagain wemeetagain merged commit 7db4e4a into unstable Jul 18, 2022
@wemeetagain wemeetagain deleted the dapplion/clone-when-necessary branch July 18, 2022 14:29