
Performance impact of BOYD #4160

Closed
mhofman opened this issue Dec 7, 2021 · 7 comments · Fixed by #5884
Assignees
Labels
bug Something isn't working performance Performance related issues SwingSet package: SwingSet
Milestone

Comments

@mhofman
Member

mhofman commented Dec 7, 2021

Describe the bug

The bring-out-your-dead PR (#3801) caused a jump in swingset time. There also seems to be more time spent in swingset during a block over the life of the chain, which is usually indicative of a memory leak.

To Reproduce

Steps to reproduce the behavior:

  1. Follow instructions to set up loadgen
  2. Start a loadgen test with the following parameters: --stage.duration=10 --stage.loadgen.vault.interval=12 --stage.loadgen.amm.interval=12 --stage.loadgen.amm.wait=6 --stage.loadgen.vault.limit=10 --stage.loadgen.amm.limit=10
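The reproduction steps above can be sketched as a shell invocation. The stage flags are the ones from this report; the runner name itself is a placeholder (follow the loadgen repo's instructions for the actual entry point):

```shell
# Stage flags taken verbatim from this report.
FLAGS=(
  --stage.duration=10
  --stage.loadgen.vault.interval=12
  --stage.loadgen.amm.interval=12
  --stage.loadgen.amm.wait=6
  --stage.loadgen.vault.limit=10
  --stage.loadgen.amm.limit=10
)
# "loadgen-runner" is a hypothetical name; substitute the documented
# loadgen command from the setup instructions.
echo "loadgen-runner ${FLAGS[*]}"
```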

Expected behavior

Not such a significant increase in swingset time, and no further increase over the life of the chain.

Platform Environment

  • Loadgen docker image
  • is there anything special/unusual about your platform? WSL2
  • what version of the Agoric-SDK are you using? 9a47516a349185409d3c320b585312178c8a08b8

Additional context

The PR doesn't change the GC schedule, so the impact likely comes from the different accounting logic that BOYD introduces.

Screenshots

```csv
"sdkRevision","sdkCommitTime","success","walletDeployDuration","loadgenDeployDuration","cycleSuccessRate","cycleAvgDuration","stage1_cycleSuccessRate","stage1_cycleAvgDuration","stage1_avgSwingsetBlockTime","stage1_avgSwingsetPercentage","stage2_cycleSuccessRate","stage2_cycleAvgDuration","stage2_avgSwingsetBlockTime","stage2_avgSwingsetPercentage","stage3_cycleSuccessRate","stage3_cycleAvgDuration","stage3_avgSwingsetBlockTime","stage3_avgSwingsetPercentage","stage4_cycleSuccessRate","stage4_cycleAvgDuration","stage4_avgSwingsetBlockTime","stage4_avgSwingsetPercentage"
"da7c924aa","2021-11-01 23:00:26",true,238.067808,94.07785,100,219.902715,100,214.911984,1.254112,24.89,100,218.745497,1.391519,27.61,100,226.627465,1.402919,27.84,100,220.151802,1.425838,28.14
"9a47516a3","2021-11-03 01:10:09",false,238.950638,98.910524,100,234.284053,100,235.138661,1.713511,34.03,100,228.431566,1.821268,36.14,100,239.761833,2.357044,47.26,,,,
```

Swingset percentage over time

NB: Stage 4 is missing due to an unrelated chain stall on restart

@mhofman mhofman added the bug Something isn't working label Dec 7, 2021
@FUDCo FUDCo changed the title Performance impact of BYOD Performance impact of BOYD Dec 7, 2021
@dckc dckc added the performance Performance related issues label Dec 7, 2021
@mhofman mhofman added the SwingSet package: SwingSet label Jan 20, 2022
@Tartuffo Tartuffo added MN-1 and removed MN-1 labels Feb 2, 2022
@warner
Member

warner commented Feb 25, 2022

Maybe a good ticket for our new hire.

@Tartuffo
Contributor

Tartuffo commented Mar 7, 2022

FYI @arirubinstein

@Tartuffo Tartuffo added this to the Mainnet 1 milestone Mar 23, 2022
@Tartuffo
Contributor

We could probably get a lot of benefit from some tuning of parameters; it's probably worth doing that for MN-1, with further fine-tuning controls for MN-1.1.

@warner
Member

warner commented Jun 24, 2022

@FUDCo and I figure we should set this to roughly 1 BOYD per 1000 deliveries for now.

The risk is allowing one vat (which doesn't get constant traffic) to hold onto objects that keep other vats from dropping things. Chip points out that this is only really a problem if it impedes operations. We think we could build a follower node that periodically makes a separate copy of the kernel state, detaches that copy from the chain, sends a BOYD to all vats, and measures the change in the kernel object table size. This would let us track how much garbage is "floating" on the chain waiting for a BOYD, which might tell us either (1) it's not that much and we don't need to worry about it, or (2) we should do something at the next convenient point of upgrade.
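The measurement idea above can be sketched in miniature. This is a toy model, not the real SwingSet kernel API: the object table is represented as a plain `Map`, and "send a BOYD to all vats" is simulated by dropping every entry the vats no longer hold reachable, then diffing the table size:

```javascript
// Toy model of the floating-garbage measurement: take the object table
// of a detached kernel-state copy, sweep it as a BOYD would, and report
// how many entries were only being kept alive pending collection.
function measureFloatingGarbage(objectTable, isStillReachable) {
  const before = objectTable.size;
  // Simulated BOYD sweep: drop entries no vat still holds reachable.
  // (Deleting during for...of over a Map is safe in JS.)
  for (const [kref, entry] of objectTable) {
    if (!isStillReachable(kref, entry)) {
      objectTable.delete(kref);
    }
  }
  const after = objectTable.size;
  return { before, after, floating: before - after };
}

// Example: three kernel objects; one is garbage awaiting a BOYD.
// The kref names ('ko21', ...) are illustrative.
const table = new Map([
  ['ko21', { reachable: 1 }],
  ['ko22', { reachable: 0 }], // dropped by its vat, not yet collected
  ['ko23', { reachable: 2 }],
]);
const report = measureFloatingGarbage(table, (_kref, e) => e.reachable > 0);
console.log(report); // { before: 3, after: 2, floating: 1 }
```

Because the sweep runs on a detached copy of the state, the measurement never perturbs the chain itself, which is what makes the follower-node approach attractive.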

warner added a commit that referenced this issue Jun 30, 2022
While we were debugging performance problems last year, we introduced
`managerType: 'xs-worker-no-gc'`, which is a variant of `xs-worker`
that disables the forced GC we were doing at the end of every
delivery.

Since then, we've improved performance in several ways:

* XS bugs were retaining objects, which have since been fixed
* we only do a forced GC during special `dispatch.bringOutYourDead`
cranks
* we deliver BOYD less often (according to a configurable schedule,
see #4160 for our decisions, but we're thinking once every 100 to 1000
deliveries)

The `xs-worker-no-gc` manager type controlled a flag named
`gcEveryCrank`, which replaced the GC primitive available to liveslots
with a dummy version.

This commit removes both `xs-worker-no-gc` and `gcEveryCrank`.

closes #5600
warner added a commit that referenced this issue Jul 3, 2022
warner added a commit that referenced this issue Jul 6, 2022
warner added a commit that referenced this issue Jul 7, 2022
warner added a commit that referenced this issue Jul 7, 2022
warner added a commit that referenced this issue Jul 9, 2022
turadg pushed a commit that referenced this issue Jul 14, 2022
@Tartuffo
Contributor

We still need to choose the number and test it out.

@Tartuffo
Contributor

@arirubinstein has done some benchmarking and has a proposed number based on that benchmarking.

There are two places it can get set: the default value in the code itself (currently set to 1), and the config file for the swingset you are actually running.

We need to decide where to make the change. My preference is to change the default.
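The policy being tuned here is "deliver a BOYD to a vat once every N ordinary deliveries", with the default currently 1. A minimal sketch of that scheduling logic, assuming a config key along the lines of `defaultReapInterval` (treat the key name as illustrative, not the verbatim SwingSet option):

```javascript
// Sketch of the BOYD scheduling policy: count ordinary deliveries and
// signal when a dispatch.bringOutYourDead is due. The interval comes
// from config, falling back to the code default of 1.
function makeBOYDScheduler(config = {}) {
  const interval = config.defaultReapInterval ?? 1; // current default: 1
  let deliveriesSinceBOYD = 0;
  return {
    // Call after each ordinary delivery; returns true when a BOYD
    // should be scheduled for the vat.
    countDelivery() {
      deliveriesSinceBOYD += 1;
      if (deliveriesSinceBOYD >= interval) {
        deliveriesSinceBOYD = 0;
        return true;
      }
      return false;
    },
  };
}

// With an interval of 1000 (the kind of number discussed above), only
// every 1000th delivery triggers a BOYD instead of every delivery.
const sched = makeBOYDScheduler({ defaultReapInterval: 1000 });
let boyds = 0;
for (let i = 0; i < 3000; i += 1) {
  if (sched.countDelivery()) boyds += 1;
}
console.log(boyds); // 3
```

Changing the code default applies the new interval everywhere at once, while a config-file override scopes the change to one deployment; that trade-off is what the comment above is weighing.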

@Tartuffo
Contributor

Tartuffo commented Aug 2, 2022

@FUDCo, Brian and I just discussed this; can you please start working on it?
