
Perform snapshots only after BOYD #7504

Closed

mhofman opened this issue Apr 25, 2023 · 2 comments · Fixed by #7558
Labels: enhancement (New feature or request), SwingSet (package: SwingSet)

Comments

mhofman (Member) commented Apr 25, 2023

What is the Problem Being Solved?

Moddable added a feature to only clear WeakRefs during organic GC, mitigating our known issue where liveslots exposes its sensitivity to GC through its syscalls (#6784).

However, one case where a forced full GC is still performed is during a snapshot. The problem: if we perform an XS upgrade that changes allocation behavior, and some representatives or presences were released by the program between the last BOYD and the snapshot taken after it, then replaying the transcript to bring the vat back online may produce different syscalls.
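For illustration, here is a minimal sketch of that sensitivity (the names are illustrative stand-ins, not actual liveslots code): liveslots only learns that userspace dropped a Presence when a finalizer runs, so the syscalls it emits depend on when the engine last collected garbage.

```js
// Illustrative sketch only; makePresence and the vref strings are
// hypothetical stand-ins for liveslots internals.
const droppedVrefs = [];
const registry = new FinalizationRegistry(vref => droppedVrefs.push(vref));

function makePresence(vref) {
  const presence = { vref }; // stand-in for a real Presence object
  registry.register(presence, vref);
  return presence;
}

let p = makePresence('o-50');
p = null; // userspace releases its last reference

// Whether the next bringOutYourDead emits syscall.dropImports(['o-50'])
// depends on whether GC (and thus the finalizer) ran before it. A forced
// full GC at snapshot time, or an engine version with different
// allocation behavior, can change that timing and therefore change the
// syscalls seen during transcript replay.
```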

Description of the Design

Remove the independent snapshot scheduling, and only perform a snapshot after delivering a bringOutYourDead.

Possibly force a bringOutYourDead immediately after vat start, so we still create a snapshot of the freshly initialized vat as we do today.
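A minimal sketch of the proposed coupling, under assumed names (`deliverBOYD` and `saveSnapshot` are illustrative stubs, not the real kernel API): with independent scheduling removed, saving a snapshot is only ever reachable as the tail of a BOYD delivery.

```js
// Hypothetical stubs standing in for the real kernel operations.
const deliverBOYD = async vatID => console.log(`BOYD -> ${vatID}`);
const saveSnapshot = async vatID => console.log(`snapshot <- ${vatID}`);

// Snapshot is only reachable after a bringOutYourDead:
async function deliverBringOutYourDead(vatID, { wantSnapshot = false } = {}) {
  await deliverBOYD(vatID); // shakes out garbage, emits drop/retire syscalls
  if (wantSnapshot) {
    // the heap is now post-GC quiescent, so the forced GC during the
    // snapshot write cannot surface garbage liveslots hasn't processed
    await saveSnapshot(vatID);
  }
}

await deliverBringOutYourDead('v1', { wantSnapshot: true });
```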

Security Considerations

None

Scaling Considerations

None

Test Plan

TBD

mhofman added the enhancement label Apr 25, 2023
mhofman (Member, Author) commented Apr 25, 2023

Extracted from #6786

mhofman added the SwingSet label Apr 26, 2023
warner (Member) commented Apr 29, 2023

I've implemented a form of this, but I changed the causality: instead of inhibiting snapshots so they only happen after a BOYD, I changed the snapshot logic to always perform a BOYD first. So snapshots still happen according to the usual snapshotInitial/snapshotInterval schedule, but we get an extra BOYD just before each one. That seemed cleaner, and I think we still want BOYDs to happen more frequently than snapshots.
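In sketch form (hedged: `snapshotIsDue`, `deliver`, and `makeSnapshot` are illustrative names, not the actual manager API), the inverted causality looks like:

```js
// Hypothetical stubs; the real kernel plumbing differs.
const snapshotIsDue = vatID => true; // stands in for the schedule check
const deliver = async (vatID, vd) => console.log(`deliver ${vd[0]} -> ${vatID}`);
const makeSnapshot = async vatID => console.log(`snapshot <- ${vatID}`);

// Keep the usual schedule, but prepend a BOYD to every snapshot,
// rather than making snapshots wait for the next scheduled BOYD.
async function maybeSaveSnapshot(vatID) {
  if (!snapshotIsDue(vatID)) return; // snapshotInitial/snapshotInterval schedule
  await deliver(vatID, ['bringOutYourDead']); // the extra, unmetered BOYD
  await makeSnapshot(vatID); // snapshot taken immediately after
}
```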

The limitations of my PR:

  • we don't report the extra BOYD to the runPolicy, so blocks with snapshots may run longer than expected
    • (we report normal BOYDs' computrons to it, just not these extra snapshot-related ones)
    • no userspace code can run during a BOYD, so we've never metered BOYDs (decrementMeterID-style)
  • if the BOYD fails somehow, or (for some unknown reason) calls syscall.exit(), we won't process that in the usual "maybe unwind the crank" portion of processDeliveryMessage
    • that might mean vat-admin is not notified about the problem during that crank (though it might notice on the next delivery to that vat; I'm not sure)
  • the extra BOYD does not interact with the existing vatKeeper.countdownToReap(), so occasional cycle-alignments might cause a double BOYD
    • wasteful, but not damaging

warner added a commit that referenced this issue Apr 30, 2023
This changes the snapshot scheduling logic to be more consistent. We
still use `snapshotInitial` to trigger a snapshot shortly after worker
initialization, and `snapshotInterval` to trigger periodic ones after
that.

However the previous code compared `snapshotInitial` to the absolute
deliveryNum, which meant it only applied to the first incarnation, and
would not attempt to take a snapshot shortly after upgrade, leaving
the kernel vulnerable to replaying the long `startVat` delivery for a
larger window than we intended. And `snapshotInterval` was compared
against the difference between the latest transcript and the latest
snapshot, which changed with the addition of the load-worker
pseudo-entry.

The new code uses `snapshotInitial` whenever there is not an existing
snapshot (so the first span of *all* incarnations), and compares it
against the length of the current span (so it includes all the
pseudo-events). `snapshotInterval` is also compared against the length
of the current span.

The result is a simpler and more predictable set of rules:

* in the first span of each incarnation, trigger a snapshot once we
  have at least `snapshotInitial` entries
* in all other spans, trigger once we have at least `snapshotInterval`

In addition, when triggering a snapshot, we perform a BringOutYourDead
delivery before asking the worker to save a snapshot. This gives us
one last chance to shake out any garbage (making the snapshot as small
as possible), and reduces the variation we might see from the forced
GC that happens during the snapshot write (any FinalizationRegistry
callbacks should run during the BOYD, not the save-snapshot).

closes #7553
closes #7504
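Read as a predicate, the rules above amount to something like the following sketch (hedged: `hasSnapshot` and `currentSpanLength` are illustrative accessor names, not the actual vatKeeper API):

```js
// Hypothetical sketch of the scheduling predicate described above.
function snapshotIsDue(vatKeeper, { snapshotInitial, snapshotInterval }) {
  // span length counts every entry, including load-worker pseudo-entries
  const spanLength = vatKeeper.currentSpanLength();
  return vatKeeper.hasSnapshot()
    ? spanLength >= snapshotInterval // all spans after the first snapshot
    : spanLength >= snapshotInitial; // first span of every incarnation
}
```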
mergify bot closed this as completed in #7558 Apr 30, 2023