Non traumatic major XS upgrades #7855

Open
mhofman opened this issue May 26, 2023 · 8 comments
Labels
cosmic-swingset package: cosmic-swingset enhancement New feature or request needs-design swing-store SwingSet package: SwingSet

@mhofman
Member

mhofman commented May 26, 2023

What is the Problem Being Solved?

#6361 describes conditions under which we can upgrade XS in a chain upgrade and have all vats use that version of XS going forward. The main expectation is that snapshots are at least compatible: the new version of XS can load a snapshot written by the old version of XS, and keep executing as previously recorded.

The main problem is incompatible snapshots, such as when a major or minor version update of XS occurs, or when new globals are implemented by XS. All the other requirements are believed to be possible already: the execution as seen by the transcript in newer versions of XS will be the same as what was recorded in the previous version.

#6361 references using multiple versions of XS (further defined in #6596) and performing vat upgrades to switch vats to the newer version. This issue explores an alternative that neither introduces any upgrade trauma nor requires multiple versions of XS to be distributed.

Pre-requisite knowledge on the current implementation

While liveslots's implementation is still revealing organic gc, in #7498 (and its follow-up #7552) we've basically hidden organic gc from liveslots. In #7558 we make sure that the effects of snapshots (which perform a full forced gc) are not observable in transcripts after the snapshot is taken. We believe that, together, these make our vat transcripts fully independent of any differences in engine allocation behavior.

In #7484 we introduced transcript entries that capture snapshot information (hashes) in the transcript. This makes the transcript somewhat dependent on the version of XS, but these are not actual deliveries, so they can be handled.

There is still the possibility that metering limits would cause a single crank to fail where it previously succeeded, but that is currently unlikely.

With the introduction of state sync (#7225), validators may not have the full transcript content of previous spans (between the latest incarnation start and the latest snapshot taken). However, the hashes of previous spans are kept in the swing-store to support repopulating these historical transcript entries.
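
As a minimal sketch of how a retained span hash lets a validator check transcript entries fetched from elsewhere: the rolling-hash construction below (SHA-256 over the previous hash concatenated with the entry) and the names `rolling_hash` and `verify_span` are assumptions for illustration only; the actual swing-store scheme may differ.

```python
import hashlib


def rolling_hash(entries, seed=""):
    """Fold each entry into a running SHA-256 digest (illustrative scheme)."""
    digest = seed
    for entry in entries:
        digest = hashlib.sha256((digest + entry).encode()).hexdigest()
    return digest


def verify_span(entries, expected_hash):
    """Accept repopulated entries only if they reproduce the stored span hash."""
    return rolling_hash(entries) == expected_hash
```

Because each step depends on the previous digest, a missing, altered, or reordered entry changes the final hash, so the stored span hash is enough to validate a whole span.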

Newer versions of XS may introduce new intrinsics. In general these new intrinsics should not impact code execution, however our current SES version is sensitive to new well-known symbols (endojs/endo#1577), and thus would fail on new XS versions that add any unsupported symbols.

Description of the Design

The general idea is to rely on vat transcript replays to regenerate the snapshots and transcript span hashes.

We believe that validators are OK with an upgrade taking some reasonable amount of time to complete (on the order of multiple minutes, likely less than an hour). As such we may be able to perform at least part of this vat transcript replay during the upgrade, but we likely want to streamline the process by making it possible to preprocess some of the replay task.

Replay and regeneration of transcript

The regeneration process would be roughly as follows:

  • Remove snapStore entries for the latest vat incarnation (optional)
  • Empty the hashes for the transcript spans of the latest vat incarnation
  • Start replaying using the new XS version from the first transcript span with an empty span hash
    • an empty hash indicates a span that has not yet been replayed in the newer version
  • When reaching a snapshot save entry, regenerate a new snapshot
    • update snapStore hash and save snapshot
    • remove the previous span's snapshot data in the snapStore
    • update save transcript entry with new snapshot hash
    • update next span's transcript load entry with new snapshot hash
  • When reaching end of span, save new rolling transcript hash in the span table
    • this span has now successfully completed its replay
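
The steps above can be sketched roughly as follows. This is a toy model, not the swing-store implementation: `Span`, `ToyWorker`, `span_hash`, the entry shapes (`load-snapshot`, `save-snapshot`, and delivery entries with a `body`), and the dict used as a snapStore are all hypothetical, and every span is treated as closed for simplicity.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Span:
    entries: list      # transcript entries (dicts), in delivery order
    hash: str = ""     # "" marks a span not yet replayed on the new XS


class ToyWorker:
    """Hypothetical stand-in for a vat worker running the new XS version."""

    def __init__(self):
        self.state = []

    def load(self, snapshot: bytes):
        self.state = json.loads(snapshot)

    def deliver(self, body):
        self.state.append(body)

    def snapshot(self) -> bytes:
        return json.dumps(self.state).encode()


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def span_hash(entries) -> str:
    """Toy rolling hash over a span's entries; the real scheme may differ."""
    h = ""
    for entry in entries:
        h = sha256_hex((h + json.dumps(entry, sort_keys=True)).encode())
    return h


def regenerate(spans, snap_store, worker):
    """Replay from the first span with an empty hash, rewriting hashes."""
    start = next((i for i, s in enumerate(spans) if s.hash == ""), len(spans))
    for i in range(start, len(spans)):
        loaded = None  # hash of the snapshot this span started from
        for entry in spans[i].entries:
            if entry["type"] == "load-snapshot":
                loaded = entry["snapshot_hash"]
                worker.load(snap_store[loaded])
            elif entry["type"] == "save-snapshot":
                data = worker.snapshot()
                new_hash = sha256_hex(data)
                snap_store[new_hash] = data        # save regenerated snapshot
                if loaded is not None and loaded != new_hash:
                    snap_store.pop(loaded, None)   # drop previous span's snapshot
                entry["snapshot_hash"] = new_hash  # update the save entry
                if i + 1 < len(spans) and spans[i + 1].entries:
                    nxt = spans[i + 1].entries[0]  # update next span's load entry
                    if nxt["type"] == "load-snapshot":
                        nxt["snapshot_hash"] = new_hash
            else:
                worker.deliver(entry["body"])
        spans[i].hash = span_hash(spans[i].entries)  # span replay is complete
```

Since a span's hash is only written once the whole span has been replayed, an interrupted run can call `regenerate` again and pick up at the first span that is still empty, as long as the snapshot that span loads from is still in the store.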

Offline pre-processing

This rough process allows doing partial replays of transcripts which can be later resumed. If applied as a pseudo-diff, it also allows the transcript to keep growing after being exported for offline processing:

  • An export of the swingstore (like the one used for state-sync or a future genesis export) is used to capture the artifacts and export data related to the latest incarnation of every vat.
    • If artifacts for historical transcript spans are missing, they can be retrieved from an archive node out of band
      • They are verifiable through the exported transcript span hashes
    • the snapstore (or bundle) artifacts do not need to be exported
    • the "export data" not related to validation of transcript data is not needed
  • The offline tool keeps track of:
    • updates to the transcript entries (namely load and save snapshot entries)
    • new span hashes being generated
      • in the offline tool, a new span hash must not be generated/recorded for the last/"current" span
    • new snapshot hashes and data
  • At upgrade, we perform the following:
    • Empty the hashes for the transcript spans of the latest vat incarnation
    • Lookup if any offline data exists for that vat incarnation, and apply the pseudo-diff
    • proceed with the regeneration replay process, starting at the first span with an empty hash
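
A small sketch of the upgrade-time step, under an assumed data layout: each span is a dict holding its span hash plus the snapshot hashes of its load and save entries, and the offline tool's output is a pseudo-diff keyed by span index. The function name `prepare_upgrade` and this layout are hypothetical.

```python
def prepare_upgrade(spans, snap_store, diff=None):
    """Clear span hashes, apply offline results, return the replay start index.

    spans: list of dicts with keys "hash", "load", "save" (snapshot hashes).
    diff:  {"spans": {index: partial update}, "snapshots": {hash: bytes}}.
    """
    for span in spans:
        span["hash"] = ""                     # mark every span as not replayed
    if diff is not None:
        for index, update in diff["spans"].items():
            spans[index].update(update)       # new span hash and/or snapshot hashes
        snap_store.update(diff["snapshots"])  # regenerated snapshot data
    # Replay resumes at the first span that still has an empty hash.
    return next((i for i, s in enumerate(spans) if s["hash"] == ""), len(spans))
```

Because the offline tool never records a hash for the span that was current at export time, that span and any spans created afterwards keep an empty hash, and the replay at upgrade covers exactly the part of the transcript the offline tool could not finish.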

Other replay considerations

To mitigate XS changes that impact the execution, it may be possible to change the lockdown or supervisor bundles used when replaying the vat (see #6929 for validation of new XS versions).

  • these new bundles are meant to fix compatibility, not to introduce new features. They must not cause behavior that diverges from the recorded transcript
  • these new bundles should be reflected in the vat transcript

Security Considerations

All validators should perform these steps independently. If they share the "offline" data with each other, the chain is vulnerable to corruption. This is not too much of a concern as this process is verifiable.

Since the recomputed hashes would be captured in the swingstore export to the cosmos DB, a supermajority of validators must arrive at identical replay results for the upgrade to succeed.
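
To make the agreement condition concrete, here is a minimal sketch assuming a Tendermint-style rule where a result needs strictly more than two thirds of the total voting power. The function name and the `(voting_power, result_hash)` input shape are hypothetical.

```python
from collections import defaultdict


def agreed_result(votes):
    """Return the result hash backed by more than 2/3 of voting power, if any.

    votes: iterable of (voting_power, result_hash) pairs, one per validator.
    """
    power_by_result = defaultdict(int)
    total = 0
    for power, result_hash in votes:
        power_by_result[result_hash] += power
        total += power
    for result_hash, power in power_by_result.items():
        if 3 * power > 2 * total:  # integer form of power / total > 2/3
            return result_hash
    return None
```

If the replay results split so that no single hash clears the threshold, the function returns `None`, which corresponds to the upgrade failing to reach consensus.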

Scaling Considerations

The replay of multiple vats can be performed in parallel to speed up the restart process.
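
A minimal sketch of replaying vats in parallel, assuming each vat's replay is independent of the others. `replay_vat` is a toy placeholder (it just sums integers) for driving a real worker, and the thread pool stands in for whatever process management the real replay would use.

```python
from concurrent.futures import ThreadPoolExecutor


def replay_vat(vat_id, transcript):
    """Toy replay: fold the transcript into a final state for one vat."""
    state = 0
    for delta in transcript:
        state += delta
    return vat_id, state


def replay_all(transcripts, max_workers=4):
    """Replay every vat concurrently and collect results keyed by vat ID."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(replay_vat, v, t) for v, t in transcripts.items()]
        return dict(f.result() for f in futures)
```

Results are keyed by vat ID rather than by completion order, so the outcome does not depend on how the replays happen to be scheduled.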

Offline partial pre-processing reduces the time needed to replay during the actual upgrade.

Test Plan

TBD, but likely using the Docker-based upgrade testing framework, verifying various scenarios such as offline processing capturing partial (older) vat transcripts, or a vat being upgraded after the capture is made.

@mhofman mhofman added enhancement New feature or request SwingSet package: SwingSet cosmic-swingset package: cosmic-swingset swing-store labels May 26, 2023
@FUDCo
Contributor

FUDCo commented May 27, 2023

Minor nit:

we've basically hidden organic gc from liveslots

That's not quite right. Liveslots sees organic gc but then does various things to hide it from user code.

@mhofman
Member Author

mhofman commented May 27, 2023

Nope, liveslots no longer sees organic GC because we couldn't trust liveslots to correctly hide organic gc impacts from the kernel (in which syscalls are made). We have always trusted liveslots to hide all gc (organic or forced) from user code.

@FUDCo
Contributor

FUDCo commented May 27, 2023

Then what are those uses of WeakRef and FinalizationRegistry in the liveslots package doing?

@mhofman
Member Author

mhofman commented May 27, 2023

They are only cleared out during forced gc (bringOutYourDead and snapshots). See #6784 (comment)

Edit: I updated the issue here to hopefully clarify the gc revealing story.

@FUDCo
Contributor

FUDCo commented May 27, 2023

Thought: if a majority of a quorum of validators approves the results of a replay, the others could get the results via state sync rather than replaying themselves. If replays of different vats can be executed independently, you might be able to get some additional scaling by farming out different vats to different subsets of the validator population.

@mhofman
Member Author

mhofman commented May 27, 2023

you might be able to get some additional scaling by farming out different vats to different subsets of the validator population.

Unfortunately for consensus, we're in an all-or-nothing situation. A single validator needs to come up with all the right answers. There is no way to vote partially on the result.

@FUDCo
Contributor

FUDCo commented May 27, 2023

Yeah, this would be something like a mainnet 4 thing, when we start branching off interweaving sub chains and whatnot for scaling. I could imagine entities bidding for which vats should get priority in upgrade much as we anticipate bidding for priority in message delivery.

@mhofman
Member Author

mhofman commented Oct 3, 2023

A note that some changes to XS may end up having spec-mandated execution differences that are thus directly observable by the program. While unlikely, this highlights that a replay-based upgrade is not 100% foolproof, and that only an XS upgrade requiring a restart/upgrade of the vat is safe (see #8405). More details in #6929 (comment)
