streamStore must tolerate aborted transcript erasure during vat upgrade #4232

Closed
warner opened this issue Dec 30, 2021 · 2 comments


warner commented Dec 30, 2021

What is the Problem Being Solved?

As part of the #1848 upgrade plan, the kernel will react to an update-vat run-queue event with the following steps:

  • terminate any existing vat worker (just as if the vat were idle and had been pushed offline in favor of an active one)
  • replace the vat's vNN.bundle kvStore entry with the new version of the vat's source bundle
  • replace vNN.options with a record that includes the new per-version vatParameters data
  • erase the transcript
  • launch a new worker, which will load the new code and execute the startup phase (importBundle and buildRootObject)
  • check that buildRootObject correctly fulfills its obligation to re-wire all virtual-object Kinds defined by its predecessor

If the new code cannot be evaluated, or buildRootObject leaves some Kinds undefined, the new vat is not viable. Rather than have this kill the vat, it would be great if the upgrade merely failed (leaving the old vat in place).

To accomplish this, we want all the state changes of the update-vat event to be unwound. We already have good support for unwinding normal deliver and notify events, but the update-vat event also erases the transcript, which is a new kind of state change.
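For concreteness, here is a rough sketch of the update-vat handling described above, with the failure case surfaced as a thrown error so the normal crank unwind could discard the pending changes. This is not the real kernel code; `terminateWorker`, `startNewWorker`, and `checkKindsRewired` are illustrative names, not the actual API.

```js
// Hypothetical sketch only -- function and key names are illustrative.
async function processUpdateVat(vatID, newBundleID, vatParameters) {
  await terminateWorker(vatID); // stop the old worker, as if paged out
  kvStore.set(`${vatID}.bundle`, newBundleID); // install the new source bundle
  kvStore.set(`${vatID}.options`, JSON.stringify({ vatParameters })); // new per-version options
  streamStore.eraseStream(`transcript-${vatID}`); // erase the old transcript (the problem case)
  const worker = await startNewWorker(vatID); // importBundle + buildRootObject
  if (!checkKindsRewired(vatID, worker)) {
    // Throwing here should abandon the crank, so every change above
    // (including the transcript erasure) must be unwindable.
    throw Error(`vat ${vatID} upgrade failed: virtual-object Kinds left undefined`);
  }
}
```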

Vat transcripts are held in the streamStore, which uses SQLite to hold a table of individual transcript entries. This SQLite database has a different atomic transaction window than the primary (LMDB) kvStore, so to avoid hangover inconsistency in the case of an abandoned crank, the protocol for adding a transcript entry is (see the sketch after this list):

  • streamStore appends the transcript entry, leaves the SQLite transaction open/pending
  • kvStore is modified with the new length of the transcript, change remains pending in crankBuffer
  • when crankBuffer is committed (e.g. a delivery did not terminate the vat), the crankBuffer deltas are flushed into the blockBuffer (a pending/open LMDB transaction)
  • when the host application survives long enough to commit the block, it calls swingStore.commit()
    • this first does streamStore.commit(), which flushes the SQLite transaction to disk
    • label this moment "point A"
    • then it commits the LMDB transaction, writing the transcript length to disk
    • label this moment "point B"
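A minimal sketch of that commit ordering, assuming method names like `streamStore.commit()`; the real swingStore API may differ:

```js
// Sketch of the commit ordering described above; names are assumptions.
function swingStoreCommit() {
  // Point A: flush the SQLite transaction holding the new transcript entries.
  streamStore.commit();
  // Point B: commit the LMDB transaction, which records the new transcript
  // length in the kvStore. A crash between A and B leaves an extra transcript
  // entry in SQLite that the kvStore does not yet acknowledge.
  kvStoreLmdbTxn.commit();
}
```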

If the system crashes at point A, a later relaunch will still see the old (shorter) transcript length in LMDB, and the full transcript (with an extra entry) in SQLite. The kernel can replay the entire block if it needs to (all the vats can be reloaded from their earlier transcripts), and at some point it will try to append that transcript entry as before, which will simply overwrite the extra entry already in SQLite.

(This scheme was developed back when the streamStore wrote transcript entries directly to an ever-growing file on disk, one entry concatenated after the next.)

The new problem is that erasing the transcript does not compose with failing to update the transcript length. If the transcript is erased, then recovery after a crash at point A will fail because the transcript is already gone: the kvStore is still in the old state (it shows the transcript as still present), but the SQLite table can no longer provide the entries.

This is a general problem of having multiple databases, with different commit points. One solution that we've considered (but have no easy way to implement) would be to find a versioned database: something where you could commit data and get version 2, but still have access to that data as of version 1 for a while. Then the last DB that commits would include the current version number of the earlier-committing DB. A recovery before the last DB commits would see the original state of all DBs. I suspect that cosmos-sdk's database works something like this.

The other solution, which we've used, is to design the interaction between DBs so that the later-committing DBs work correctly independently of whether the earlier-committing DB has committed or not. The last commit determines the overall state, but when the crash happens at point A, an old-version kvStore can tolerate an "echo from the future" that got committed to the streamStore SQLite table.

To fix this particular issue (transcript erasure), I can think of a few options:

  • Define transcript erasure as "kvStore transcript length = 0", and leave all the SQLite entries in place. We don't get to reclaim space for the old vat version (until the replacement version has run enough cranks to overwrite all the old version's entries). This doesn't help us if the new version manages to run any cranks, since those will overwrite the initial cranks of the old version in a way that doesn't get unwound if the block fails to commit.
  • Make the streamStore buffer transcript writes in RAM, just like the crankBuffer (and as the blockBuffer did back in the old JSON-based kvStore). Hm, this also doesn't help with a crank overwrite.
  • Introduce an additional layer of indirection between the kvStore entries and the SQLite tables. Currently we use the vatID in the SQLite rows (the table is (streamName, position, item), with streamName=vatID). If we used a separate counter instead, and remembered that counter in the kvStore, then "erase transcript" would become "increment the counter", and the old transcript would remain in SQLite, where a relaunch from point A could find it (see the sketch after this list). This would be a lazy and expensive form of the "versioned DB" idea above, in that all versions would be kept forever, which would be unfortunate.
  • Move all of the kvStore into SQLite, which would make the problem go away entirely.
  • Assume that the chances of reaching point A without also reaching point B are small enough not to worry about.
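A rough sketch of the indirection-counter option above; the key names and helpers are made up for illustration:

```js
// Hypothetical sketch of the indirection-counter option. Key and method names
// are illustrative, not the real swingStore API.
function currentStreamName(vatID) {
  const counter = Number(kvStore.get(`${vatID}.transcriptCounter`) || '0');
  return `transcript-${vatID}-${counter}`; // used as streamName in the SQLite rows
}

function eraseTranscript(vatID) {
  // No SQLite rows are deleted: bumping the counter makes subsequent appends
  // and reads use a fresh, empty stream, while the old rows remain available
  // to a recovery that replays from the pre-crash kvStore state.
  const counter = Number(kvStore.get(`${vatID}.transcriptCounter`) || '0');
  kvStore.set(`${vatID}.transcriptCounter`, `${counter + 1}`);
  kvStore.set(`${vatID}.transcriptLength`, '0');
}
```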
warner added the enhancement (New feature or request) and SwingSet (package: SwingSet) labels on Dec 30, 2021

Tartuffo commented Feb 2, 2022

Just make sure it is working correctly in MN-1.

Tartuffo added and then removed the MN-1 label on Feb 2, 2022
Tartuffo added this to the Mainnet 1 milestone on Mar 23, 2022

warner commented May 11, 2022

We changed our minds: vat upgrade will not delete the old transcript; instead it just sets a "starting point" counter. This effectively truncates the prefix of the transcript that was used by the previous version, without actually deleting anything, and behaves like a more significant version of the existing counter that is tied to the current heap snapshot's starting point.
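Very roughly, the approach looks like the following; the kvStore key names and the streamStore read API here are assumptions for illustration:

```js
// Sketch of the adopted "starting point" approach; names are assumptions.
function beginNewIncarnation(vatID) {
  // Nothing is deleted: the start pointer simply moves past the previous
  // version's entries, truncating that prefix for replay purposes.
  const endPos = kvStore.get(`${vatID}.transcriptEndPosition`);
  kvStore.set(`${vatID}.transcriptStartPosition`, endPos);
}

function* replayCurrentIncarnation(vatID) {
  const start = Number(kvStore.get(`${vatID}.transcriptStartPosition`));
  const end = Number(kvStore.get(`${vatID}.transcriptEndPosition`));
  // Entries before `start` remain in the streamStore for cross-version replay.
  yield* streamStore.readStream(`transcript-${vatID}`, start, end);
}
```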

We figured that retaining the old transcripts would be necessary for a #1691-style sleeper-agent replay that needs to span multiple versions.

We don't explicitly record the upgrade events in the transcript, but they're trivial to identify because each version's span of transcript entries is bracketed by a startVat and stopVat delivery.
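For example, a cross-version replay could recover the version boundaries with something like the sketch below; the transcript entry shape assumed here ({ d: [deliveryType, ...args] }) is an illustration, not the actual format.

```js
// Illustrative only: find each version's span by looking for the startVat and
// stopVat deliveries that bracket it.
function findVersionSpans(entries) {
  const spans = [];
  let start = null;
  entries.forEach((entry, i) => {
    const deliveryType = entry.d[0];
    if (deliveryType === 'startVat') {
      start = i;
    } else if (deliveryType === 'stopVat' && start !== null) {
      spans.push([start, i]);
      start = null;
    }
  });
  return spans;
}
```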

We don't record the old bundlecaps anywhere. They don't currently fit with startVat (since startVat is a delivery made to an already-evaluated bundle), although if necessary we could either redefine startVat to take a bundlecap, or create a pseudo-delivery that arrives before startVat and means "here's a bundle, load it". OTOH, a #1691 replay wouldn't be using the original bundlecaps anyway.

@FUDCo implemented this counter in #3293.

warner closed this as completed on May 11, 2022