
non-deterministic GC #1872

Open · 2 of 12 tasks
warner opened this issue Oct 15, 2020 · 40 comments
Labels: enhancement (New feature or request) · needs-design · SwingSet (package: SwingSet)

Comments


warner commented Oct 15, 2020

(this tries to capture @dtribble 's "radical GC idea" from our 14-oct-2020 kernel meeting)

Task list:

  • chore: provide GC tools (WeakRef/FinalizationRegistry) to makeLiveSlots #1924 provide WeakRef/FinalizationRegistry to liveslots, or no-op stubs when platform does not have them
  • fix: add vatDecRef to kernel, offer to liveslots #1926 add kernel decref function (only logs), provide vat-specific vatDecRef to liveslots
  • fix: add WeakRef tracking to liveslots #1927 change liveslots import-side table (slotToVal) to retain weakrefs to Presences. Finalizers call vatDecref with count (excluded from replay fidelity checks). Keep strong refs to non-dropped objects. Tests: run with node --expose-gc, have the test program use gc() to provoke a drop, examine vatDecref calls.
  • add import-to-vat refcounts to c-list entries, increment in vatTranslator during delivery. Expose in c.dump(). Testing: examine c.dump() counts
  • introduce between-crank GC action phase
  • during GC action phase, execute vatDecref decrements. testing: examine c.dump() counts
  • when c-list refcount drops to zero, delete c-list entry. Do not propagate further yet. Testing: examine c.dump() clist entries.
  • enhance existing kernel table refcounts to include objects too
  • when c-list entry is deleted, trigger kernel-wide examination of refcounts. If export is the only remaining reference, delete OT entry, delete exporting c-list entry, stop there. Testing: examine c.dump()
  • add dispatch.dropExport to liveslots, delete slotToVal and exports entries. Testing: not sure
  • send dispatch.dropExport when deleting the exporting c-list entry

What is the Problem Being Solved?

@FUDCo estimated about 23 new kernel objects being created for each cycle of a basic "exchange" benchmark. Each of these objects consumes RAM in the liveslots object tables, as well as disk in the kernel clists.

Many of these objects will be Purses or Payments or other "core objects", which will be moved to secondary storage (disk) via #455 and some kind of makeExternalStore-style constructor. However there are likely to be several which aren't so core (e.g. a Notifier), and aren't so likely to migrate: it would be a significant ergonomics burden if we asked programmers to make sure every single object they created and exported were put into an external store.

The remaining objects will cause memory pressure, perhaps less than 23 per cycle, but still a significant problem. We need a GC solution. Vats which import an object and then cease to reference it must signal the kernel (somehow) that the entry is no longer needed. The importing vat's c-list entry should be removed. When all clients of a kernel object (c-lists and resolution data in the kernel promise table) are gone, the exporting vat's c-list entry should be removed, and the exporting vat should be signalled that the exported object can be dropped. The exporting vat should remove the entry from its liveslots tables, allowing the original object to be dropped.

The big problem, which has blocked us from making much progress on this front, is determinism. JavaScript now has WeakRefs, which let us attach a finalizer to fire when an object is dropped, but it makes no guarantees about when this finalizer runs. It could be immediately after the last strong reference is dropped, or a minute later, or a year later. In a chain-based swingset machine (i.e. cosmic-swingset), if the GC behavior is observable, consensus among multiple validators (with perhaps different memory pressures) could not be reached.

Previous Plan (which probably wouldn't work)

The setup: suppose vat A is a client of some object in vat B. So vat A has a Presence that wraps some imported object ID (o-4), vat A's c-list maps o-4 to a kernel object ko5, vat B's c-list maps ko5 to an exported o+6, and vat B's liveslots tables map o+6 to e.g. an exported Purse object. Vat A's liveslots tables manage the import by having slotToVal (a strong Map) map o-4 to the Presence, and valToSlot (a WeakMap) map the Presence back to o-4, and the initial problem is 1: we aren't using a WeakRef anywhere, and 2: the strength of slotToVal would prevent the Presence from being collected even once the vat drops it. On the exporting side, Vat B's liveslots table has slotToVal mapping o+6 to the original callable Purse object, and valToSlot mapping the purse back to the string o+6.

Previously, I was expecting a system in which liveslots uses WeakRef to detect Presences becoming unreferenced: we change slotToVal to use values which are new WeakRef(Presence), and add a finalizer/notifier (FinalizationRegistry) to sense when userspace has dropped it. At that point, liveslots removes the slotToVal entry (knowing the valToSlot entry is already gone), at which point it would send a syscall.drop('o-4') to the kernel. The kernel reacts to that by removing the vat A c-list entry for o-4, which then decrements the ko5 reference count. If/when all other importing vats (and any resolved promise data) cease referencing ko5, such that the only remaining reference is from the exporting vat (i.e. the kernel object table records vat B as the owner of ko5, and the vat B c-list is the only remaining reference), then the kernel deletes ko5 from the object table and delivers a dispatch.drop(o+6) to vat B. When this is delivered, the kernel deletes the vat B c-list entry, delivers the DROP, and then vat and kernel agree to "never speak of this ID again", just like they do for retired promise IDs.
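
To make the table mechanics above concrete, here is a minimal sketch of the import side, assuming illustrative names (`makePresence`, the exact shape of `syscall.drop`) rather than the actual liveslots code:

```js
const slotToVal = new Map(); // 'o-4' -> WeakRef(Presence)
const valToSlot = new WeakMap(); // Presence -> 'o-4'
const droppedRegistry = new FinalizationRegistry(slot => {
  // userspace has dropped its last strong reference to this Presence
  slotToVal.delete(slot); // the valToSlot entry is already unreachable
  syscall.drop(slot); // e.g. syscall.drop('o-4')
});

function providePresence(slot) {
  const wr = slotToVal.get(slot);
  let presence = wr && wr.deref();
  if (!presence) {
    presence = makePresence(slot); // hypothetical Presence constructor
    slotToVal.set(slot, new WeakRef(presence));
    valToSlot.set(presence, slot);
    droppedRegistry.register(presence, slot);
  }
  return presence;
}
```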

If it weren't for the nondeterminism of the FinalizationRegistry, this would probably work. But the timing of finalization calls is far too uncertain for us to rely upon in a consensus machine. Even if we require all validators to run the same version of the same engine, the behavior will depend upon memory pressure and perhaps the past history of the engine (heap fragmentation, etc). We aren't even sure we could use this in a non-consensus machine, because simply restarting the solo node (replaying the transcript) might result in different finalizer calls on the second time through, breaking consistency with the previous transcript.

What's New

@dtribble 's question was: what if GC wasn't visible to consensus? By relaxing the definition of consensus, we would allow vat execution to vary somewhat between validators: the messages they send must still be identical, but their GC decisions need not be.

In this scheme, we'd still have liveslots use a finalizer to sense the client dropping their last reference. Vat A would still notify the kernel about the DROP, but it wouldn't do it with a syscall, or at least the syscall it uses would not be part of the consensus state. Our transcripts record a list of deliveries, each with a list of syscalls made during the processing of that delivery. Vat replay means re-submitting each delivery, and comparing the syscalls made by the new vat against the ones recorded by the old one. In this case, we'd exclude syscall.drop from this comparison, allowing divergence between the previous run and the new one.

To be precise, we should distinguish between several different places where we compare the behavior of one execution trace against another, for different reasons:

Orthogonal Persistence

Orthogonal persistence is accomplished by starting from some initial state (possibly a #511 heap snapshot) and then replaying a transcript of previous deliveries. Each delivery includes a set of syscalls that were made in the first pass, along with the results of syscalls which include results (device reads, secondary-storage reads). We don't intend to give vats the ability to behave differently on the second pass, but to detect bugs in the surrounding code earlier, we want to detect replay divergence as soon as possible, so we currently compare the syscalls and their arguments against the transcript, and raise a vat-fatal "anachrophobia error".

For the narrow purpose of replay, the transcript really only needs to retain the deliveries, and the return values of the syscalls. When we add early divergence detection to the list of requirements, it would suffice to include a hash of the syscall arguments. In both cases, we can truncate the transcript each time we take a heap snapshot. (Note that we may record the full transcript, with all data, for debugging purposes, in particular to rebuild a vat locally under a debugger, and/or with additional logging enabled. This record is excluded from consensus, is generally write-only, and is independent of the one used for orthogonal persistence.)
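
As a sketch of the "hash of the syscall arguments" idea (the real transcript format may differ), the replay check could compare digests instead of full argument lists:

```js
import { createHash } from 'crypto';

// Record a digest of each syscall (for early divergence detection) plus its
// result (needed for replay); the transcript-entry shape here is an assumption.
function recordSyscall(entry, syscallObj, result) {
  const digest = createHash('sha256').update(JSON.stringify(syscallObj)).digest('hex');
  entry.syscalls.push({ digest, result });
}

// During replay, a digest mismatch is the vat-fatal "anachrophobia" case.
function replaySyscall(entry, index, syscallObj) {
  const digest = createHash('sha256').update(JSON.stringify(syscallObj)).digest('hex');
  const recorded = entry.syscalls[index];
  if (recorded.digest !== digest) {
    throw new Error('anachrophobia: replayed syscall diverged from transcript');
  }
  return recorded.result;
}
```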

Validator Consensus

In a high-credibility machine (i.e. one embedded in a blockchain), the behavior of the composite machine should not depend upon the behavior of any small subset of the validators. Only decisions made identically by a suitable majority (typically 2/3rds) of the validator power should be recognized by clients as authentic behavior of the "chain" as a whole.

But not all decisions need to fall into this category. Validators do not expose all of the inner workings of the shared application. In Cosmos, only data that is stored into the Cosmos state vector needs to be identical across validator nodes (and only that data can be validated by clients: anything else a client might be told will not have the same credibility as something in the state vector).

Our current rule is that the swingset kernels (and the vats they host) can "think" whatever they want, but anything they "speak" must meet consensus. The swingset-on-cosmos kernel writes outbound messages into the cosmos state vector for delivery to other systems (either through IBC or our initial custom protocol). Different validators could conceivably have vats which behave in different ways, as long as the comms-layer externally-targeted messages they create are identical. In practice, however, if the vat internals diverge, the messages are likely to diverge as well.

Fast Validator Catch-up

(This might be known as "fast sync" or "smart sync" in some environments.)

When a new validator joins the chain, it needs to acquire a set of vats which have the same internal state as the existing validators, so that they evolve in the same way and the new validator can produce blocks that the other validators will accept.

The slow-but-steady way to achieve this is to replay the entire chain from the beginning. The new node must find someone willing to feed it with a list of all blocks, starting with the genesis, and replay the transactions in each one. This causes the application state to evolve, one transaction at a time, producing new blocks. At each step, the node checks the consensus signatures and confirms that a threshold of validators has approved the block, and that the locally-generated block matches the one approved by the existing validators. A subset of these blocks will contain messages aimed at the swingset module, and processing these messages will cause the swingset kernel+vat state to evolve, following the same path (and generating the same outbound messages) as the swingset module in those earlier validators.

If the chain was mostly CPU bound the first time through, the new validator will probably be equally CPU bound. They have the advantage of not needing to wait for new blocks to be produced, but their only chance to catch up will be through the idle gaps in the earlier block history. As a result, new validators may take an extremely long time to reach the current state and become prepared to produce new blocks of their own.

The faster way to get caught up is to receive a pre-evolved state vector from some existing validator. If the data included the entire application state, and if the new node could rely upon the correctness of that data, it could simply drop the state vector into its own (empty) database and then pretend that it had just rebooted after being fully caught up. It would not have to replay anything, and a new validator could be brought online a few moments after receiving the state vector.

Those caveats point to two conditions that must be addressed. The first is completeness. Swingset does not keep all its state in the cosmos state vector; in fact, much of it is not kept on disk at all. The majority of the swingset state is in the RAM heap of the JavaScript engine, dispersed among all the vats. The rest is in the LMDB-based "KernelDB", on disk, which holds the c-lists and vat snapshots/transcripts. The RAM state can be regenerated, however, from the vat transcripts. So, given extra startup time, we can pretend that all swingset state is kept on disk. When catching up, we can take a faster path by starting from a vat snapshot, or by lazily loading vats (prioritizing the active ones, and deferring the idle ones). We can also parallelize vat reloads in multiple CPU threads, because each vat runs independently.

The second condition is the new validator's ability to rely upon the state vector it has received. This vector must be signed by an appropriate majority of the validators, otherwise a new validator could be tricked by a minority subset, violating the credibility properties. It is not sufficient for the other validators to merely examine and approve the output messages of the new validator: this demonstrates that the current behavior is correct, but does not protect against future divergence. In fact, if the malicious state provider successfully convinces the new validator to accept a state bundle that includes an #1691 -style "sleeper agent" (which doesn't activate until much later), and the new validator is, in turn, relied upon by other new validators to get that state bundle, a patient attacker could manage to eventually convince all validators to believe a state vector that does not match the original, and obtain arbitrary control over the entire chain.

For this reason, whatever fast-sync state vector a new validator might use must be subject to the same consensus rules as the blocks used to derive that state. In practice, this means all validators must periodically produce the vector themselves, and include its hash in their blocks. For these to be compared, all validators must produce the same vector (anything that goes into this vector is "consensus-critical").

To use this for swingset, the vat snapshot, transcripts, kernel c-lists, and kernel object/promise tables must all be identical among all validators.
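
One way to picture that requirement (a sketch only; the actual encoding and the exact set of pieces are open questions) is a single commitment over the swingset state that every validator would publish in its blocks:

```js
import { createHash } from 'crypto';

// Fold the consensus-critical swingset state into one hash; anything included
// here must be bit-identical across validators for fast-sync to be safe.
function swingsetStateCommitment({ objectTable, promiseTable, clists, snapshots, transcripts }) {
  const h = createHash('sha256');
  for (const piece of [objectTable, promiseTable, clists, snapshots, transcripts]) {
    h.update(JSON.stringify(piece));
  }
  return h.digest('hex');
}
```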

Description of the Design

So, the proposal from today's meeting is:

  • the liveslots slotToVal table for imported objects will map object IDs to WeakRefs (of Presences) rather than mapping to the Presence directly
    • (I think exported objects must be retained by strong references, which suggests we might need two tables)
  • a FinalizationRegistry is used to sense when the target Presence has been GCed
  • the registry's cleanup callback will:
    • remove the slotToVal mapping
    • perform a syscall.drop(objectID) call, to inform the kernel that the entry is no longer needed
  • the kernel reacts to syscall.drop by removing the entry from the vat's c-list, and (immediately? eventually?) performing a GC sweep through the c-lists
  • at some point, this may result in the kernel object ID becoming unreferenced by everyone except the exporting vat
    • the kernel will immediately deliver a dispatch.drop(objectID) to the target vat (bypassing the run-queue, so as to not give the target vat an opportunity to emit a new reference to the about-to-be-deleted export, which would need to cancel the drop)
    • the kernel will translate the kernel object ID into the vat ID, delete the entry from the c-list, then deliver the drop(vatObjectID)
  • the receiving vat's liveslots will react to the dispatch.drop by deleting the slotToVal table entry, which will drop liveslots' reference to the original exported object (see the sketch after this list)
    • the valToSlot WeakMap may still include a (weak) reference, if other code within the vat retains it
    • therefore, if the vat ever re-exports the same object, it will re-use its original object ID, un-dropping the ID. As long as both sides of each vat-kernel boundary agree on the drop and export events, this shouldn't cause a problem, but it might be better to somehow invalidate the old object ID and allocate a new one (which would probably involve changing valToSlot to point at some cell instead, retaining a strong slotToCell mapping, and storing an isValid flag in the cell next to the objectID).
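
A minimal sketch of the export-side reaction described in the last few bullets (the table names are illustrative, not the real liveslots internals):

```js
const exportedVals = new Map(); // 'o+6' -> exported object (strong reference)
const valToSlot = new WeakMap(); // exported object -> 'o+6'

function dispatchDrop(slot) {
  // only liveslots runs here: userspace gets no agency from a drop
  exportedVals.delete(slot); // drop liveslots' strong reference
  // valToSlot may still remember the object if other code in the vat retains
  // it, so a later re-export would reuse the same 'o+6' (un-dropping the ID),
  // unless we add the cell/isValid indirection mentioned above
}
```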

Consensus Consequences

For orthogonal persistence purposes, we must not include the drop syscalls in the replay comparison check: a replay delivery is considered successful if it makes exactly the same non-drop syscalls that the transcript recorded. This means filtering out any drop syscall before checking the transcript.
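
A sketch of that filtering step, assuming a simple per-delivery transcript-entry shape:

```js
// Drop syscalls are removed before comparing a replayed delivery against the
// transcript; everything else must still match exactly.
const withoutDrops = syscalls => syscalls.filter(s => s[0] !== 'drop');

function checkReplayFidelity(recordedSyscalls, replayedSyscalls) {
  const expected = JSON.stringify(withoutDrops(recordedSyscalls));
  const actual = JSON.stringify(withoutDrops(replayedSyscalls));
  if (expected !== actual) {
    throw new Error('anachrophobia: replay diverged from transcript (ignoring drops)');
  }
}
```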

Should transcripts include the dispatch.drop deliveries, such that replayed vats are told to drop exports in the same way their predecessor was told? Certainly they need to be told eventually, otherwise they'll use more RAM than their predecessor did. But we can tell them earlier or later without problems. One suggestion was to have replay watch the syscalls and build up a list of exported object IDs, then compare it against the current c-list, and immediately submit a bunch of drop deliveries for the missing ones (at the end of replay, before any new messages are delivered). When we move to snapshot+transcript, we might track "dropped exports" in a special set, so that the current (minimized) vat state can always be reconstructed by 1: loading the heap snapshot, 2: replaying the truncated (post-snapshot) transcript, and then 3: delivering drop for everything in the set. We can truncate the drop set when we make a new snapshot.

For validator consensus purposes, where we only put externally-targeted messages into the Cosmos state vector, it doesn't matter that some validators have a drop where others do not. As long as we aren't putting hashes of kernel (c-list) state or vat heap snapshots or drop-bearing transcripts into that state vector, validator behavior can safely diverge along the narrow drop/not-drop axis.

For fast-sync purposes, we have a problem. Heap snapshots where GC has happened (especially where dispatch.drop has been delivered) will be completely different from those of vats where it has not. Likewise c-lists will be missing entries where drop took place. These states will differ between validators. As a result, the data we use for fast-sync cannot use heap snapshots if drop is in use. We could work around this by only fast-syncing full from-the-beginning transcripts (which we would have to retain for this purpose, since normally we would truncate the transcript each time we took a heap snapshot).

The varying c-lists, however, are harder to deal with. The full swingset state includes the kernel object/promise tables, the c-lists for each vat, and the snapshot+transcript for each vat. Validators can't agree on the kernel state if the c-lists will depend upon whether a given vat did GC and emitted a drop or not.

@dtribble pointed out one workaround would be to abandon vat-at-a-time replay and instead/additionally record a totally-ordered full-kernel transcript (basically a list of every item added to the run-queue). Replaying this list would regenerate the c-lists as well as the internal vat state, which would remove the need to publish (and agree upon) the c-list contents. The cost would be:

  • validators must store information for fast-sync that would normally be discarded
  • fast-sync could not use snapshots to avoid playing an entire vat's transcript
  • fast-sync could not replay multiple vats in parallel
  • fast-sync could not lazy-load vats
  • fast-sync would take nearly as long to complete as the original runtime took to generate the kernel history: cosmos operations would be short-circuited, but every swingset operation must be repeated

If we went that direction, we could have a system in which "fast-sync" uses one set of data and "reboot catch-up" uses a different set. As @erights pointed out, nodes can safely rely upon their earlier history more so than upon data provided by outsiders. So the slow processing of "fast sync" might be a necessary compromise to allow GC to proceed.

Additional Considerations

Vats should not get access to the WeakRef or FinalizationRegistry constructors, so they should not be able to sense objects being GCed. No user-level vat code should run in reaction to a dispatch.drop: only liveslots will have control.

How should this interact with secondary-storage "virtual objects" (#455)? Ideally, when the exporting vat receives the drop, it should delete the relevant entries in secondary storage (which might cascade into deleting other virtual objects, somehow). However, even if the secondary-storage syscalls are not consensus-critical, the DB entries which back them are just as consensus-relevant as the vat's c-lists and transcript. This suggests fast-sync cannot include secondary storage either, since it may be missing contents that were released by GC, pushing further into the "replay a full kernel transcript" approach.

Comms vat: when GC is distributed across swingset machines, the drop message becomes externally visible (and thus subject to much stricter consensus rules). This is the point where we need deterministic behavior. I fear this is a "purple box"-level problem (thesis-level complexity).

Security Considerations

We have to be very careful to not allow user-level code to observe any non-determinism exposed by the GC mechanism, through a combination of:

  • vat code will not have WeakRef or FinalizationRegistry in its globals (SES should filter these out from non-start Compartments because of their nondeterministic nature, just like Math.random and Date.now), so it cannot sense GC directly
  • liveslots receives drop messages and acts upon them, but must not allow that to change the behavior that vat code can observe

The consequences of a leak would be that vat code (probably deliberately written to exploit the leak) could cause consensus failure: some validators would be slashed for behaving differently than others. If the divergence is severe enough, the chain could halt entirely if too few validators were able to agree upon new blocks.

The fast-sync state must be validated correctly. If an attacker who supplies alleged state data can get a new validator to accept malicious state, they can control the output of that validator, and eventually (with enough patience and time for the bogus data to spread) that of the entire chain.

@warner added the enhancement (New feature or request) and SwingSet (package: SwingSet) labels on Oct 15, 2020

warner commented Oct 15, 2020

How should the kernel keep track of which imports are still being used?

For each vat, the deterministic ("userspace") computation has a specific set of live imports: the Presence provided to userspace is either reachable from userspace or not. This reachability is the same across all executions of the same vat during that same delivery, regardless of which validator it is on, or how many times the vat has been reloaded from snapshot and replayed from transcript. Let's call this the "actual liveness".

The liveslots layer, using its WeakRef and FinalizationRegistry, senses a slightly different property, which we'll call "apparent liveness". An import which is actually dead may still be apparently alive, because the finalizer has not yet fired.

For each imported object (o-NN), the kernel can track when that object was provided to the vat: this marks the start of the liveness window. The kernel will also track when liveslots emits a drop message, which marks the end of the apparent liveness window (the actual liveness window is not visible to either liveslots or the kernel, but it is bounded by the apparent window).

If, on one run, the apparent window begins at delivery 4 (the kernel includes o-NN in a delivery argument for the first time), and ends at delivery 8 (when the vat does a drop syscall), the kernel knows that 1: the actual liveness window ended somewhere in delivery 4 through 8 (inclusive), and 2: the object is actually dead in delivery 9 and later. It is vat-fatal for the vat to reference the object ID in syscalls made during delivery 9 or later, unless the kernel re-introduces the ID in some new delivery. Let's pretend that we know the actual window ended during delivery 6, and it took two more deliveries before the GC infrastructure notified the finalizer.

Now, suppose the vat is reloaded from transcript. On this second run, the drop message appears during delivery 7. If the kernel remembered the previous run, it would know that the actual liveness window is still bounded by the earlier of the edges. If, in a third run, the drop appears during delivery 6, that narrows the window. If the kernel has ever observed a run where the drop appeared during delivery 6, then it could be vat-fatal to mention the ID during delivery 7, even if the new run hasn't yet emitted a drop. Except, ugh, vat-fatal errors are made visible to other vats, so we must be careful to not use this extra knowledge in a way that could reveal it to vats.

What I'm trying to sort out is what exactly the kernel needs to track. I'm thinking that the kernel is allowed to delete the import from the c-list as soon as it sees the drop. The c-list is not versioned: we only retain the latest copy. When the kernel reloads the vat, if the drop is emitted earlier this time around, that's fine, because the syscalls are being ignored during replay anyways. If it is emitted later, the kernel will observe a drop for an object ID that isn't currently in the c-list (because it was deleted during the previous run), so we must tolerate that case (silently ignore).

On the other edge of the window, if the kernel sees a syscall with an object ID that isn't in the c-list, that definitely means the vat is making an illegal access. But I think there are three cases:

  • the vat is accessing an object that it was never given, in which case a vat-fatal error is totally appropriate (it is a deterministic function of the vat's behavior, without the influence of GC timing which the vat is not supposed to be able to sense)
  • it once had access to the object, but it gave up that access by emitting a drop earlier during this run. This is clearly illegal behavior, but depends upon GC timing.
  • it once had access to this object, and the kernel has not yet seen a drop in this run, but on some previous run the vat did emit a drop, so that previous kernel removed the c-list entry. This is illegal behavior, but the kernel only knows that because it remembers GC activity from the previous run. The new kernel has contraband information from a previous run, giving it a narrower bound on the actual liveness window.

How do we keep the kernel from accidentally revealing its knowledge of GC activity (to other vats) when a vat references a previously-dropped import? If the kernel deletes the c-list entry when it receives the drop, it won't have enough information to handle a later syscall which references that ID. So it must either kill the vat, or retain the supposedly-dropped c-list entry so it can pretend that the object is still alive.

What exactly is the pathway by which the kernel might accidentally reveal its forbidden GC knowledge? If the kernel kills the vat when it references a previously-dropped object ID, that will have two effects: a message is sent to the vatAdmin (to notify the vat's parent, by resolving the done promise), and all subsequent messages sent to the dead vat's objects will reject (with a VatTerminatedError). These messages get pushed onto the run-queue, which is part of the saved kernel state (for local reload), but not part of the observable consensus state (validators agree on externally-targeted messages, not the internals).

However, once those termination messages are delivered, they may provoke externally-targeted messages, which are part of the validator consensus. So all validators must choose the same time to kill the vat, give or take some wiggle room: I think it depends upon the order of other messages currently in flight. If one validator's vatAdmin were notified one crank later than the other validators (still within the same block), but there were no external messages happening at about the same time, then an outside observer might not observe a difference in the external message ordering, and might not notice the divergence. However, we can't depend upon that.

So this is making me think the kernels really do need to agree upon when the vat is dead: the vatAdmin message must be delivered on the same crank across all validators. That limits our options. The kernel's drop-gleaned GC information is not admissible evidence for the crime of illegal syscalls. The kernel must pretend that it didn't see the drop, which means it needs to keep the c-list entry around (so it can route the illegal message).

That points to a new "live" column in the kernel object table. drop clears this flag, and triggers the delivery of drop to the exporting vat, but isn't allowed to influence the illegal-syscall check.

I'm still pondering what this means for the exporting vat.


warner commented Oct 16, 2020

@FUDCo and I discussed this some more. Both the "when should a vat be killed for an illegal syscall" question and the "when should the comms vat send a drop" question are troubling: they both represent consensus-visible behavior that depends upon GC timing that won't be consistent.

One wild idea we explored was a "drop oracle". The idea would be that the kernel somehow reacts to a drop syscall by telling an external program (maybe something in cosmos-sdk, but not part of consensus) that this particular validator believes o+NN has been dropped. This program then creates a signed transaction message to that effect, and submits it back into the chain. As an externally-introduced transaction, all validators will see it. The message is routed to a voting/governance cosmos-sdk module that knows about the validator staking/power, and that module accumulates drop messages until it sees a sufficient threshold of votes reached. At that point, the voting module submits a "drop the following object IDs" message to swingset, which then does the decref and maybe tells the exporting vat to drop if it was the last reference.

Vat termination for illegal syscalls could maybe be done the same way. Effectively all nodes vote a reference (or an entire vat) off the island. Death by committee.

The downside is, of course, that we're kind of inventing our own consensus layer, and that was on our list of things we shouldn't do. Drafting off the existing voting module probably reduces the hazards, but it still raises the question of whether the incentives are correct. If one validator maliciously claims that they've seen a drop from some vat, when nobody else has (and in fact their vats are still talking about that import in syscalls), what are the consequences? How would anyone else even tell that they're doing something malicious? Maybe the validator must periodically vote with a list of all the object IDs that they claim to have seen dropped, and equivocation can be grounds for slashing. (Does the existing cosmos-sdk voting/governance module have provisions for slashing to discourage vote equivocation? Or are votes one-sided and the only proof needed to carry the measure is a plurality of "yes" votes, because you can't even cast a "no" vote?)

@dtribble (Member) commented:

The only viable answer is that it cannot make an illegal syscall.

If vat A exports an object O under kernel key K1, the incoming drop means there are no references to O from outside A. That was true at the end of the actual liveness window, above. So no external event can cause a use of the kernel's key for A. But there may be references inside A. So the only problem is if A exports O again. Does it reuse K1 or does it get another key?

A simple fallback answer is that it uses K1. When the drop arrives, liveslots doesn't drop the reference to O, it moves it to a weak pointer. Thus, as long as O is retained in A, if it's exported, it will be exported with the same K1. Only when O is dropped in A as well will the kernel be told it will never see K1 again.

I think we can do better than that, but the fallback is implementable. For example, resolved promises will never be exported again. It's really only presences for which this could be an issue.


warner commented Oct 16, 2020

That sounds like a decent approach for the exporting side. I don't think it matters too much whether we reuse the exported ID or allocate a new one. Although we need more bookkeeping if we want to allocate a new one (the valToSlot WeakMap might still remember the old ID, but in this case we're not supposed to use it, so we'd need to change that map to point to a record that is also pointed to by a strong map, and contains a flag that says "don't use me, allocate a new ID instead"). OTOH if we allow the ID to be reused, that might complicate some kernel-side data structure that keeps track of which IDs are alive or not; I haven't thought about that structure yet. My hunch is that reusing IDs will be simpler overall.

My concern has mostly focused on the importing side. But I think it's safe to say that illegal syscalls can only happen if the liveslots layer is buggy. If liveslots is functioning correctly, then even a malicious userspace should not be able to trigger an illegal syscall. So it's a question of how we define the TCB. I don't want to give liveslots the authority to break the integrity of any other vat than its own, but maybe our fallback position is that we do give liveslots the authority to break the swingset as a whole (by making syscalls whose legality depends upon GC-sensitive nondeterminism), and rely upon our implementation to not exercise that authority.

I don't see how that helps us with cross-swingset (comms-vat) drops, alas.


warner commented Oct 16, 2020

Hm, the TCB argument is stronger than I realized: of course we're giving liveslots access to nondeterminism, because that's the only way to get GC information about what userspace is doing. So of course we're forced to rely upon liveslots not misusing that authority. Which means I think we can stop worrying about a buggy liveslots causing divergence, because we don't have any other choice.

So that leaves comms-vat drops. Ugh.


warner commented Oct 16, 2020

Oh, and we probably can have the comms-vat drop references from a solo machine, which might mitigate most of the problem: how frequently do our solo-side clients provide objects to the chain, rather than the other way around?


warner commented Oct 16, 2020

On the export side, I think we should go with arrival-order nondeterminism.

Each drop syscall will check the c-list of the no-longer-referencing vat. If present, it will remove that entry and add the kernel object ID (koNN) to the "maybe-free" list (which lives in the persistent DB). If not present, the syscall is ignored (this will happen when we did a suspend/reload since the import was visibly dropped, and the new vat's finalizer happens to run later than the previous one did). We want to wrap up our interaction with the previous vat before we do anything involving other vats, so at the end of the crank, we commit the DB as usual. In addition to the normal crank buffer DB changes, for each non-redundant drop, we expect to see one addition to the maybe-free list, and one removal from the c-list (of the dropped import).

Then, before moving on to the next item on the run-queue, we pop each item from the maybe-free list and search for remaining references. These references might come from other vats' c-lists (worst case it could scan them all, best case we maintain a reference count), or from the resolution/rejection data of unretired promise table entries. If any references still exist, the kernel does nothing. If there are none, the kernel reads the owner (exporting) vat of the object, deletes the object table entry, and performs a delivery of dispatch.drop to the vat.
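
Here is a rough sketch of that maybe-free servicing loop (the helper names are invented; the real kernel would keep this state in the DB rather than in in-memory structures):

```js
function serviceMaybeFree() {
  while (maybeFreeList.length > 0) {
    const koid = maybeFreeList.pop();
    // references can come from other vats' c-lists or from the resolution
    // data of unretired promises; a maintained refcount avoids a full scan
    if (remainingImporterCount(koid) > 0) {
      continue; // still referenced somewhere else; do nothing
    }
    const ownerVatID = objectTable.get(koid).owner; // the exporting vat
    const exportID = clistKernelToVat(ownerVatID, koid); // e.g. 'ko5' -> 'o+6'
    clistDelete(ownerVatID, koid);
    objectTable.delete(koid);
    // this delivery may provoke new syscall.drops, which land back on
    // maybeFreeList and are drained by this same loop
    deliverToVat(ownerVatID, ['dropExport', exportID]);
  }
}
```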

This delivery will look up the target kernel object (koNN) in the exporting vat's c-list, delete that c-list entry, then call the dispatch.drop function of the target vat with the corresponding vat object ID (which should always be an export, hence of the form o+NN).

This delivery is expected to cause liveslots to delete the slotToVal (strong) Map entry for o+NN. (We may want to rename this map to something with exports in the name, to distinguish it from the weak references held by the import map). If nothing else in the vat is holding on to that exported object, it will become eligible for collection, which might cause other objects to be released, which might cause the vat to drop some imports. Hence the delivery of drop might provoke new syscall.drops to be emitted during delivery. These new drops will be removed from the c-list (if present) and pushed onto the maybe-free list for later processing, like above.

The dispatch.drop delivery is added to the exporting vat's transcript as usual. If/when this vat is rebuilt from snapshot+replay, we must re-delete the strong Map entry. The order of the deleting is fixed by the kernel when it decides to inform the exporting vat of the drop (arrival-order nondeterminism).

Once the dispatch.drop delivery is complete, the kernel commits the crank as usual. The contents of the crank buffer are expected to be:

  • one removal from the maybe-free list
  • if that was the last reference:
    • one removal from the exporting vat's c-list
    • one dispatch.drop addition to the vat's transcript
      • no syscalls other than syscall.drop: userspace should not get control
    • if that caused some imports to be dropped:
      • additional removals from the c-list
      • additions to the maybe-free list

The kernel should probably drain the maybe-free list before returning to the run-queue.

bikesheds:

  • name for the import-dropping syscall
  • name for the export-dropping delivery (it might be nice to have a name that's different from, but clearly related to, the syscall)

minor design questions:

  • layout of the new (split) import/export tables
  • persistence of the maybe-free list

larger design questions:

  • I'm thinking this should strictly be for Presences, and should ignore Promises, at least for now, because Promises can be resolved and then retired (which handles some portion of our GC needs), and we need to figure out some additional design:
    • does a .then callback keep the Promise alive? We'd need that to be true, to safely use a WeakRef on the import-side Promise to know when we can drop it
    • we'd need to examine the subscription tables and make sure they're updated correctly
    • the notion of "import" and "export" is less clear, and we need our refcounts to only measure imports, not whatever "export" means. (if a promise resolves in the woods, and nobody is around to hear a callback, did it make a reference?)
    • however, being able to immediately drop an unused result promise would be awfully nice


warner commented Oct 16, 2020

Hm, if we did apply this to Promises as well, then a crank which does an implicit sendOnly (i.e. x~.foo(); rather than x~.foo().then(stuff); or const saveForLater = x~.foo();) would potentially drop the result Promise so quickly that the finalizer would fire in the same crank, which would be ideal. The syscalls we'd observe would be:

  • syscall.send with result=p+NN
  • syscall.drop(p+NN)

The send will create a promise-table entry whose "decider" is set to null (because the send on the run-queue is the decider), and lists the sending vat as the only subscriber. The drop should ideally delete the promise-table entry and change the run-queue entry to remove the result= field, since there's only one subscriber and they just unsubscribed. When this crank is committed, the DB changes should be indistinguishable from ones made by a vat that had done an explicit sendOnly (with result=null and no drop).

If the sender actually uses the Promise (perhaps they pipeline a message to the result, or they share the result promise with some other vat), then this is less clear. Pipelined messages whose ultimate result promise is dropped should also drop the intermediate results. Shared promises should not be dropped until the last subscriber has dropped.

The "imports" of a promise-table entry are the subscribing vats, plus the run-queue messages that target it. The "exporter" of a promise-table entry is the deciding vat, if any, or the run-queue entry which holds that promise as a result.


FUDCo commented Oct 16, 2020

> Although we need more bookkeeping if we want to allocate a new one (the valToSlot WeakMap might still remember the old ID, but in this case we're not supposed to use it, so we'd need to change that map to point to a record that is also pointed to by a strong map, and contains a flag that says "don't use me, allocate a new ID instead"). OTOH if we allow the ID to be reused, that might complicate some kernel-side data structure that keeps track of which IDs are alive or not.. haven't thought about that structure yet. My hunch is that reusing IDs will be simpler overall.

My intuition is the opposite: that generating new IDs will be much simpler, because it means we can retire old ones when we think they're dead instead of having to keep track.


warner commented Oct 18, 2020

Task list (obsolete, use checklist in first comment instead):

  • fix: add WeakRef tracking to liveslots #1927 change liveslots import-side tables to retain weakrefs to Presences. Finalizers log but nothing else. Probably need to split slotToVal/valToSlot into separate import- and export-side maps, to accommodate different strong/weak needs. Tests: run with node --expose-gc, have the test program use gc() to provoke a drop, manually inspect the log.
  • fix: add WeakRef tracking to liveslots #1927 add syscall.drop. kernel-side handler ignores it. drop is excluded from replay fidelity checks (or excluded from the transcript entirely, must decide). Testing: use gc() from test program, assert syscall.drop is in transcript (or other test-visible record)
  • syscall.drop deletes importing vat c-list entry. Testing: have a non-liveslots vat call syscall.drop directly (no gc() needed), then examine the c-list.
  • implement maybe-free DB list, syscall.drop adds to list, list is serviced before run-queue, but "service" is a no-op. Testing: kernel-side, no liveslots
  • implement refcount check, service-maybe-free determines if object is now only referenced by exporting vat, but action is to merely log. Testing: kernel-side, no liveslots
  • implement dispatch.drop, but liveslots ignores
  • implement liveslots dispatch.drop by deleting exports-map entry. Simultaneously change kernel implementation to delete exporting-vat clist entry. Testing: not sure, maybe manual.


warner commented Oct 20, 2020

Hm, I just thought of a wrinkle: the FinalizationRegistry might run the notification callback at any time, which means it might be called in between cranks, or when the vat is not supposed to be active. That means we can't immediately send a syscall.drop when it fires: instead we must wait until the vat is supposed to have agency, which means when the next inbound delivery is made. (Liveslots doesn't learn when the crank is finished, merely when the crank is begun, and when invocations are made which generate syscalls).

Worst case, a vat which is told to shut down (and politely deletes all the internal references it can, to accelerate GC) might not actually see the finalizers run within that same crank. And, if nobody ever talks to that vat again, it won't ever get agency again, which means we won't have a safe place to emit the drop syscalls, leaving that vat's references fixed in place, unable to be collected. Ugh.

One slight mitigation would be to have the notifier set a flag and check that flag each time liveslots gets control (so each time an outbound message is sent). That might increase our chances of getting the drop out, but it's no guarantee.

Another approach would be to add an explicit delivery type whose only job is to check the finalizer flags and report any dead references. We might call this immediately after every crank finishes. I don't like the overhead this would represent, but it's hard for me to resist the idea of adding a perfectly-named dispatch.bringOutYourDead method. It might be sufficient to call it once every N cranks or something.
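
A sketch of what such a dispatch.bringOutYourDead delivery might do inside liveslots (names and the every-N-cranks scheduling policy are assumptions):

```js
const finalized = new Set(); // vrefs whose finalizers have fired
const registry = new FinalizationRegistry(vref => finalized.add(vref));

// invoked by the kernel as its own delivery, e.g. once every N cranks
function bringOutYourDead() {
  for (const vref of finalized) {
    finalized.delete(vref);
    const wr = slotToVal.get(vref);
    if (!wr || !wr.deref()) {
      slotToVal.delete(vref);
      syscall.drop(vref); // safe: we are inside a delivery, so the vat has agency
    }
  }
}
```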

It might also be possible to blur the line between vat and kernel a bit more, and have these notifications go directly to the kernel, rather than being handled entirely by liveslots. In particular, if the kernel didn't learn about them through a syscall (but rather through some other channel that travels the same kernel-worker protocol as syscalls), then it might be ok for them to occur outside the usual "only during a delivery" window. This would also remove them from the transcript entirely, which means one fewer exception to implement.


erights commented Oct 20, 2020

> This would also remove them from the transcript entirely, which means one fewer exception to implement.

Would the syscall interface go back to carrying only deterministically logged things? If not, what are the other exceptions?


erights commented Oct 20, 2020

> Worst case, a vat which is told to shut down

Is shutting down distinct from terminating?

If a vat is terminating, why do we care what drops it reports before terminating. Terminating will deterministically drop everything anyway, yes?


warner commented Oct 20, 2020

> Would the syscall interface go back to carrying only deterministically logged things? If not, what are the other exceptions?

Yes, I think if we moved syscall.drop to some other pathway, then all syscalls would be deterministic again (so the transcript could contain, and check, all syscalls, removing the exceptional case where we must strip out drop syscalls before performing the comparison during replay).

> Worst case, a vat which is told to shut down

> Is shutting down distinct from terminating?

Yeah, I didn't mean termination, I meant something smaller, in which the vat as a whole keeps running, but the code inside it knows that its job is complete, so it does as much cleanup as it can (empty all the Stores to facilitate GC) while it still has control. More like emptying out the fridge before a holiday of indeterminate length, rather than burning down the entire house.

It might not be that common/useful of a case. Maybe an auction contract that we keep around for a long time but which only gets used occasionally, and which builds up a lot of references to external objects (which we'd really like to drop) when it does get run. Imagine the contract handles a dozen messages, and on the last one, it knows that this particular invocation is complete, so it drops all those references. If the finalizers don't run until after the vat loses agency, it won't have a chance to tell the kernel that it doesn't need them anymore, and we won't be able to GC upstream until the next auction happens.

> If a vat is terminating, why do we care what drops it reports before terminating. Terminating will deterministically drop everything anyway, yes?

Agreed.


erights commented Oct 20, 2020

This reminds me of the other spill-to-disk persistence scaling problem I've been meaning to raise, that it sounds like you've already been thinking about:

Just like we assume that there are a moderate number of active purses at any time, but a zillion idle purses that should not be occupying any ram, we should make the same assumption about vats. That there are a moderate number of active vats at any time, but a zillion idle vats that should not be occupying any ram.

Likewise, most intervat capabilities will be from an idle vat and to an idle vat, and therefore also themselves idle. These idle caps between idle endpoints should not take ram bookkeeping space in the kernel, in comms vats, or in captp-on-ibc. How much of this is already implied by our current plans?


warner commented Oct 20, 2020

Yep, the heap-snapshot and worker-process design is meant to support vats being demand-paged into RAM. We'll have the option of loading any given vat as soon as the process starts, as late as when someone sends a message to them, or some point in between. The actual policy/heuristic is an open question: the concern is that the vat might take a significant time to load (replay), and if we wait until the last moment, block processing might stall while we catch up. We might be able to peek at the escalators to make a guess about what deliveries are coming up, and use any leftover time to load the vats while the messages are still climbing. OTOH, if there are any messages at all remaining on the escalators at the end of the block, that indicates we're sufficiently overloaded that we can't keep up (we're now rationing/allocating execution time), so we might not have any spare CPU cycles to load vats. OT3H there will be multiple CPU cores, and we might not be able to use all of them for normal message processing: maybe we wind up with one (serialized) core taking things off the escalator, leaving the other cores available to page in vats which have messages on their way.

The inter-vat caps will consume space in:

  • client Presence (import): RAM, unless it's stored as data in a virtual object which is not active, or the vat as a whole is paged out, in which case disk
  • client vat c-list: disk
  • kernel object table: disk
  • target vat c-list: disk
  • target vat object (export): RAM, or disk for an inactive virtual object or a paged-out vat
  • comms vat c-list: disk
  • comms vat routing tables: currently RAM, but the plan is to move all of that to disk

So I think idle caps shouldn't take up any RAM when we finish implementing virtual objects, paging out idle vats, and comms tables on disk.


warner commented Oct 20, 2020

I had an idea last night to address the timing / drop-notification channel issue.

  • the vat worker (the "supervisor" that lives in the same process as liveslots and the vat code, and which handles the quiescence tracking) creates a Set, which will hold vat-side object IDs (vrefs) that are ready to be dropped. It shares this Set with liveslots.
  • the FinalizationRegistry callback adds a vref to this set, which can happen at any moment (to be precise, between any turn), even when the vat is supposed to be idle, because other vats, or the kernel, might do something which triggers an engine-wide GC pass, which finally notices some of the idle vat's objects.
    • we maintain the "vat does nothing when it's idle" property by ensuring that the only thing the callback does is add the vref to the set. This is weaker than before (vat code, specifically liveslots, still gets control when the vat is supposed to lack agency), but still maintains the "no syscalls from supposedly-idle vats" rule which is what we really care about.
  • each time a delivery arrives at liveslots, just after we've unserialized all the references (so now we have strong refs to the Presences/etc cited in the message), we remove all the cited vrefs from the set
  • when the vat becomes quiescent, the vat worker/supervisor code is responsible for sending a "delivery result" back to the kernel (via whatever channel it has to the kernel-process-side vat manager). This code will include a copy of the Set (as an Array of vref strings) in the VatDeliveryResult message (so ['ok', droppedVrefs] instead of just ['ok']). The Set will be cleared at the same time. (See the sketch after this list.)
  • the VatManager, upon receipt of the VatDeliveryResult, will delete these entries from the c-list and push the corresponding kernel-side reference strings (krefs) onto the maybe-free list, for later processing.
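
A sketch of that supervisor/liveslots handshake (the VatDeliveryResult shape and helper names are assumptions for illustration):

```js
const deadSet = new Set(); // shared between the supervisor and liveslots
// Presences are registered here when created; the callback may fire between
// cranks, and all it does is add the vref to the Set.
const droppedRegistry = new FinalizationRegistry(vref => deadSet.add(vref));

async function deliver(delivery) {
  // liveslots deserializes first, removing any vrefs cited by this delivery
  // from deadSet (they have just been re-introduced)
  await dispatchToLiveslots(delivery, deadSet);
  await waitUntilQuiescent(); // the vat has now lost agency
  const droppedVrefs = [...deadSet]; // snapshot...
  deadSet.clear(); // ...and clear
  return ['ok', droppedVrefs]; // the VatDeliveryResult sent to the kernel-side manager
}
```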

Benefits are:

  • the kernel can continue to believe that vats are idle unless a delivery is active: the kernel will not receive syscalls at random times
  • drop is not delivered through a syscall, so we don't have to change the transcript replay checks to exclude it
  • I think this should correctly handle the race/TOCTTOU hazards. The finalization callback, by careful design of our TC39 heroes, runs in its own turn, so the Set won't be modified during beginning-of-crank delivery processing, nor during end-of-crank processing, both of which happen in a single turn (no await breaks)
    • the vat might lose some objects while it's supposed to be idle, and these need to be reported eventually (the image in my head is: you wake up in the morning, and your slippers are missing, because the GC cat took them away while you were sleeping)
    • the new delivery might cite a vref which the vat lost between deliveries (the Slipperman, like a milkman but delivers slippers, brought you a new pair first thing in the morning)
    • the deserialization of the vref will re-populate the liveslots table: same o-NN key, new Presence object (the full-service slipperman puts the new slippers right back in the place where the old ones were)
      • vat code dropped the old Presence so it can't tell that it's been given a new object (you weren't wearing your slippers when the cat dragged them away, so you won't realize you've got new ones)
      • we must remove the vrefs from the Set at that point (you're no longer missing your slippers)
    • during the crank, the vat might drop the refs again, which means we should add them back to the Set (the cat might steal your slippers during the day)
    • at the end of the crank, we report the Set to the kernel (the, um, Department Of Slipper Auditing, to whom you must report the total number of slippers in your household for tax purposes, wants to know whether they're present or not)
    • there's no way to acquire a new copy of the vref until the beginning of the next crank, so it is correct to drop everything in the Set at this point. There might be more drops between end-of-crank and the next start-of-crank, but there won't be any additions.
  • if finalization callbacks happen promptly, this will free up more objects than a scheme in which syscall.drop is made at the next start-of-crank: we have a larger window to observe the callbacks, allowing us to report the drops earlier.

The drawback is that it still suffers from the late-drop issue above. A vat which stops using a bunch of objects, but for which the finalization callbacks don't occur until after the end-of-crank is processed, will hold on to those references until some future delivery to that same vat. If the vat is now idle for a long time, that's a lost GC opportunity.

Worse, from my experiments it appears that Node.js (at least) decided to make GC notifications ride the IO/timer queue, not the promise queue. So nothing will be added to the Set at all until after our waitUntilQuiescent() call completes. It might be sufficient to add a second waitUntilQuiescent to the supervisor's delivery code, to allow these notifications to run before we sample and clear the Set.

I don't know how GC happens in our JS engines. I'm used to Python, where refcounting causes non-cyclic GC to happen immediately (too fast, in fact, the weakref callback can happen in the middle of some other bytecode operation, so the only safe thing you can do in the callback is to write to a pipe-to-self and notify a safer handler). The mark-and-sweep is only needed to free up cycles, and you can minimize the need for that by manually breaking cycles when your operation is done. In JS, I can imagine smooth performance could be improved by deferring all GC until some later time, even if a refcount drops to zero and you know you could free the memory immediately. If that's the engine's choice, we might have a tradeoff between how long we wait at end-of-crank and how much GC happens in a timely fashion.

@erights (Member) commented Oct 21, 2020

I don't know how GC happens in our JS engines. I'm used to Python, where refcounting causes non-cyclic GC to happen immediately (too fast, in fact, the weakref callback can happen in the middle of some other bytecode operation, so the only safe thing you can do in the callback is to write to a pipe-to-self and notify a safer handler).

No refcounts in JS so GC never happens immediately by simply dropping a reference. GC might happen in a turn, but our WeakRef design delays all notifications to happen in later turns. As you suspected earlier:

I think this should correctly handle the race/TOCTTOU hazards. The finalization callback, by careful design of our TC39 heroes, runs in its own turn, so the Set won't be modified during beginning-of-crank delivery processing, nor during end-of-crank processing, both of which happen in a single turn

The rest of this seems right. Good stuff!

@erights (Member) commented Oct 21, 2020

Node.js (at least) decided to make GC notifications ride the IO/timer queue

This does not surprise me. There was a move to have the spec require this, but I think that died. Not sure --- we should check the spec. If it did die, XS may gc more promptly. But how significant is the difference to us? Doesn't it just mean that things might stick around for one more crank than they should? Do we care?

@warner (Member, Author) commented Oct 21, 2020

No refcounts in JS so GC never happens immediately by simply dropping a reference.

Oh, that's disappointing. My initial tests are showing that if I don't provoke Node with a gc() call (under node --expose-gc), then a simply-dropped non-cyclic object does not get finalized "soon", meaning I dropped the reference and waited a few hundred cycles of the timer/IO queue and never saw the finalizer get called.

GC might happen in a turn, but our WeakRef design delays all notifications to happen in later turns.

Not just "later", but it could be much much later. If the engine defers GC until it is feeling memory pressure (hours? days?), and our design depends upon the importing vat getting at least one delivery after the finalizers have run, then big+intermittent vats will be a problem: they might know the imports can be dropped, but they can't safely express it until they get some runtime, and that won't happen until someone sends them another message, which could be days or weeks later, if ever.

This does not surprise me. There was a move to have the spec require this, but I think that died. Not sure --- we should check the spec. If it did die, XS may gc more promptly. But how significant is the difference to us? Doesn't it just mean that things might stick around for one more crank than they should? Do we care?

If the notification happened promptly (e.g. because of a refcount going to zero):

  • if the notification happened on the promise queue, the finalizers would run while the vat is still processing the delivery, so the Set of dropped refs would be known by the time waitUntilQuiescent finishes (i.e. the promise it returned has resolved)
  • if it happens on the timer/IO queue, the finalizers would run just after waitUntilQuiescent finishes, so the Set wouldn't be updated until after the vat had finished with the delivery, so the kernel wouldn't learn about them until the next delivery (which could be days away, for a little-used vat)
    • we could compensate for that by calling waitUntilQuiescent a second time, immediately after the first, to give the finalizers a chance to run before the delivery is complete

But since notifications don't appear to happen promptly, the difference is moot. They'll happen some arbitrarily long time in the future (depending upon memory pressure), and then on the delivery after that point, the vat will have a chance to drop the imports and allow the kernel to do its higher-level refcounting.

I'm feeling drawn back to dispatch.bringOutYourDead, but I have no idea how frequently it ought to be called.

@warner (Member, Author) commented Oct 22, 2020

Notes from today's kernel meeting:

The VatTP refcounting protocol is designed to correctly manage object-table drops between two asynchronously-coupled systems. Each side maintains a table of counters, indexed by object-id. When A sends a message to B, it increments the counter for every object-id included in the message. When B receives the message, it maps the object-ids to local representatives or Presences by looking at its table. If the table has an entry for the object-id, B increments the counter and uses the old Presence. If there is no such entry, B creates a new Presence and sets the counter to 1. When B learns that the Presence is no longer referenced locally (e.g. through a FinalizationRegistry or equivalent), it sends a DECREF message to A along with the last counter value that was stored. When A receives DECREF(object-id, N), A decrements its counter by N. If the result is zero, A removes the entry from its table (and propagates the drop further upstream, typically by dropping its strong reference on the exported object). If the result is greater than zero, that means a new reference (from A to B) passed the DECREF (from B to A), and we expect B to reconstruct a new Presence.

In swingset, the link between kernel and vat is synchronous. However the link between Presences becoming unreferenced and the finalizers running is asynchronous (because the finalizer runs in some later turn, perhaps much much later, since JS likes to do extremely lazy GC). We can use this same protocol.

The liveslots import table will map object-id to a record containing a Presence and a mutable counter. The finalizer will reference the mutable counter:

if (!slotToVal.has(vref)) {
  // first (or re-) introduction of this vref: new Presence, counter starts at 1
  const counter = { count: 1 };
  const p = createPresence(vref);
  slotToVal.set(vref, { p, counter });
  registry.register(p, counter);
  return p;
} else {
  // already-known vref: reuse the Presence and count the additional mention
  const { p, counter } = slotToVal.get(vref);
  counter.count += 1;
  return p;
}

The finalizer will send a DECREF(counter.count) to the kernel. The kernel's c-lists will contain a count of mentions, incremented for each appearance of that object-id in each delivery.
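
A minimal sketch of that finalizer, assuming liveslots has been handed a vatDecRef(vref, count) function by the kernel; here the registration carries both the vref and its counter as the held value, a small variation on the snippet above:

const registry = new FinalizationRegistry(({ vref, counter }) => {
  // The Presence is gone: report every mention absorbed since the last
  // DECREF, then forget the record so a later re-import starts fresh.
  const count = counter.count;
  slotToVal.delete(vref);
  vatDecRef(vref, count);
});

// at import time:
//   registry.register(presence, { vref, counter });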

@erights had a trick for managing the counter correctly despite a large number of mentions (potentially overflowing any reasonable-sized counter value). He used a deliberately small 16-bit counter. When the receiving side got close to overflow (perhaps at 2^15, leaving headroom for the race), the receiver would send a DECREF for the old value minus one, and reduce its counter to 1 (not zero). The sender would see the DECREF and reduce its counter with no risk of prematurely dropping the object. This works as long as the rate of wire-crossing messages isn't high enough to risk sending more than the threshold value.
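
A sketch of that overflow-avoidance rule on the receiving (importing) side, assuming a 16-bit counter and an illustrative sendDecref(vref, n) channel back to the exporter:

const FLUSH_AT = 2 ** 15; // half the 16-bit range, leaving headroom for the race

function noteMention(vref, counter) {
  counter.count += 1;
  if (counter.count >= FLUSH_AT) {
    // Flush all but one unit: the exporter subtracts it but can never
    // reach zero this way, so neither side drops the object prematurely.
    sendDecref(vref, counter.count - 1);
    counter.count = 1;
  }
}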

I'm still trying to find a sensible channel for vats to send these DECREFs to the kernel, but this protocol probably makes it safer to send the messages outside the context of a given message delivery. The kernel would react to DECREF by adding the (object-id, decrement-count) pair on the maybe-free list, and then at a safe time (between cranks) it could examine this list and perform the subtraction.

The trouble with DECREFs that can be sent at any moment is how a nominally-blocking vat-worker subprocess should deliver them. I was hoping these subprocesses could operate with a single pair of pipes to the kernel process, spending all of their idle time in a blocking read, either waiting for a new delivery or for the results of a syscall. But maybe that is compatible with sending DECREFs at other times after all: perhaps the worker subprocess does a blocking read of the kernel->worker pipe, but is allowed to write to the worker->kernel pipe any time it likes. The kernel will be doing a non-blocking read of the worker->kernel pipe anyways (since it supports multiple worker processes), so it could receive DECREFs at arbitrary times. We would need to add some synchronization on shutdown to make sure we don't drop DECREFs as the process is being torn down, though.

warner added a commit that referenced this issue Oct 26, 2020
These two authorities are not part of SES, so they must be pulled from the
globals of the Start Compartment and ferried through the kernel to the vat
manager factory that calls makeLiveSlots.

This gives the outer layer of the vat (liveslots) access to nondeterminism.
We rely upon liveslots to not share this power with the user-level ocap-style
"vat code". Liveslots must never allow user-level code to observe behavior
that depends upon GC activity, because that activity is not part of the
specified input to the vat.

refs #1872
warner added a commit that referenced this issue Oct 26, 2020
Node.js v14 provides `WeakRef` and `FinalizationRegistry` as globals. Node.js
v12 does not (there might be a command-line flag to enable it, but I think
it's marked as experimental).

Rather than require all users upgrade to v14, we elect to disable GC when
running on v12. This change attempts to pull `WeakRef` and
`FinalizationRegistry` from the global, and deliver either the real
constructors or `undefined` to the liveslots code that uses it. We'll write
that liveslots code to tolerate their lack.

refs #1872
refs #1925
warner added a commit that referenced this issue Oct 26, 2020
The upcoming GC functionality will require `WeakRef` and
`FinalizationRegistry`. Node.js v14 provides these as globals, but v12 does
not (there might be a command-line flag to enable it, but I think it's marked
as experimental). Rather than require all users upgrade to v14, we elect to
disable GC when running on v12.

This adds a local `weakref.js` module which attempts to pull `WeakRef` and
`FinalizationRegistry` from the global, and exports either the real
constructors or no-op stubs.

refs #1872
refs #1925
warner added a commit that referenced this issue Oct 26, 2020
These two authorities are not part of SES, so they must be pulled from the
globals of the Start Compartment and ferried through the kernel to the vat
manager factory that calls makeLiveSlots.

This gives the outer layer of the vat (liveslots) access to nondeterminism.
We rely upon liveslots to not share this power with the user-level ocap-style
"vat code". Liveslots must never allow user-level code to observe behavior
that depends upon GC activity, because that activity is not part of the
specified input to the vat.

refs #1872
warner added a commit that referenced this issue Oct 26, 2020
In the future, when liveslots implements GC and discovers that a Presence has
ceased to be referenced by the user-level vat code, it will call this
function to decrement the kernel's reference count for the imported object's
c-list entry.

`vatDecRef` might be called at any time (although always in its own turn).
The kernel will eventually add the decref information to a queue, to be
processed between cranks. For now, the kernel only records the information if
an option was set to enable it (for future unit tests).

Most of this patch is the kernel-worker protocol wiring to allow
child-process vat workers to deliver the decref back up to the kernel
process.

Liveslots does not use this yet. A future patch will switch it on.

refs #1872
@warner (Member, Author) commented Dec 2, 2020

"Formal" vs "Local" Reference Graphs

We start by acknowledging two distinct views of the reference graph. The "formal" graph is the one agreed to by consensus. It must be a deterministic function of the specified vat/kernel behavior, independent of any JS engine-specific behavior (such as garbage collection and heap snapshots). In contrast, the "local" graph is private to each member of the consensus machine: each validator may have a different local graph. The local graph is allowed to be influenced by non-deterministic actions like GC weakrefs and finalizers. If we weren't obliged to provide a stable consensus view that remains consistent under replay, the local graph would be the only one we cared about: local references are "actual" references.

The local graph will be a non-strict subset of the formal graph. If an object is referenced locally, it must also be referenced formally. But it might be referenced formally without also being referenced locally. This latter state (a "nominal" reference: formal but not local/actual) is expected to be the most common.

[diagram: formal vs. local reference graphs]

Where References Come From

A swingset machine contains two tables which track kernel objects and promises respectively, each of which has a kernel-side identifier (a kref). These are the nodes of the reference graph. We have four places in the kernel which can contain krefs, which form the edges of the graph:

  • The kernel run-queue contains queued deliveries (for message-sends like x~.foo(args)) and notifications (to inform a vat about the resolution status and/or data of some promise to which it previously subscribed). These queued events will reference kernel objects (as the target of each delivery) and promises (as the result promise of each delivery, and as the subject of each notification). In addition, the serialized arguments and resolution data can contain references to both kernel objects and promises.
  • The kernel promise table will contain the resolution data for any promise which has been resolved but not yet retired. This resolution data can contain object/promise references.
  • The kernel object table currently only tracks the owner of an object (the vat to which its messages should be sent). But we've considered (Add reflection information for presences #59, Clearer presence support; towards remote interfaces and Una #804) allowing pass-by-presence objects to include some amount of immutable data, such as an Interface name (a string). If this feature grows to allow this data to include references to other objects, the kernel object table would become a source of references.
  • Finally, each vat is associated with a c-list, which maps the kernel-side krefs to vat-side vrefs, and back again.

Kernel Data Structures

The kernel object and promise tables include a formal reference count. This lets us discover nodes which are formally unreachable without performing a full mark-and-sweep pass (but of course one will be necessary to prune cycles). When the formal refcount drops to zero, the object is formally unreachable, and can be deleted entirely. All validators of the consensus group will have the same formal refcount.

These tables also contain a local reference count, to discover locally-unreachable nodes. Each validator may have a different local refcount.

Finally, each entry has a "local in-use" flag that indicates whether this entry is locally reachable or not. If the local refcount drops to zero (but the formal refcount is nonzero), this flag is cleared, and any outbound references from the table entry are decremented from the local refcount of the targets. (This flag might be redundant; we could use local_refcount === 0 instead).

The c-lists for each vat contain a kref, a vref, and a "local in-use" flag. If this flag is True, the c-list entry counts towards both the formal and the local refcount for the target object/promise. If it is false, the entry only counts towards the formal refcount.
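
For concreteness, a sketch of the entry shapes this implies; field names are illustrative, not the actual kernel schema:

// kernel object/promise table entry
const kernelEntry = {
  owner: 'v3',          // exporting vat (objects) or current decider (promises)
  formalRefcount: 4,    // consensus-visible, identical on every validator
  localRefcount: 1,     // validator-private, may differ between validators
  localInUse: true,     // cleared when localRefcount reaches zero
};

// c-list entry for one vat
const clistEntry = {
  kref: 'ko24',
  vref: 'o-5',
  localInUse: true,     // true: counts toward both refcounts; false: formal only
};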

What We Agree To

Our consensus engine will operate upon a subset of the kernel state. To support fast-sync, we need to maintain consensus on enough kernel state to allow a new validator to download and verify that state and then launch vats with less effort than starting the entire chain from scratch. In particular, these new validators can avoid replaying dead vats. However, since they cannot take advantage of (engine-specific) heap snapshots, they must replay the transcript of each live vat from the vat's initial code bundle (but they can be lazy and only load vats on demand).

We redact the "local refcount" and "local in-use" columns from the consensus state. To facilitate this, these columns may be stored at different keys of the HostDB. However their values are still updated in atomic transactions with the rest of the entries.

We include vat transcripts in the consensus state, but these do not contain the local-only messages described below.

Imports, Presences, WeakRefs, Finalizers, The Dead Set

"Vat Imports" are created when a delivery arrives at the vat and cites (as an argument) some vref that the vat does not remember seeing. The liveslots layer of the vat deserializes the inbound arguments, which consults the slotToVal table for each vref. If the entry is missing, liveslots will create a new Presence (or Promise) object, create a WeakRef around it, and store the WeakRef as the value of slotToVal. It also uses the Presence directly as the key of the valToSlot WeakMap for translation in the outbound direction. Then, liveslots adds the Presence to a FinalizationRegistry, using the vref as the associated data. These finalizers will be described later. Finally, the vref is removed from the "Dead Set", also described later.

If the slotToVal entry is present, it will be a WeakRef, but its target may or may not still be alive. If liveslots is able to use .deref() to retrieve the previously-stored Presence, that Presence is used directly. The vref is again removed from the Dead Set before further processing. If the WeakRef was dead, liveslots creates a new Presence object, replaces the WeakRef, and again removes the vref from the Dead Set.

(This test-for-liveness and active management of the Dead Set replaces the increment/decrement counter in the previous design).

The "Dead Set" is a Set of vrefs which are known to be referenced formally but not locally. The finalizer callback looks at the vref whose Presence was recently garbage-collected. It looks in slotToVal for a WeakRef. If the vref is missing, or if the WeakRef it finds there is dead, the vref is added to the Dead Set. This guards against a finalizer running after a vref has been re-introduced into the vat: the Dead Set must not contain any vref that is still locally reachable. If the WeakRef is present but dead, the finalizer deletes the WeakRef from the slotToVal table (to save memory, not for correctness).

The kernel has a special delivery type named bringOutYourDead (perhaps we could find a more official-sounding name, but the Monty Python fan in me couldn't let go of the image). This method returns and clears the contents of the Dead Set. These are vrefs which are formally reachable but not locally/actually reachable. This delivery is local-only: neither the fact of its invocation nor its results are included in the vat's transcript, and it remains out of the consensus state.

Each vat has a way to tell the kernel that it might have a non-empty Dead Set (a function it can invoke). It will call this when the finalizer runs. The kernel adds the caller's vatID to the "May Have Dead" set that will be serviced later (between cranks). Finalizers can run at any moment (in their own turn), so it is not safe for the kernel to do anything but queue the information for later.
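
The two halves of that handshake might look roughly like this (forVatID, wakeKernel, and the kernel-side set are illustrative names):

// liveslots side: local-only delivery, excluded from the transcript
function bringOutYourDead() {
  const dead = [...deadSet]; // vrefs that are formally but not locally reachable
  deadSet.clear();
  return dead;
}

// liveslots side: called from the finalizer, safe at any moment
function notifyMaybeDead() {
  wakeKernel(forVatID); // the kernel only queues the vatID, nothing more
}

// kernel side: queue now, service between cranks
const mayHaveDead = new Set();
function wakeKernel(vatID) {
  mayHaveDead.add(vatID);
}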

Each vref thus goes through the following state diagram:

  • UNKNOWN: not present in slotToVal / valToSlot / deadSet
  • LIVE:
    • slotToVal.get(vref).deref() === Presence
    • valToSlot.get(presence) === vref
    • deadSet.has(vref) === false
  • (locally) UNREACHABLE: dead but not collected, because GC hasn't yet run. The Presence isn't reachable from vat code, but the WeakRef still knows about it
    • slotToVal.get(vref).deref() === Presence
    • valToSlot.get(presence) is uninvokable by vat code
    • deadSet.has(vref) === false
  • (locally) COLLECTED: GC has run, the WeakRef is dead, but the finalizer hasn't run yet
    • slotToVal.get(vref).deref() === undefined
    • valToSlot.get(presence) is uninvokable by vat code
    • deadSet.has(vref) === false
  • FINALIZED: the finalizer has run
    • slotToVal.get(vref) === undefined (because the finalizer deletes the WeakRef)
    • valToSlot.get(presence) is uninvokable by vat code
    • deadSet.has(vref) === true (added by the finalizer)

When the kernel first informs a vat about a vref, it moves from UNKNOWN to LIVE. If the vat code then forgets about the Presence, it moves from LIVE to UNREACHABLE. If the kernel re-introduces the vref, it moves back to LIVE, until the vat code forgets about it again. Eventually it moves from UNREACHABLE to COLLECTED when GC runs. Still later it may move to FINALIZED when the finalizer runs.

When the kernel invokes bringOutYourDead, it will be informed about any vref in the COLLECTED and FINALIZED state, and since this clears the Dead Set, those vrefs are thus moved to the UNKNOWN state.

Heap snapshots should preserve these states correctly: a vref in the UNREACHABLE state when the snapshot is created will still be there when the snapshot is restored, and eventually (in its new incarnation) it will be collected and finalized.

Kernel Delivery / Import-Reaping Cycle

When the kernel processes a delivery, it must translate the krefs in the delivery (or promise-resolution notification) through the vat's c-list into vrefs. The same is true for the return value of device syscalls, and (maybe) per-vat storage read syscalls.

For each vref it translates, the c-list's local in-use flag is set. This happens both for newly-allocated vrefs and for existing ones. When the kernel tells a vat about an object or promise, we are telling it both "for real" (we expect the vat to create a Presence, which is a real/local reference), and "formally". The presence of the vref/kref in the c-list constitutes the "formal" reference, while the truthy state of the local-in-use flag makes up the "local" reference.

(TODO: this portion is about vrefs that represent imports, not exports. The local-in-use flag for exports is probably handled differently. I need to think about how this affects the delivery sequence.)

Later, after the delivery is complete (we have committed the crank to the HostDB, including any changes to the c-list, including the local-in-use flag), the kernel processes any queued "maybe dead" events by processing the vat-IDs in the "May Have Dead" set. This is the set of vats which had finalizers run recently, which might include the vat to which the delivery was just made, but might also include arbitrary other vats (perhaps GC was triggered by memory pressure). In Node.js it appears that the finalizers don't run until the IO queue is serviced, so we might want to insert an additional setImmediate() call or two so they have a chance to fire.

For each vat-ID being processed, the kernel invokes the vat's bringOutYourDead() function, which returns a list of vrefs. Each vref in this list is known to be not locally reachable. At this point, the vat is quiescent (liveslots was given control to execute bringOutYourDead but the vat-level code was not, so the promise queue should be empty). With no queued turns to cause the deserializer to run, there is nothing to resurrect a dead vref until the next delivery is made. In contrast, a vref which was previously alive (weakref.deref() !== undefined) but not reachable ("UNREACHABLE") might spontaneously die at any point in the future (when the engine decides to perform GC), moving to COLLECTED or FINALIZED.

The kernel translates each vref through the c-list into a kref, then clears the c-list's local-in-use flag, but leaves the rest of the c-list entry intact. The kernel then looks up the kernel object/promise table entry for the kref and decrements the local-refcount field, to reflect the vat no longer holding a local reference to that object. The decremented kref is added to a "maybe free" Set.

When all the queued finalizer events have been processed, from all vats, the kernel can process the Maybe Free Set (it could also defer this for later, if desired). Each kref in this set indicates a kernel object/promise whose local-refcount was reduced recently. We examine this refcount to see if it dropped to zero. If so, the object/promise's local in-use flag is cleared, and we loop through all the krefs referenced by the now locally-unreachable object. For each one of those items, we decrement its local-refcount and add the kref to the Maybe Free Set for further examination.
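
A sketch of that sweep; kernelTable and krefsReferencedBy are illustrative accessors over the kernel object/promise tables:

function processMaybeFree(maybeFree) {
  while (maybeFree.size > 0) {
    const [kref] = maybeFree; // take an arbitrary member
    maybeFree.delete(kref);
    const entry = kernelTable.get(kref);
    if (entry.localInUse && entry.localRefcount === 0) {
      entry.localInUse = false; // locally unreachable, though still formally alive
      for (const target of krefsReferencedBy(entry)) {
        kernelTable.get(target).localRefcount -= 1;
        maybeFree.add(target); // re-examine everything we just decremented
      }
    }
  }
}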

When we're done, some number of kernel objects/promises may be locally unreachable. They will probably be formally reachable: the consensus state cannot be told about GC-triggered events, so the consensus state thinks these items are still alive. However we know that they will not be invoked again, so they can be marked as unused (and unusable) locally.

Note that any reference from the run-queue will prevent this: anything reachable from the run-queue must be kept alive (for real, not just formally), so that messages can be delivered. However the resolution data of a locally-unreachable kernel promise will not keep any included kernel objects locally-alive.

Kernel Export-Reaping Cycle

For each newly locally-unreachable object, the last remaining local-in-use citation will be in the c-list of the exporting vat. The kernel needs to clear this flag and perform a special dropLocalExport(vref) delivery to the vat. Like bringOutYourDead(), this delivery is excluded from the transcript and the consensus state.

dropLocalExport() tells liveslots to look up the Remotable which was originally exported, and delete it from the exports Set. Once this happens, the only remaining strong reference to the Remotable (if any) will be in the vat code which originally exported it. If such a strong reference exists, the valToSlot WeakMap will still know about it, and the vat might re-export it again in the future (whereupon it will get the same vref as before, and since the kernel c-list entry is still present, it will get the same kref, etc). But if the exporting vat forgot about its export (which we expect to be the majority case), this will allow the Remotable to be reaped.

Liveslots should attach a Finalizer to the exported Remotable, but unlike the import finalizer (which will add the vref to the Dead Set and notify the kernel that it May Have Dead), the only job of the export finalizer is to delete the WeakRef from the slotToVal table, to save some memory.

Do Refcounts Include The Export?

(TODO: this still needs more thought)

We need to learn when a vat's exported object/promise is no longer ("locally") referenced by the kernel. For this, we must decide whether the refcounts should include the exporting vat or not. One possibility is:

  • we declare that local-refcount does include the exporting vat
  • when the local-refcount drops to 1 (not 0), we call dropLocalExport, clear the local-in-use flag in the exporting entry, and reduce the local-refcount to 0
  • when it falls to zero, we can decrement anything the now-locally-dead object/promise was referencing, which walks the next set of edges in the graph

This might help with scheduling of the dropLocalExport call, which we might queue for later if we're busy. (Calling dropLocalExport is likely to trigger more finalizers, and maybe more GC work, so it might be good to have a way to defer it until we're idle). When we've marked the object/promise as non-(locally)-imported, but not yet delivered dropLocalExport, its local-refcount will sit at "1". If we make a regular (message) delivery to the vat before we manage to do dropLocalExport, and that delivery causes the object to be re-exported, that will re-increment the local-refcount (e.g. because the kref is in the arguments of a new run-queue message). When the dropLocalExport gets to the top of whatever queue it's on, we can cancel the call if the local-refcount is no longer "1".

I'm not yet sure how to think about Promises here. For objects, there is one clear exporting vat, and some number of importing vats, and the object is "dead" once the exporting vat is the only one left. But for Promises the decider can shift around. Any vat which has subscribed to hear about resolution (which is any vat that has ever heard about the promise) has basically imported it: vat code was given a Promise object and it might remember it.

Note that references coming from the kernel object/promise tables (e.g. in resolution data) always provide a formal reference, and might also provide a local reference if (and only if) the "local-in-use" flag is set. References coming from the run-queue always provide both a formal and a local reference. This way the formal refcount will never be lower than the local refcount. Something might be kept alive by an entry in the run-queue even though no vats have it in their c-lists (perhaps the result promise of a message sent by a vat which then dies before the message is delivered).

Dropping Formal References

While we expect most opportunities for GC to occur locally, when importing vats have their WeakRefs die and their finalizers run, we may occasionally be able to delete formal references too. This is a consensus-visible change to the reference graph, and can thus reduce the amount of information that new (fast-sync) validators must process. It also allows the kernel to delete object/promise/clist table entries for real, saving space in the database (not just RAM in the exporting vat). Clearly we would prefer to do lots of GC in the full formal reference graph, because that provides the most savings, but in practice we expect most to be merely local.

When an importing vat dies, all its c-list entries are deleted. This removes the formal references to the krefs it used to contain, which decrements the formal refcount field of the kernel tables. If these drop to 0 (or 1, see above), we can formally delete the kernel table entry. Note that the local-in-use flag is effectively cleared when the c-list entry is deleted, so if it wasn't already False, this decrements the local-refcount field too. The local-refcount should never be higher than the formal-refcount.

We may allow the exporter of an object to "revoke/terminate/cancel" it, after which any invocation (in any vat) will throw an error instead of delivering a message through the kernel to the exporting vat. If, in addition, we do not require/allow these revoked objects to retain their identity, then the importing vat might not need to reference the kref anymore (note that this seems unlikely: we've generally assumed that dead objects retain their identity). If that ends up being the case, then the object-revocation sequence might allow the importing vat to formally drop its import, eventually allowing the exporting vat to formally drop the export.

Another possibility is the use of alternate languages inside vats. While JavaScript doesn't have a deterministic specification for GC, other languages might (or we might choose to settle on a specific engine, with deterministic GC). Vats which execute this other language could inform the kernel that they no longer need an import with a new syscall.dropFormalImport. This syscall would be included in consensus and the vat transcript.

Finally, the comms vat might receive notification from the remote machine that an export is no longer needed. The comms vat would use syscall.dropFormalExport to inform the kernel of the change. Everything the comms vat does is part of consensus, so the comms vat will never visibly react to an incoming dropLocalImport, nor would its bringOutYourDead() ever return anything.

For the exporter side, I expect we'll have a delivery named dropFormalExport. Unlike the local version, this delivery is part of consensus, and gets included in the vat transcript.

When a vat is referencing off-machine objects via the comms vat, and that vat dies, we should definitely take advantage of the opportunity to prune the comms tables, so dropFormalImport and dropFormalExport are certainly necessary, even if it requires vat termination to exercise.

Replay

When the kernel process is restarted, it will need to replay any active vat from its transcript (this can be deferred until someone talks to the vat, and avoided entirely for dead vats). We want to make sure that any RAM savings we managed during the original run of the vat are also achieved in the replayed version.

During replay, syscalls are disconnected. The vat worker executes all deliveries without talking to the kernel: each syscall the vat makes is compared against the transcript, but not delivered to the kernel. As this runs (or more likely during some setImmediate breaks we should insert in the replay process), the Presence finalizers will fire, adding those vrefs to the Dead Set. When replay is complete, the Dead Set will be full of vrefs that were "delivered" (from the transcript) and then dropped by the vat. Ideally this will contain nearly all of the imported vrefs: the only remaining locally-reachable vrefs should be the last few, the ones that are still in active use.

At this point, at the end of replay, the kernel or vat worker should call bringOutYourDead(), and get a large list of vrefs. This list should be processed just like the regular kernel loop would: iterate through the dead set, for each one see if the c-list's local-in-use flag needs to be cleared, if we clear it then decrement the local refcount, etc.

We might see locally-unreachable vrefs during replay that we didn't see the first time around (the vref is mentioned by bringOutYourDead but its c-list local-in-use flag is still set). This means that the Presence was UNREACHABLE or COLLECTED at the time we committed the last crank, but we just didn't notice yet because it hadn't moved to FINALIZED.

There might be a vref whose local-in-use flag is clear, but which does not appear in the bringOutYourDead result. This tells us the Presence made it all the way to FINALIZED in the previous run, but in this run it is still in the UNREACHABLE or COLLECTED state. We could expect to see it mentioned in a future bringOutYourDead call, at which point we'll ignore it (the local-in-use flag is already clear, so it won't qualify as a decrement).

We might see a bringOutYourDead for a formally-unreachable vref (the c-list entry is missing entirely), which can be ignored. This indicates that our previous invocation had (eventually) done a syscall.dropFormalImport, allowing us to delete the c-list entry. We expect to see the dropFormalImport call appear somewhere later in the transcript.

On the export side, we need to reconcile the exports of the replayed vat against the known-locally-referenced set in its c-list. As the transcript is played, liveslots should collect a set of exported vrefs. When replay is complete, some mechanism should iterate through this set and check the vat's c-list for each: if the local-in-use flag is False, the vat should be given a dropLocalExport for that vref. This gives the replayed exporting vat a chance to shed its unused exports (which, again, we hope will be most of them).

Incremental GC during replay

The previous algorithm applies all GC results at the end of transcript replay, which means long-lived vats will build up some huge heap and then discard nearly all of it. It would be better to perform whatever GC we can during the replay process, rather than only at the end.

For exports, this is pretty easy: the vat worker's replay loop can perhaps execute 100 transcript entries, collecting exported vrefs, then pause to probe the c-list for those vrefs, delivering dropLocalExport for any that are missing. This is enabled by the fact that we replay each vat independently: we don't need to remember when the export became unreferenced. The fact that the previous kernel's last memory was that the export was un-locally-referenced means that we can "predict the future" during replay and tell the vat to dropLocalExport much earlier than we would have the first time around. Note that a vat which re-exports the same vref multiple times will potentially get multiple dropLocalExport calls for that vref. The important thing is that it receives at least one dropLocalExport call, and that the call appears after the last export happens.

For imports, the memory pressure would come from having a very large Dead Set. We could batch transcript entries as before, and after each batch we call bringOutYourDead and compare the results against the c-list's local-in-use flags. We don't expect the total set of cleared flags to be large, so we could defer actually processing them until after replay was complete.
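
A sketch of the batched replay loop this suggests; the batch size, c-list accessor, and delivery helpers are illustrative:

async function replayWithIncrementalGC(transcript, vat, clist) {
  const BATCH = 100;
  const pendingImportClears = new Set();
  for (let i = 0; i < transcript.length; i += BATCH) {
    for (const entry of transcript.slice(i, i + BATCH)) {
      await vat.replayDelivery(entry); // syscalls checked against the transcript, not executed
    }
    // exports: the previous kernel already knew these were locally unreferenced
    for (const vref of vat.takeRecentExports()) {
      if (!clist.localInUse(vref)) {
        await vat.dispatch.dropLocalExport(vref);
      }
    }
    // imports: drain the Dead Set so it doesn't grow for the whole replay
    for (const vref of await vat.dispatch.bringOutYourDead()) {
      pendingImportClears.add(vref); // c-list processing can wait until replay ends
    }
  }
  return pendingImportClears;
}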

Formal GC during replay

We can ignore syscall.dropFormalImport during replay (to be specific, we can assert that the replay invokes dropFormalImport at exactly the same times as it did the first time around, just like we do for all syscalls). Its effects on the c-list should already be included in the saved state: the entire c-list entry should be missing. In particular, if the vref was introduced to the vat multiple times, interspersed with drops by the vat, the c-list entry should be present only if the last thing that happened was an introduction, and missing if the last thing that happened was a drop.

dropFormalExport is similarly deterministic: it will appear in the transcript and should be delivered to the vat at the recorded times during replay. This might trigger finalizers to run (if the exported Remotable was used as the key of a WeakMap, and the corresponding value was keeping some other imported Presence alive), which will be handled by the local-reachability tools as usual.

Heap Snapshots

We currently don't want to specify our chain to use a specific JS engine, which means heap snapshots cannot be part of the consensus state. (This is a pity, because they would provide a super-efficient starting point for fast-sync). Heap snapshots will be used locally, to allow any individual validator to restart faster, but their contents won't be revealed to anyone else.

We must make sure that the recorded HostDB and heap snapshot can cooperate to return us to the same local/formal GC state as the first time through.

The main feature needed, which ought to fall out of any correct snapshot mechanism, is that the LIVE/COLLECTED/FINALIZED state is retained properly. If a snapshot is taken at a point when a Presence is COLLECTED, but the finalizer hasn't run, we need to know that reloading that snapshot into a new heap will eventually call the finalizer. And if the finalizer has run before the snapshot was taken, it must not run again in the new incarnation (although I think our protocol would tolerate this particular misbehavior).

@warner (Member, Author) commented Dec 5, 2020

GC of imported Promises

As mentioned, I'm still thinking through how this relates to Promises, rather than objects (Remotable on one side, Presence on the other). The question is what happens when a Promise gets imported/exported through the vat/kernel boundary, and then gets forgotten by one side or the other.

In general, we hope Promises to be easier to manage than objects, because 1: most Promises eventually resolve, and 2: we aren't obligated to maintain identity for resolved Promises. As a result, when a resolution is sent across the vat/kernel boundary, we also "retire the vref" (i.e. the vpid). This removes it (formally) from the vat's c-list, as well as the internal liveslots table. "The matter is finished, the question is settled, let us not speak of this again", which is enforced by the fact that both sides delete all record of the vpid so they can't bring it back up. The kernel-side promise table entry (the kref, i.e. a kpid) remains present until all other formal references to the kpid are removed, at which point it can be formally GCed and the table entry deleted. If the kpid is sent back into the vat before it can be deleted, the kernel will create a new vpid to describe the (now-resolved) Promise, add a new c-list entry (with the new vpid and the old kpid), and will send a resolution into the vat immediately following the message that referenced it.

This gives us more opportunities for formal GC than for objects, because we're allowed to delete the vref even though userspace might still hold a reference to the JS Promise object. But we still need to understand what happens when vat code stops referencing a Promise, and what the kernel observes. We're using the same slotToVal structure for everything, so we need to know what that WeakRef will see.

JavaScript Promises And Their References

The ECMAScript specification on the Promise constructor (and on Promise instances) is intentionally silent on the question of which objects retain strong references to which others, because overspecification would limit engine implementers' ability to add performance optimizations. In addition, Promise can be subclassed, so overridden methods might want access to things that the default implementation does not, introducing more references.

There are basically four things of interest: the Promise itself, the resolution functions (resolve and reject, as handed to the "executor" callback passed as new Promise(executor)), any callbacks provided to Promise methods like .then/.catch/.finally, and the value to which the Promise is ultimately settled.

There are two references we can assume exist based upon their importance to keep Promises working as expected:

  • If you have access to a settled Promise, then you can use .then to obtain access to the settlement value. So the Promise must keep a strong reference to this settlement.
  • If someone has used .then to attach callbacks to an unresolved Promise, then anyone with access to resolve/reject can cause those callbacks to be run, regardless of whether the Promise is still reachable. So either resolve/reject must keep a strong reference to the list of unexecuted callbacks, or they must reference something else which does the same.

I ran some WeakRef experiments with Node.js (version 14.8.0) to deduce where the internal references seem to lie:

[diagram: promise-gc — observed internal references among the Promise, resolve/reject, .then callbacks, and settlement value]

In this diagram, the dashed lines (A -> B) mean there appears to be a strong reference from A to B (i.e. B will not be GCed unless A is GCed first), but the specification doesn't appear to require one. Note that this does not mean someone holding A can get access to B: the strong reference is internal. The solid line indicates both a strong reference and that the holder of A can get to B.

It appears that the resolve/reject functions hold a strong reference to the Promise object itself, whether in a resolved or unresolved state. And an unresolved Promise seems to hold a strong reference to any .then/etc callbacks which have been added (these refs go away when the Promise is settled and the callbacks are invoked). A settled promise holds a strong reference to the settlement object, as expected.

As we build something around this, we must be prepared for the dashed edges to be missing, because engines are not required to maintain them. For example, there's no obvious reason that the resolve function needs to keep the Promise alive: it needs to keep the queue of callbacks alive, but not necessarily the Promise itself.
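
For reference, the kind of probe behind those experiments looks roughly like the following (run under node --expose-gc so gc() is available; this observes engine behavior, not anything the spec guarantees):

// Does the resolve function keep the Promise alive once nothing else references it?
let resolve;
let wr;
{
  const p = new Promise(r => { resolve = r; });
  wr = new WeakRef(p);
} // only `resolve` and the WeakRef survive this block
setImmediate(() => {
  gc(); // provoke a collection (--expose-gc)
  setImmediate(() => {
    console.log(wr.deref() ? 'resolve kept the Promise alive' : 'Promise was collected');
  });
});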

There is no obvious way to sense when .then has been called, which introduces a requirement described below. This is unfortunate because the sendOnly pattern is very common: if we could sense that a Promise was dropped without ever having a .then function called on it, we could GC the Promise object sooner. (Note that one of @michaelfig 's source-source transforms for the tildot operator would parse the code to spot this case, and use a different HandledPromise invocation, which would solve the problem at a different layer).

Terminology

For objects, which have a single clear exporter (and zero or more importers), we have three nouns: Remotable, Kernel Object, and Presence. The first and last are used in the exporting and importing vat, respectively. They are both vat-code-visible JS Object instances that coordinate with the more-abstract kernel-side entity, through syscalls and deliveries. Within the kernel, the "kernel object" is not represented by an actual JS Object (remembering that the kernel might not even be written in JavaScript): instead, it is represented by a row in a database table, and referenced by short strings (kref values, like ko24) rather than references in a JS object graph.

For Promises, the situation is ever so slightly murkier. Promises have a single creator, but its identity isn't very interesting. Instead, message routing depends upon the current decider, which can change over time as resolution authority is passed from a vat, into the kernel, and off to some other vat, before being consumed in an act of resolution. The abstract "promise" is represented in all vats by an actual JavaScript Promise object, plus a variety of liveslots tables which coordinate its behavior with the kernel. User-level vat code can interact with this Promise as if it were the same Promise that every other vat was using (modulo things like resolved promises losing their identity). As with objects, within the kernel there is a database table entry to manage the promise state, rather than a real JavaScript Promise object.

A vat might create a native Promise object and then export it into the kernel (as the argument of some message delivery, or inside the resolution of some other promise). In this case, liveslots must arrange to notify the kernel when the local vat resolves that Promise. Vats also create a result Promise in the process of sending a message through the kernel: the liveslots code that implements HandledPromise.applySend creates and returns a Promise instance. In this case, liveslots must prepare for the kernel to resolve the promise, and cause the Promise object's resolve/reject callback to be invoked at the right time. Vats might also import a promise from the kernel (in arguments), which prompts liveslots to create and deliver a Promise just like it would for the result of an outbound message send.

Vats Which Import a Promise

When a vat imports a Promise, convertSlotToVal notices the new vpid and uses makeImportedPromise(slot) to build a new HandledPromise for delivery to the user-level vat code. The resolve/reject functions are stored in a "pRec", which is itself stored in the importedPromisesByPromiseID table (a strong Map). The Promise itself goes into slotToVal (a Map) and valToSlot (a WeakMap), just as a Presence does during object import, so that future inbound references to the promise (via slotToVal) are translated into the same Promise object, and future references this vat makes in outbound messages are translated (via valToSlot) into the same vpid vref.

Later, if/when the kernel notifies the vat of the promise being resolved, dispatch.notifyFulfill* looks up the pRec and invokes the resolve or reject function.
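
Roughly, in code (ignoring the HandledPromise machinery, serialization, and error handling; names follow the text above):

function makeImportedPromise(vpid) {
  let res;
  let rej;
  const p = new Promise((resolve, reject) => {
    res = resolve;
    rej = reject;
  });
  // strong Map: keeps resolve/reject (and, in practice, any .then callbacks) reachable
  importedPromisesByPromiseID.set(vpid, { resolve: res, reject: rej });
  slotToVal.set(vpid, p); // will become a WeakRef once the GC work lands
  valToSlot.set(p, vpid);
  return p;
}

function notifyFulfill(vpid, value) {
  const pRec = importedPromisesByPromiseID.get(vpid);
  importedPromisesByPromiseID.delete(vpid); // resolved: the strong ref is no longer needed
  pRec.resolve(value);
}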

When we change slotToVal to map to WeakRefs instead of strong references, Promises will be treated the same way (so convertSlotToVal doesn't need to special-case anything). We must ensure that Promises are kept alive, when appropriate, using something comparable to the exports set proposed above for objects.

I'm still working things through, but maybe we never put vpids in the Dead Set, and never do local-only GC of promises, because I don't think JavaScript gives us enough information to know if/when that would be safe (but I could be wrong). And I'm hoping that basically all promises get resolved, so formal GC can happen, and we won't be left with enough garbage to be a concern.

The overall structure will look something like this:

[diagram: GC of imported promises]

We'll have a finalizer on the Promise, but its main job is to delete the slotToVal entry (freeing the WeakRef). I don't yet know if it should put the vpid in the Dead Set. I do know that any vref that gets formally deleted should be excluded from the Dead Set, so when we retire the vpid (because the promise was resolved), we should no longer pay attention to the finalizer firing.

User code might receive an imported Promise, use .then to attach a callback, and then forget the Promise. If nothing else is holding on to it (remembering that slotToVal is now weak), the Promise would be collected. This by itself would be fine, because nobody still needs the actual Promise, but we do need to ensure that the callback will be invoked eventually. We cannot sense that a callback has been added, so we must arrange for resolve or reject to be called unconditionally.

To this end, we need the importedPromisesByPromiseID table to be a strong Map from vpid to the pRec, which in turn provides a strong reference to the resolve and reject functions. As we saw above, this happens (at least in Node.js) to keep the Promise alive, but more importantly it almost certainly keeps the .then callback list alive, which is what we really care about. If importedPromisesByPromiseID were mapping to a WeakRef (like slotToVal does), perhaps directly to the resolve callback function, then we could be left in a situation where the kernel notifies the vat about a resolution, but liveslots could no longer reach the resolve() it needs to act upon it.

When the promise is resolved, we can delete the importedPromisesByPromiseID mapping. At that point, the Promise will only be kept alive by vat code keeping its own strong reference.

(TODO: investigate what happens during the interval between resolve() being called and the turn where the .then callback is invoked: what does the microtask queue keep alive?)

I'm still thinking through the exporting side, I'll write that up next.

@warner (Member, Author) commented Dec 8, 2020

GC of exported promises

[diagram: GC of exported promises]

On the export side, I think we need to add the Promise to the strong exports set, as we've planned to do for Remotables, but we won't expect to be able to drop the Promise before it is resolved (at which point we can formally retire it and drop it properly).

When user code creates a Promise and sends it in a message (or in the body of resolve-to-data), the act of serializing the as-yet-unseen Promise object will trigger liveslots to call exportPromise. This will use .then to add our thenResolve and thenReject callbacks, which will do some flavor of syscall.resolve when invoked, to inform the kernel that the vat has resolved the promise (instructing the kernel to notify any vats which imported/subscribed to the promise, as well as deliver any messages queued to the promise in the interim). Liveslots also adds the Promise to the valToSlot WeakMap, and will add a WeakRef to the Promise to slotToVal, just like it would for Remotables. Finally, we'll arrange for the Promise to be added to the exports Set, just like Remotables.
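
A sketch of that export path; the syscall shape here is schematic (it is being refactored in #2065), and allocatePromiseID and exportedRemotables (standing for the exports Set) are illustrative names:

function exportPromise(p) {
  const vpid = allocatePromiseID(); // e.g. 'p+12'
  valToSlot.set(p, vpid);
  slotToVal.set(vpid, new WeakRef(p));
  exportedRemotables.add(p); // strong exports set: keep the Promise alive until it settles
  const thenResolve = value => syscall.resolve(vpid, false, serialize(value));
  const thenReject = err => syscall.resolve(vpid, true, serialize(err));
  p.then(thenResolve, thenReject);
  return vpid;
}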

It's likely that user code will forget about the Promise immediately after sending it, only retaining access to the resolve and/or reject functions for later. As we've seen above, in practice this means the Promise will be kept alive as well, but we don't want to rely upon that. I'm pretty sure that the then callbacks would be kept alive, which means even if the Promise goes away, the kernel will still be notified when it gets resolved. However, if resolve/reject doesn't keep the Promise alive, and we didn't add the Promise to exports, then we could get into a situation where the Promise was GCed, removing it from slotToVal, which means if any other vat re-delivered the Promise into this vat, it would be treated as a new Promise. That's not ideal, although I can't immediately think of what problems it might cause.

When will exported Promises be dropped?

On the importing side, we must hang on to the Promise (importedPromisesByPromiseID must be a strong ref) until it is resolved, because we can't prove that the user code didn't call then and attach a callback. Our experiments show that resolve/reject keeps the Promise alive, and the Promise keeps the .then callbacks alive, but this isn't guaranteed by the spec. If we were to delete the importedPromisesByPromiseID record when the Promise was GCed, we might fail to resolve the promise when the kernel instructed us to, and userspace's callbacks might never get run.

So, barring some advances in the JS Promise specification (or an importing vat which runs something other than JS), vats have the ability to do a local drop of their imported promise vref, but they'll never use it.

As a result, although we'll probably implement the ability for vats to be told that the kernel no longer (locally) cares about their promise export, we don't expect it to be used. If implemented, this would simply delete the vref from the exports set. Later, if nothing else in the vat is keeping a strong reference to it, the Promise will be collected as usual, and eventually removed from slotToVal and valToSlot.

Instead, we expect to retire Promises if/when they are resolved. The syscall.resolve message (which is being refactored to enable multiple Promises to be resolved in a single batch: #2065), after serialization is complete and the syscall is delivered, will delete the resolved vpid values from the slotToVal table, and will delete the Promise itself from valToSlot. We must remove it from exports at the same time. With that accomplished:

  • 1: both vat and kernel can formally delete the c-list entry for the Promise, which might enable the kernel to formally delete the promise record itself (potentially freeing other objects, as usual)
  • 2: if the (now resolved) Promise is ever re-exported, it will be given a new short-lived vpid, a new syscall.resolve will be sent to re-inform the kernel of its resolution, and then the new vpid will be retired too
  • 3: if user code forgot about the Promise, and the code which invoked resolve or reject goes out of scope (dropping resolve/reject), then there should be no strong references left (we deleted the one from exports). There won't even be a WeakRef left, because we delete the slotToVal entry. So the Promise should be GCed shortly, and nothing in the vat will remember either the vref or the Promise (which hits our memory-usage goal).
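
A sketch of that retirement step for an exported promise, run after the syscall has been delivered (exportedRemotables again stands for the exports Set):

function retirePromiseID(vpid) {
  const wr = slotToVal.get(vpid);
  const p = wr && wr.deref();
  slotToVal.delete(vpid); // forget the vpid entirely
  if (p) {
    valToSlot.delete(p); // any later re-export gets a fresh, short-lived vpid
    exportedRemotables.delete(p); // drop our strong reference
  }
}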

@warner (Member, Author) commented Dec 8, 2020

Task List

  • add the dead set (perhaps with a better name)
  • change slotToVal to use WeakRefs
  • add exports Set, change convertValToSlot to add Remotables/Promises to it
  • export/import should remove the vref from the dead set too
  • add Finalizer to imports and exports, giving it the vref string as the associated data. The callback function should examine slotToVal for a live WeakRef, and otherwise add the vref to the dead set. Dead weakrefs should be deleted.
  • add a notification pathway into the kernel that is safe to use at arbitrary times, which queues the vat-id that might have something in the dead set
  • add a phase to the kernel cycle that queries all queued vat-ids for their dead set (introduce a dispatch-like pathway from kernel to vat, which is not recorded in the transcript)
  • add the two refcount fields to kernel object/promise table entries
  • add the local-in-use field to the c-list entries
  • implement the function (described above) to update the c-list entry of the importing vat which has just locally dropped the import
  • implement the kernel table GC sweep algorithm
  • add the kernel-to-vat delivery (non-transcript) that informs the vat of a locally-dropped export
  • probably more

@dtribble (Member) commented Dec 8, 2020

A naming issue: Please use "revoked objects" rather than "cancelled objects". In most systems, "cancel" generally means "stop making it", with a focus of "asynchronously stop consuming resources". Whereas "revoke" means "make it unavailable, preferably synchronously or in order". As a result of canceling, things might get revoked, but that's just one of the actions that could happen as a result of cancelation.

@warner (Member, Author) commented Dec 8, 2020

how exports are dropped

I think dropLocalExport(vref) and dropFormalExport(vref) do the same thing, and the only difference is that the local form doesn't get added to the transcript. Both will:

  • use slotToVal.get(vref).deref() to get the object (Remotable) being dropped
  • remove the Remotable from the valToSlot WeakMap
  • remove the Remotable from the exports Set
    • this derefs the Remotable, allowing it to be collected if the vat code isn't retaining its own reference
  • remove the vref from the slotToVal Map
    • this derefs the WeakRef, allowing it to be collected

At some point in the future, the Finalizer will fire. The Finalizer callback should do the following:

  • slotToVal.get(vref): if this is undefined, the object was already deleted, so return without additional effort
    • if this yields the WeakRef, query it with wr.deref()
      • if this returns truthy, the object was reborn, and is not dead, so return without additional effort
      • if this returns undefined, the object is dead, and we are the first to notice. Add the vref to the dead set, delete the vref from slotToVal (freeing the WeakRef), and notify the kernel
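A sketch of that callback, assuming the deadSet described earlier and a hypothetical notifyKernelOfDeadSet signal (the real notification pathway is a separate design question in this thread):

```js
const slotToVal = new Map();             // vref string -> WeakRef, as above
const deadSet = new Set();               // vrefs awaiting collection by the kernel
const notifyKernelOfDeadSet = () => {};  // hypothetical "something to collect" signal

function finalizeDroppedObject(vref) {
  const wr = slotToVal.get(vref);
  if (wr === undefined) {
    return;                   // entry already deleted: nothing more to do
  }
  if (wr.deref() !== undefined) {
    return;                   // the object was reborn (re-imported/re-exported), not dead
  }
  deadSet.add(vref);          // we are the first to notice the death
  slotToVal.delete(vref);     // free the now-dead WeakRef
  notifyKernelOfDeadSet();
}
```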

We use the same WeakRef and Finalizer setup for everything in slotToVal/valToSlot, which means both exported Remotables and imported Presences. The only difference within the vat is that Remotables are also referenced by the exports Set, while Presences are not. This keeps the Remotable alive until a dropLocalExport or dropFormalExport deletes it from exports.

(Another option is to keep a handle for the Finalizer around, and have the delete-from-everything code use it to proactively unregister the Finalizer, rather than ignoring the finalizer when its vref is already gone, but that'd be marginally more complicated)

@warner
Copy link
Member Author

warner commented Dec 10, 2020

At yesterday's meeting, @dtribble reminded me that exported Remotables must keep the same vref independent of local GC behavior. We'd worked that out a few weeks ago, but it dropped out of my most recent writeup. To maintain that property:

  • When informing a vat that an export is no longer needed, the kernel clears the local-in-use flag of the c-list entry after translating the dereferenced kref into the vat's vref. If the kernel is calling dropFormalExport instead of dropLocalExport, it deletes the c-list entry entirely.
  • The only thing dropLocalExport/dropFormalExport does is to remove the Remotable/Promise from the exports set; it should leave slotToVal and valToSlot untouched
    • (and the only difference between dropLocalExport and dropFormalExport is that the latter goes into the transcript)
  • If/when the vat code drops the Remotable itself, eventually the finalizer will run. Its job is to delete the slotToVal entry (cleaning up the now-dead WeakRef) and notify the kernel.
    • The exported Remotable's old vref (which will always be in the form o+NN rather than the o-NN used for imports) goes into the Dead Set, and the kernel eventually retrieves it (along with dead imports).
    • The kernel checks each dead vref against the c-list. If the local-in-use flag was already clear, then this is not a newly-dead object, and we don't need to do any GC processing on it.

So vats treat Remotables and Presences the same, except that Remotables go into the exports Set, and Presences don't.

If the vat code drops the Remotable before the kernel calls dropLocalExport, the Remotable will be unreachable from the vat code, but still weakly reachable through slotToVal. However, nothing in the kernel will ever mention it again, because that would require something in the kernel or other c-lists to still have a local reference, and dropLocalExport isn't called until the last of those goes away. At that point, there will be no strong references left, so eventually the Remotable will transition into "collected" (making it unreachable even through the now-dead WeakRef), and then into "finalized" (causing the finalizer to run, which frees the WeakRef itself).

If instead the vat code retains a reference to the Remotable and re-exports it after being told to drop the export, the vat will continue to use the same vref as before. This retains a one-to-one relationship between Remotable and vref, regardless of GC behavior. If the export was dropped through dropLocalExport, the c-list entry will still be present (merely the "in local use" flag will be false), so the translation will re-use the old kref too (and set the flag back to true). Only if the export was dropped through dropFormalExport (i.e. the client vat died) will a new kref be allocated. The choice between "use the old kref" and "allocate a new kref" therefore depends only upon a formal drop, never upon GC; this matters because the krefs in use are part of the formal consensus state.
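A hypothetical sketch of that kref-reuse rule on the kernel side; the c-list API (clist.get/clist.set) and allocateNewKref are assumptions, not the real kernel interfaces:

```js
// Returns the kref to use when a vat exports (or re-exports) a vref.
function translateVatExport(clist, vatID, vref, allocateNewKref) {
  const entry = clist.get(vatID, vref);
  if (entry !== undefined) {
    // dropLocalExport left the entry in place: reuse the old kref, since
    // krefs are part of the formal consensus state.
    entry.localInUse = true;
    return entry.kref;
  }
  // dropFormalExport (the client vat died) deleted the entry: allocate a fresh kref.
  const kref = allocateNewKref();
  clist.set(vatID, vref, { kref, localInUse: true });
  return kref;
}
```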

@warner
Copy link
Member Author

warner commented Dec 15, 2020

Sufficiently Deterministic GC

We don't want to overconstrain validators by requiring them to run an exact binary image (and diversity of validators is critical for security anyway, especially against supply-chain attacks, so we very much want each validator to compile their own code from verified upstream sources). But JS is sufficiently underspecified that we need some constraint to make sure that adversarial contract code cannot cause a consensus fault. E.g. we require that all validators run a JS engine whose sort() behavior is observably indistinguishable. We expect this means we'll pick some major version of XS (for which we expect all minor versions to have the same observable behavior) and require that all validators match. In practice, this probably means all validators will compile their own copy of XS from within that major-version range.

Given that existing constraint, we think we can rely upon GC to behave the same across all validators, especially if we slightly weaken our requirements. The idea @dtribble and @erights and I talked through today is:

  • we insist that the engine give us (i.e. the dispatch unit) a way to explicitly perform a precise+complete GC pass
    • Node.js has an --expose-gc flag which adds gc() to the start compartment. In combination with several setImmediate/setTimeout calls to allow the finalizers on the IO queue to run, this might be sufficient, although we don't know what sort of guarantees Node makes about completeness
    • in XS, we expect a C function that accepts the XSMachine structure to trigger this
    • by the time this GC pass completes, all finalizers must be run, and all WeakRefs to unreachable objects must .deref() === undefined
    • we require that the set of now-dereferenced objects be the same among all validators, however the exact order in which finalizers run does not need to be the same
  • we define a consistent point in the kernel cycle to trigger GC, perhaps at the end of each block, on each vat which was touched during that block
  • we use the same bringOutYourDead method as above to collect the newly-unreferenced vref imports; however, now the results are formal, which means bringOutYourDead can be a regular transcript-tracked dispatch
  • we remove the distinction between "formal" and "local" GC: all references are formal
    • this lets us remove the local refcount, and the "local-in-use" flag
    • dispatch.dropExport is now formal
      • liveslots reacts to dropExport by deleting both directions of the valToSlot/slotToVal tables
      • if/when the vat re-exports the same Remotable, it will get a new vref, and thus a new kref

By making an explicit gc() call, we hope to make it easier for engines to maintain the same observable behavior: they can still have different heuristics as to when GC happens naturally (in response to memory pressure, etc), but we force everything to be collected just before the one point when we sense the results (via bringOutYourDead). Finalizers might run early, but we don't observe their consequences until bringOutYourDead.
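A sketch of that sequence under Node with --expose-gc (so globalThis.gc exists); dispatchBringOutYourDead is an assumed callback for the bringOutYourDead dispatch, and on XS the gc() call would instead be the C-level trigger mentioned above:

```js
// Force a full collection, give finalizers a few turns to run, then sense
// the results at a single well-defined point.
async function gcAndSense(dispatchBringOutYourDead) {
  if (typeof globalThis.gc === 'function') {
    globalThis.gc(); // only present under `node --expose-gc`
  }
  // Several setImmediate turns let FinalizationRegistry callbacks run.
  for (let i = 0; i < 3; i += 1) {
    await new Promise(resolve => setImmediate(resolve));
  }
  return dispatchBringOutYourDead(); // the only point where GC results are observed
}
```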

We might require that finalizers don't run until the explicit gc() call; I think this needs more analysis. Our finalizer callback certainly shouldn't notify the kernel: it should only update the dead set. But I need to think about what happens if a vref is re-imported after its WeakRef is dead, and whether that would create observably different behavior.

@warner
Copy link
Member Author

warner commented Jun 21, 2021

We landed the changes for #3106 and #2615, and for the most part we're no longer trying to pursue a non-deterministic form.

I'm going to leave this open until I can file a ticket with my notes on the remaining non-deterministic handling we still need: the case where we take a transcript from a vat that ran under xsnap and replay it (for debugging) in a local worker whose GC behaves differently.

@erights
Copy link
Member

erights commented Jun 22, 2021

Do you expect the GC differences to be observable to unprivileged user code? Is there any way in which having GC on is observably different from no GC at all, to unprivileged user code?
