give liveslots access to setImmediate, for GC #2243

Closed
warner opened this issue Jan 22, 2021 · 0 comments · Fixed by #3164
warner commented Jan 22, 2021

What is the Problem Being Solved?

I have a plan to let vats use a non-transcripted syscall.dropImport to signal both deterministic/deliberate drops and WeakRef-based ones. To implement it, liveslots will need to know when the user-level vat code has lost agency, which means it needs access to setImmediate (or more likely the waitUntilQuiescent wrapper). This will rearrange some of the responsibility for knowing that a crank has ended, but in a way that fits better with the various worker types.
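
For reference, the wrapper mentioned above can be very small. This is a minimal sketch, not the actual SwingSet helper, but it shows why handing setImmediate to liveslots is enough to detect the lost-agency point:

```js
// Minimal sketch (not the real waitUntilQuiescent): setImmediate callbacks
// run from the macrotask queue, after all pending promise callbacks
// (microtasks) have drained, so awaiting this tells liveslots that
// user-level vat code can no longer regain control on its own.
function waitUntilQuiescent() {
  return new Promise((resolve) => {
    setImmediate(resolve);
  });
}
```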

The idea is that liveslots is responsible for managing WeakRefs (in the slotToVal table) and the FinalizationRegistry used to detect deletions, and those notifications always come in their own turn, and liveslots needs to (eventually, at the right time) react to those notifications by making dropImport syscalls.
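
As a rough illustration of that arrangement (the slotToVal name comes from the text above; the other names are invented for the sketch, and real liveslots also tracks exports, promises, etc.):

```js
// Sketch only: how liveslots might track imports with WeakRefs and a
// FinalizationRegistry.
const slotToVal = new Map(); // vref string -> WeakRef to the Presence
const deadSet = new Set();   // vrefs whose Presences have been collected

const registry = new FinalizationRegistry((vref) => {
  // These callbacks always run in their own turn, never during user code.
  deadSet.add(vref);
});

function importPresence(vref, presence) {
  slotToVal.set(vref, new WeakRef(presence));
  registry.register(presence, vref);
}
```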

My plan is for the crank cycle to look like this (a rough sketch follows the list):

  • liveslots gets control via dispatch.deliver or dispatch.notify (the future dispatch.dropExport is the same, but won't result in the user-level vat code getting control)
    • in this approach, dispatch.* functions return a Promise, and the supervisor relies upon it to know when the crank is done, rather than using setImmediate/waitUntilQuiescent itself
  • liveslots deserializes the arguments, updating slotToVal/valToSlot in the process for any new imports
  • liveslots invokes user-level vat code, by invoking a method (dispatch.deliver) or resolving one or more Promises (dispatch.notify)
    • while user code is running, more syscalls will be made, including syscall.dropImport for vats which have a way to voluntarily give up access to an import (e.g. the comms vat, maybe the referenced imports of explicitly-deleted virtual objects)
  • liveslots waits until user code has gone idle ("lost agency"), with setImmediate
  • (on some platforms)
    • liveslots invokes a gc() function provided by its supervisor, to trigger an engine-level GC sweep
    • liveslots waits some more, to allow FinalizationRegistry callbacks to run. In my experiments on Node.js, this required two setImmediate cycles, and/or maybe a setTimeout(epsilon); I need to re-investigate
      • maybe gc() should return a Promise, and encapsulate any necessary stalls
  • liveslots checks the set of dereferenced imports accumulated by the FR callbacks, sorts them somehow (to improve determinism), and performs syscall.dropImport for each
  • liveslots resolves the dispatch.* return Promise, letting the supervisor know the crank is done
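
Put together, the end-of-crank side of that cycle might look roughly like the following. This is illustrative only: deliverToUserCode, gcTool, deadSet, and syscall are placeholder names, and the two extra waits reflect the Node.js observation above.

```js
// Illustrative sketch of a dispatch.deliver handler inside liveslots.
async function deliver(vatDeliveryObject) {
  deliverToUserCode(vatDeliveryObject); // may queue user-level promise turns
  await waitUntilQuiescent();           // user code has now lost agency

  if (gcTool) {
    gcTool();                   // supervisor-provided: provoke an engine GC sweep
    await waitUntilQuiescent(); // let FinalizationRegistry callbacks run
    await waitUntilQuiescent(); // (two cycles were needed in Node.js experiments)
  }

  // Report dereferenced imports in a deterministic order.
  for (const vref of [...deadSet].sort()) {
    syscall.dropImport(vref);
    slotToVal.delete(vref);
  }
  deadSet.clear();
  // Returning here resolves the dispatch.deliver Promise: the crank is done.
}
```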

The bigger picture is a collection of the kernel, some VatManagers (of various types), their associated worker processes (perhaps local, perhaps in a child process), the Supervisors in those workers, the liveslots layer, and the user-level vat code.

  • the kernel pulls an item off the run-queue, figures out which vat it is destined for, creates a KernelDeliveryObject, translates it (through the kernel's c-lists for that vat) into a VatDeliveryObject, hands it to the right VatManager (which returns a Promise for when the delivery is complete)
  • the VatManager somehow conveys the VatDeliveryObject (which is pure serializable data) to the worker, which might mean serializing it over a message pipe, or handing it to a local function
  • the worker somehow receives this VatDeliveryObject, deserializing it if necessary, giving it to the supervisor
  • the supervisor configures/resets the meters, somehow
  • the supervisor enables liveslots' syscall object
  • the supervisor invokes liveslots' dispatch.* method, which returns a Promise for when liveslots is done (see the sketch after this list)
  • when the supervisor sees that Promise resolve, it disables the syscall object, consults the meters for underflow or remaining computrons, and sends the crank results back to the parent (which might need to be serialized by the worker, to send over the message pipe)
  • when the VatManager receives the crank results from its worker, it resolves the Promise it gave to the kernel
  • when the kernel sees that Promise resolved, it either commits the crank (for success) or rolls it back (for failure)
  • the kernel loops back to the next run-queue item, or gives the host application the option of continuing or finishing a block
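
The supervisor's portion of that sequence might be sketched like this; every name here is invented for illustration, and metering details and message-pipe serialization are elided:

```js
// Hypothetical supervisor-side handling of one delivery.
async function handleDelivery(vatDeliveryObject) {
  resetMeters();        // however this worker type configures meters
  enableSyscalls();     // liveslots may make syscalls only during the crank
  let status = 'ok';
  try {
    await dispatch.deliver(vatDeliveryObject); // resolves when liveslots is done
  } catch (err) {
    status = 'error';
  } finally {
    disableSyscalls();
  }
  const meters = readMeters(); // underflow? remaining computrons?
  return { status, meters };   // crank results, sent back to the VatManager
}
```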

The authority/reliance allocation is:

  • the VatManager can do anything the worker can do, plus crash the kernel
  • the worker can do anything liveslots can do, plus crash the worker, violate metering, be nondeterministic
  • liveslots can do anything the vat can do (send messages to any object exposed to the vat), plus be nondeterministic, plus provoke a GC sweep using whatever tool the supervisor gives it
  • the user-level buildRootObject can use vatPowers to read/write the per-vat offline storage (enabling a non-ocap communication channel), maybe a few other minor powers
  • user-level objects are limited by normal ocap discipline

For XS-based workers, the gc() tool will only provoke a sweep of the one engine. (Our current xsnap approach only has one engine per process, but the long-term picture will have multiple). For Node.js we might have it provoke GC across the entire kernel-plus-local-workers process, or make it a no-op, depending upon what the performance consequences are. The issue is that frequent GC is less efficient than batching it, but rarely-visited vats might have dead objects that can't be reported until they get control again. User-level code will only lose reachability to objects in response to a delivery being made (including dropExport), which happens in a crank, so the best time to discover those drops is just afterwards (with an intervening gc() call to prod the engine into finding out). If we knew for sure that there was another delivery to this vat coming up, we could defer the gc() and amortize the cost, but we don't generally have a way to do that (maaaybe something in the kernel that keeps track of lonely vats and sends them a "hey, haven't talked in a while, I have no work for you, but do you maybe have any garbage from before that we should clean up" message; we could do this just before evicting them into a snapshot, but that wouldn't help with memory/object footprint before eviction).
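
For the Node.js case, a gc() tool that returns a Promise and encapsulates its own stalls (as floated in the list above) could look something like this sketch; it assumes the process was started with --expose-gc so that globalThis.gc exists:

```js
// Sketch, assuming Node.js was launched with --expose-gc.
async function gcAndFinalize() {
  globalThis.gc(); // force a full engine-level sweep
  // Give FinalizationRegistry callbacks a chance to run; two setImmediate
  // cycles were needed in the experiments described above.
  await new Promise((resolve) => setImmediate(resolve));
  await new Promise((resolve) => setImmediate(resolve));
}
```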

Alternate Approaches

My earlier thinking in #1872 used a separate special dispatch.bringOutYourDead call, which returns a list of dereferenced vrefs. I think we need a syscall form of this, both for the comms vat (which knows exactly when an import is no longer referenced, which is just after it receives a dropImport from the last remote system, modulo local promise resolutions that might also maintain a reference), and for vats that can somehow deliberately drop imports (like a virtual object that is explicitly deleted, releasing any data it contained). Once we have a dropImport syscall for that purpose, it makes sense to have the bring-out-your-dead phase use it too, rather than a secondary pathway in the return value.

The purpose of an explicit bringOutYourDead is to avoid hearing about dropImports from Vat A while we're actually running a crank on Vat B. One alternative would be for the kernel/supervisor/something to follow every delivery with a gc() sweep and a bringOutYourDead call, which would avoid needing to give either gc() or setImmediate() to liveslots. This could leave the prompt-vs-efficient tradeoff to something higher up, which might be better. But the manager/worker is what knows the vat's JS engine the best, and it would involve three phases instead of just one, which feels more complicated.
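
In that alternative, something above liveslots would own the sequence, roughly as follows (names are hypothetical, reusing the gcAndFinalize sketch from above):

```js
// Sketch of the three-phase alternative: deliver, sweep, then collect.
async function deliverThenCollect(vatDeliveryObject) {
  await dispatch.deliver(vatDeliveryObject); // ordinary crank, no GC concerns
  await gcAndFinalize();                     // sweep provoked outside liveslots
  await dispatch.bringOutYourDead();         // liveslots emits syscall.dropImport
                                             // for each newly-dead vref
}
```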

#1872 also proposes a layering of notifications: vats could notify the kernel that they have some garbage to collect (without naming which vrefs) at any time, and the kernel reacts by scheduling a bringOutYourDead call at some point in the future. That gives the kernel control over the prompt-vs-efficient tradeoff, but doesn't provide a good story for when gc() should be provoked. In Node.js we can probably rely upon the automatic occasional gc() call to manage local heap space, but uncollected garbage also keeps objects alive in other vats, and on remote machines, so we might want to lean towards promptness over efficiency. In an XS worker, each worker has its own engine, so the kernel has less visibility into when it might be appropriate to provoke GC on each one.

Description of the Design

Security Considerations

Test Plan
