Liveslots should not meter GC-sensitive code paths #3458

warner · 2021-07-09T07:47:40Z

What is the Problem Being Solved?

As described in #3457, we want metering to be stable and consistent across validators, while allowing "organic" GC timing to vary. To achieve this, we need liveslots to put all its code that might interact with GC (processDeadSet, the gcAndFinalize call that allows FinalizationRegistry callbacks to run) inside an "unmetered box", by calling some disable-metering function upon entry, and a re-enable-metering function upon exit.

For this to work, we need the engine to never invoke finalizer callbacks on its own. I think XS (and xsnap.c in particular) gives us this property, but we need to check.

We aren't doing metering on Node.js, so it doesn't matter there, but for completeness I'll mention that finalizers only appear to run when the IO queue is allowed to run, so I think they wouldn't run until we do a setImmediate. Unfortunately, we must initiate a setImmediate as the crank begins (to sense when userspace has given up agency), so it's possible that finalizers will get to run before we have a chance to enter the box.

I'm adding this to the Testnet Metering Phase milestone, although I suspect we'd be ok leaving it out, and instead making GC timing deterministic, at least for now.

Description of the Design

We'll use the callbacks provided by #3457 to wrap the following liveslots code in an enter/exit pair:

- m.deserialize() (or maybe just convertSlotToVal where it queries slotToVal and the WeakRef therein) during:
  - dispatch.deliver argument deserialization
  - dispatch.notify resolution value deserialization
  - VOM virtual object value getter deserialization
  - virtual weak store value deserialization
- the body of finish(), which encloses gcAndFinalize (where finalizers might run) and processDeadSet (where the dead set is iterated)
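
A minimal sketch of what such an enter/exit pair could look like. The names `makeMeterControl`, `enterUnmetered`, `exitUnmetered`, and `runUnmetered` are hypothetical; the real callbacks are whatever #3457 ends up providing, and here the "meter" is only a flag so the control flow can be observed.

```javascript
// Hypothetical stand-in for the metering callbacks from #3457.
// A depth counter is used so that nested boxes do not re-enable
// metering too early.
function makeMeterControl() {
  let depth = 0;
  return {
    enterUnmetered: () => {
      depth += 1;
    },
    exitUnmetered: () => {
      depth -= 1;
    },
    isMetered: () => depth === 0,
  };
}

// Run `thunk` inside the unmetered box. The `finally` clause re-enables
// metering even if the GC-sensitive code throws.
function runUnmetered(meterControl, thunk) {
  meterControl.enterUnmetered();
  try {
    return thunk();
  } finally {
    meterControl.exitUnmetered();
  }
}

// Usage: record the metering state before, inside, and after the box.
const meterControl = makeMeterControl();
const observed = [];
observed.push(meterControl.isMetered());
const result = runUnmetered(meterControl, () => {
  // e.g. m.deserialize() or the body of finish() would go here
  observed.push(meterControl.isMetered());
  return 'deserialized';
});
observed.push(meterControl.isMetered());
```

This sketch is synchronous. If the wrapped code is asynchronous, the exit call would have to wait until the returned promise settles, otherwise metering would be re-enabled while GC-sensitive work is still pending.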

Security Considerations

Test Plan


warner · 2021-07-15T00:16:53Z

In today's kernel meeting, @dtribble pointed out that our current approach (forcing GC at the end of every crank, but not having snapshot-insensitive GC timing) is insufficient. A Representative that is reachable but untouched during the early parts of a crank, then dropped (e.g. an offer that is dropped in favor of a better one), then resurrected by deserialization later in the crank, might or might not have been collected between the drop and the resurrection. We identified part of the ECMAScript spec (maybe related to tc39/proposal-weakrefs#39 (comment)) that mandates the answer of a weakref.deref() remain stable for some period of time (the language was hard to interpret, but @dtribble concluded it means "until the end of the current turn"), but we decided that wasn't enough to prevent variant execution paths in the middle of the crank.

We decided that it was safe to put virtual object deserialization (and thus kind-constructor invocation) inside the "unmetering box" because we're not yet allowing adversarial contract code, so we're ok with the risk of an infinite loop in the kind constructor causing the vat to cease progress (or stall the entire kernel). The AST-based parser idea in #3462 (comment) could also protect against this: by denying any unusual code from running during the kind constructor, we'd also effectively bound its meter usage to something vaguely O() of the length of the source code.
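
To illustrate why the drop-then-resurrect case leads to variant execution paths, here is a simplified sketch. It is not the real liveslots code: `makeRepresentative`, the `workUnits` counter (a stand-in for the meter), and the fake WeakRef objects are all assumptions chosen so that both GC outcomes can be reproduced deterministically.

```javascript
// Stand-in for the meter: counts units of work done by "liveslots" code.
let workUnits = 0;

// Hypothetical re-creation of a Representative. In the real system this
// is where the kind constructor would be invoked.
function makeRepresentative(vref) {
  workUnits += 10; // arbitrary cost, for illustration only
  return { vref };
}

// Simplified convertSlotToVal: reuse the object if its WeakRef is still
// live, otherwise rebuild it and record the new reference.
function convertSlotToVal(slotToVal, vref) {
  workUnits += 1;
  const wr = slotToVal.get(vref);
  let val = wr && wr.deref();
  if (val === undefined) {
    val = makeRepresentative(vref);
    slotToVal.set(vref, { deref: () => val });
  }
  return val;
}

// Fake WeakRefs: one whose target survived, one whose target was collected.
const liveRef = obj => ({ deref: () => obj });
const deadRef = () => ({ deref: () => undefined });

// Measure the work done by a single lookup for a given reference state.
function costOfLookup(ref) {
  workUnits = 0;
  convertSlotToVal(new Map([['o+1', ref]]), 'o+1');
  return workUnits;
}

const costIfNotCollected = costOfLookup(liveRef({ vref: 'o+1' })); // 1
const costIfCollected = costOfLookup(deadRef()); // 11
```

The same userspace action costs a different amount depending on whether GC happened between the drop and the resurrection, which is why this path has to sit inside the unmetered box.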

warner · 2021-07-15T00:48:15Z

@erights also floated the idea of effectively disabling organic GC by changing XS to make all WeakRefs behave like strong references except for forced gc(). We could do this in userspace instead, by:

- liveslots maintains a Set of vrefs and a Set of object references
- every time we add something to slotToVal, add the vref and the object to the Sets
- just before we force gc(), clear the object-reference Set
- when we remove something from slotToVal (as a result of deadSet processing), remove it from the Set of vrefs
- just after processing deadSet, re-populate the Set of object references by walking the vref Set and looking up the objects in slotToVal

This is probably equivalent to `hold = new Set([...slotToVal.values()].map(wr => wr.deref()))` at the end of processDeadSet and buildLiveSlots, plus `hold.clear()` just before processDeadSet. (The WeakRefs are the values of slotToVal, not its keys.)

This would allow organic GC of normal objects within the vat, but not of any vref-based objects (Remotable, Presence, Representative).

The cost would be proportional to the number of live vref-based objects in a vat, times the frequency with which we force GC (and thus have to re-populate the hold Set).
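
A rough sketch of the userspace "hold" Set described above, assuming slotToVal is a Map from vref strings to WeakRefs. The helper names (`registerValue`, `repopulateHold`, `forcedGCAndProcessDeadSet`) are hypothetical, and the forced gc() step is left as a comment because it is engine-specific. Since slotToVal already tracks the live vrefs, this version walks it directly instead of keeping a separate Set of vrefs.

```javascript
// Assumed shape: vref string -> WeakRef to the object.
const slotToVal = new Map();
// Strong references that keep vref-based objects alive between forced GCs.
const hold = new Set();

// Called whenever liveslots adds an entry to slotToVal.
function registerValue(vref, obj) {
  slotToVal.set(vref, new WeakRef(obj));
  hold.add(obj);
}

// Rebuild the strong-reference Set from whatever is still alive.
function repopulateHold() {
  hold.clear();
  for (const wr of slotToVal.values()) {
    const obj = wr.deref();
    if (obj !== undefined) {
      hold.add(obj);
    }
  }
}

// `deadSet` holds the vrefs whose finalizers have fired.
function forcedGCAndProcessDeadSet(deadSet) {
  hold.clear(); // release strong refs so the forced GC can collect them
  // ... forced gc() and finalizer callbacks would run here (omitted)
  for (const vref of deadSet) {
    slotToVal.delete(vref);
  }
  repopulateHold();
}

// Usage: two registered objects, one of which is reported dead.
const a = { name: 'a' };
const b = { name: 'b' };
registerValue('o+1', a);
registerValue('o+2', b);
const heldBefore = hold.size;
forcedGCAndProcessDeadSet(new Set(['o+2']));
const heldAfter = hold.size;
```

The walk in `repopulateHold` is where the cost mentioned above shows up: it touches every live vref-based object each time GC is forced.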

When a swingset kernel is part of a consensus machine, the visible state must be a deterministic function of userspace activity. Every member kernel must perform the same set of operations.

However, we are not yet confident that the timing of garbage collection will remain identical between kernels that experience different patterns of snapshot+restart. In particular, up until recently, the amount of "headroom" in the XS memory allocator was reset upon snapshot reload: the new XS engine only allocates as much RAM as the snapshot needs, whereas before the snapshot was taken, the RAM footprint could have been larger (e.g. if a large number of objects were allocated and then released), leading to more "headroom". Automatic GC is triggered by an attempt to allocate space which cannot be satisfied by this headroom, so it will happen more frequently in the post-reload engine than before the snapshot.

To accommodate differences in GC timing between kernels that are otherwise operating in consensus, this commit introduces the "unmetered box": a span of execution that does not count against the meter. We take all of the liveslots operations that might be sensitive to the engine's GC behavior and put them "in" the box. This includes any code that calls `deref()` on the WeakRefs used in `slotToVal`, because the WeakRef can change state from "live" to "dead" when GC is provoked, which is sensitive to more than just userspace behavior.

closes #3458