Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Liveslots should not meter GC-sensitive code paths #3458

Closed
warner opened this issue Jul 9, 2021 · 2 comments · Fixed by #3667
Closed

Liveslots should not meter GC-sensitive code paths #3458

warner opened this issue Jul 9, 2021 · 2 comments · Fixed by #3667
Assignees
Labels
enhancement New feature or request SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Jul 9, 2021

What is the Problem Being Solved?

As described in #3457, we want metering to be stable and consistent across validators, while allowing "organic" GC timing to vary. To achieve this, we need liveslots to put all its code that might interact with GC (processDeadSet, the gcAndFinalize call that allows FinalizationRegistry callbacks to run) inside an "unmetered box", by calling some disable-metering function upon entry, and a re-enable-metering function upon exit.

For this to work, we need the engine to never invoke finalizer callbacks on its own. I think XS (and xsnap.c in particular) gives us this property, but we need to check.

We aren't doing metering on Node.js, so it doesn't matter, but for completeness I'll mention that finalizers only appear to run when the IO queue is allowed to run, so I think they wouldn't run until we do a setImmediate. Unfortunately, we must initiate a setImmediate as the crank begins (to sense when userspace has given up agency), so it's possibly that finalizers will get to run before we have a chance to enter the box.

I'm adding this to the Testnet Metering Phase milestone, although I suspect we'd be ok leaving it out, and instead making GC timing deterministic, at least for now.

Description of the Design

We'll use the callbacks provided by #3457 wrap the following liveslots code in an enter/exit pair:

  • m.deserialize() (or maybe just convertSlotToVal where it queries slotToVal and the WeakRef therein) during:
    • dispatch.deliver argument deserialization
    • dispatch.notify resolution value deserialization
    • VOM virtual object value getter deserialization
    • virtual weak store value deserialization
  • the body of finish(), which encloses gcAndFinalize (where finalizers might run) and processDeadSet (where the dead set is iterated)

Security Considerations

Test Plan

@warner warner added enhancement New feature or request SwingSet package: SwingSet labels Jul 9, 2021
@warner warner added this to the Testnet: Metering Phase milestone Jul 9, 2021
@warner warner self-assigned this Jul 9, 2021
@warner
Copy link
Member Author

warner commented Jul 15, 2021

In today's kernel meeting, @dtribble pointed out that our current approach (forcing GC at the end of every crank, but not having snapshot-insensitive GC timing) is insufficient. A Representative that is reachable but untouched during the early parts of a crank, then dropped (e.g. an offer that is dropped in favor of a better one), then resurrected by deserialization later in the crank, might or might not have been collected between the drop and the resurrection. We identified part of the ECMAScript spec (maybe related to tc39/proposal-weakrefs#39 (comment)) that mandates the answer of a weakref.deref() remain stable for some period of time (the language was hard to interpret, but @dtribble concluded it means "until the end of the current turn"), but we decided that wasn't enough to prevent variant execution paths in the middle of the crank.

We decided that it was safe to put virtual object deserialization (and thus kind-constructor invocation) inside the "unmetering box" because we're not yet allowing adversarial contract code, so we're ok with the risk of an infinite loop in the kind constructor causing the vat to cease progress (or stall the entire kernel). The AST-based parser idea in #3462 (comment) could also protect against this: by denying any unusual code from running during the kind constructor, we'd also effectively bound its meter usage to something vaguely O() of the length of the source code.

@warner
Copy link
Member Author

warner commented Jul 15, 2021

@erights also floated the idea of effectively disabling organic GC by changing XS to make all WeakRefs behave like strong references except for forced gc(). We could do this in userspace instead, by:

  • liveslots maintains a Set of vrefs and a Set of object references
  • every time we add something to slotToVal, add the vref and the object to the Sets
  • just before we force gc(), clear the object-reference Set
  • when we remove something from slotToVal (as a result of deadSet processing), remove it from the Set of vrefs
  • just after processing deadSet, re-populate the Set of object references by walking the vref Set and looking up the objects in slotToVal

This is probably equivalent to hold = new Set(slotToVal.keys().map(wr => wr.deref())) at the end of processDeadSet and buildLiveSlots, plus hold.clear() just before processDeadSet.

This would allow organic GC of normal objects within the vat, but not of any vref-based objects (Remotable, Presence, Representative).

The cost would be proportional to the number of live vref-based objects in a vat, times the frequency with which we force GC (and thus have to re-populate the hold Set).

warner added a commit that referenced this issue Aug 3, 2021
When a swingset kernel is part of a consensus machine, the visible state must
be a deterministic function of userspace activity. Every member kernel must
perform the same set of operations.

However we are not yet confident that the timing of garbage collection will
remain identical between kernels that experience different patterns of
snapshot+restart. In particular, up until recently, the amount of "headroom"
in the XS memory allocator was reset upon snapshot reload: the new XS engine
only allocates as much RAM as the snapshot needs, whereas before the snapshot
was taken, the RAM footprint could have been larger (e.g. if a large number
of objects we allocated and then released), leading to more "headroom".
Automatic GC is triggered by an attempt to allocate space which cannot be
satisfied by this headroom, so it will happen more frequently in the
post-reload engine than before the snapshot.

To accommodate differences in GC timing between kernels that are otherwise
operating in consensus, this commit introduces the "unmetered box": a span of
execution that does not count against the meter. We take all of the liveslots
operations that might be sensitive to the engine's GC behavior and put them
"in" the box.

This includes any code that calls `deref()` on the WeakRefs used in
`slotToVal`, because the WeakRef can change state from "live" to "dead" when
GC is provoked, which is sensitive to more than just userspace behavior.

closes #3458
warner added a commit that referenced this issue Aug 3, 2021
When a swingset kernel is part of a consensus machine, the visible state must
be a deterministic function of userspace activity. Every member kernel must
perform the same set of operations.

However we are not yet confident that the timing of garbage collection will
remain identical between kernels that experience different patterns of
snapshot+restart. In particular, up until recently, the amount of "headroom"
in the XS memory allocator was reset upon snapshot reload: the new XS engine
only allocates as much RAM as the snapshot needs, whereas before the snapshot
was taken, the RAM footprint could have been larger (e.g. if a large number
of objects we allocated and then released), leading to more "headroom".
Automatic GC is triggered by an attempt to allocate space which cannot be
satisfied by this headroom, so it will happen more frequently in the
post-reload engine than before the snapshot.

To accommodate differences in GC timing between kernels that are otherwise
operating in consensus, this commit introduces the "unmetered box": a span of
execution that does not count against the meter. We take all of the liveslots
operations that might be sensitive to the engine's GC behavior and put them
"in" the box.

This includes any code that calls `deref()` on the WeakRefs used in
`slotToVal`, because the WeakRef can change state from "live" to "dead" when
GC is provoked, which is sensitive to more than just userspace behavior.

closes #3458
warner added a commit that referenced this issue Aug 6, 2021
When a swingset kernel is part of a consensus machine, the visible state must
be a deterministic function of userspace activity. Every member kernel must
perform the same set of operations.

However we are not yet confident that the timing of garbage collection will
remain identical between kernels that experience different patterns of
snapshot+restart. In particular, up until recently, the amount of "headroom"
in the XS memory allocator was reset upon snapshot reload: the new XS engine
only allocates as much RAM as the snapshot needs, whereas before the snapshot
was taken, the RAM footprint could have been larger (e.g. if a large number
of objects we allocated and then released), leading to more "headroom".
Automatic GC is triggered by an attempt to allocate space which cannot be
satisfied by this headroom, so it will happen more frequently in the
post-reload engine than before the snapshot.

To accommodate differences in GC timing between kernels that are otherwise
operating in consensus, this commit introduces the "unmetered box": a span of
execution that does not count against the meter. We take all of the liveslots
operations that might be sensitive to the engine's GC behavior and put them
"in" the box.

This includes any code that calls `deref()` on the WeakRefs used in
`slotToVal`, because the WeakRef can change state from "live" to "dead" when
GC is provoked, which is sensitive to more than just userspace behavior.

closes #3458
warner added a commit that referenced this issue Aug 12, 2021
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
warner added a commit that referenced this issue Aug 12, 2021
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
warner added a commit that referenced this issue Aug 13, 2021
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
warner added a commit that referenced this issue Aug 13, 2021
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
warner added a commit that referenced this issue Aug 13, 2021
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
warner added a commit that referenced this issue Aug 13, 2021
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
warner added a commit that referenced this issue Aug 13, 2021
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
warner added a commit that referenced this issue Aug 13, 2021
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
warner added a commit that referenced this issue Aug 13, 2021
The `meterControl` tool will give liveslots a way to temporarily disable
metering, so it can perform GC-sensitive operations without affecting the
vat's metering results.

Each supervisor provides their own version: all are dummy facades except for
XS.

refs #3458
warner added a commit that referenced this issue Aug 13, 2021
We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random
point in the middle of a turn. Any code which calls `wr.deref`, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes `convertSlotToVal`, and therefore `m.unserialize`.

We cannot allow metering results to diverge between validators, because:

* 1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
* 2: it will change the large-scale Meter value, which is reported to
userspace
* 3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all `dispatch.*` methods, which
runs entirely within liveslots:

* `dispatch.deliver` performs argument deserialization in the first turn,
  then executes user code in the second and subsequent turns
* `dispatch.notify` does the same
* the GC deliveries (`dispatch.dropExport`, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
`slotToVal`).

Finally, we disable metering for all turns of the post-crank GC `finish()`
call. This excludes all invocations of the finalizer callbacks, as well as
all the `processDeadSet` code which is highly sensitive to the results.

closes #3458
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request SwingSet package: SwingSet
Projects
None yet
1 participant