exclude liveslots deserialization code from metering #3667

warner · 2021-08-12T19:40:17Z

We're trying to hedge against XS not necessarily performing
"organic" (non-forced) GC at exactly the same across all members (validators)
of a consensus machine, especially when some members have reloaded a vat from
snapshot (i.e. restarted) recently and others have not. We rely upon
finalizer callbacks only running at explicitly-deterministic times, but we
must still guard against objects becoming collected spontaneously, which will
cause a WeakRef to become "dead" (wr.deref() === undefined) at a random
point in the middle of a turn. Any code which calls wr.deref, or is
conditionally executed/skipped according to the results, is "GC-sensitive".
This includes convertSlotToVal, and therefore m.unserialize.

We cannot allow metering results to diverge between validators, because:

1: it might make the difference between the crank completing successfully,
and the vat being terminated for a per-crank metering fault
2: it will change the large-scale Meter value, which is reported to
userspace
3: it might cause the runPolicy to finish the block earlier on one
validator than on others

all of which would cause a consensus failure.

To prevent this, we run most of the "inbound" side of liveslots without
metering. This includes the first turn of all dispatch.* methods, which
runs entirely within liveslots:

dispatch.deliver performs argument deserialization in the first turn,
then executes user code in the second and subsequent turns
dispatch.notify does the same
the GC deliveries (dispatch.dropExport, etc) only use one turn

We also disable metering when deserializing the return value from
a (synchronous) device call, and when retiring a promise ID (which touches
slotToVal).

Finally, we disable metering for all turns of the post-crank GC finish()
call. This excludes all invocations of the finalizer callbacks, as well as
all the processDeadSet code which is highly sensitive to the results.

closes #3458

FUDCo · 2021-08-13T18:23:54Z

packages/SwingSet/src/kernel/vatManager/supervisor-subprocess-xsnap.js

+    const limit = globalThis.currentMeterLimit();
+    const before = globalThis.resetMeter(0, 0);
+    meteringDisabled += 1;
+    try {
+      return await thunk();
+    } finally {
+      globalThis.resetMeter(limit, before);
+      meteringDisabled -= 1;
+    }


Nervous whether this pattern actually works -- can you get away with scoping async operations like this, even if you use await?

I just checked with @erights and @michaelfig , and the answer is "yes, but don't do it that way". The finally is guaranteed to run, and in our case this pattern would be fine, but their recommendation was to use this instead:

async function runWithoutMeteringAsync(thunk) { const limit = globalThis.currentMeterLimit(); const before = globalThis.resetMeter(0, 0); meteringDisabled += 1; return Promise.resolve() .then(() => thunk()) .finally(() => { globalThis.resetMeter(limit, before); meteringDisabled -= 1; }); }

The difference is that if thunk() throws (synchronously), in my version (return await thunk()) the finally block will get executed in the same turn, but in theirs (using Promise.prototype.finally) the finalizer always runs in its own turn.

I'll update it to use their recommendation.

I just checked with @erights and @michaelfig , and the answer is "yes, but don't do it that way". The finally is guaranteed to run, and in our case this pattern would be fine, but their recommendation was to use this instead:

Just for the record, what did I say "yes" to? The fact that the finally may run in the same or another turn, as you correctly explain, for me means that code does not work.

"Yes" meant that finally { } always runs. even if there's a return or await in the try { }, which I think was the "does JS work the way you seem to think it works" question that @FUDCo was asking. I think we were both uncertain about whether the JS try/finally pattern continues to work after the introduction of async/await, but it does: the general semantics are the same, modulo the timing subtlety you pointed out.

In this case, my code was not depending upon having a turn boundary between the thunk running and the counter being decremented. runWithoutMeteringAsync is not being granted to adversarial code (this is entirely within liveslots), so there's not an adversary to perform reentrancy. The thunk is also friendly code (if it were adversarial, a trivial Promise.then would let it regain agency after the thunk's Promise had returned). I'm all in favor of using a pattern that's easier to review, and I've updated this PR to do so, but in this case it wouldn't have caused a bug to do it the return await thunk() way.

FUDCo

Well OK then

FUDCo · 2021-08-13T18:29:19Z

packages/SwingSet/src/kernel/liveSlots.js

+        // Set.delete() metering seems unaffected by presence/absence, but it
+        // doesn't matter anyway because deadSet.add only happens when
+        // finializers run, which happens deterministically


Seems to me that this comment is not quite right. Finalizers run non-deterministically -- that's why we use deadSet to protect ourselves. It's the processing of deadSet that's run deterministically.

By "finalizers run deterministically" I mean that, given the way xsnap.c uses XS, finalizer callbacks only run during the gcAndFinalize() window. XS only polls the registries during fxEndJob(), and we only call that after the promise queue is empty (

agoric-sdk/packages/xsnap/src/xsnap.c

Line 1108 in cc06a29

for (;;) {

):

void fxRunLoop(txMachine* the) { c_timeval tv; txNumber when; txJob* job; txJob** address; for (;;) { while (the->promiseJobs) { while (the->promiseJobs) { the->promiseJobs = 0; fxRunPromiseJobs(the); } // give finalizers a chance to run after the promise queue is empty fxEndJob(the); // if that added to the promise queue, start again } // at this point the promise queue is empty // .. the rest of the code checks for timer jobs

That's not mandatory: I think the JS spec allows engines to run the finalizer callbacks any time they want (as long as they happen in their own turn). The tricky part, for us, is metering. If the finalizer could happen at any arbitrary time, then it would get metered. The only thing a finalizer does is to examine the weakref and add non-revived vrefs to the deadset, which doesn't cause e.g. immediate syscalls, but even that would cause metering sensitivity to the state of the weakref:

function finalizeDroppedImport(vref) { const wr = slotToVal.get(vref); if (wr && !wr.deref()) { // we're in the COLLECTED state, or FINALIZED after a re-introduction deadSet.add(vref); slotToVal.delete(vref); // console.log(`-- adding ${vref} to deadSet`); } }

So our "metering is insensitive to GC timing" story already depends upon finalizers not running during the userspace phase.

If we were somehow not depending upon that, and finalizers could run at any old time, then a vref might or might not be in the deadSet when this convertValToSlot needed to delete it. We delete it because this might be a re-introduction of the vref: imagine a dispatch.deliver() crank which introduces o-4, then drops it, then performs a device invocation whose return value re-introduces o-4. If the finalizer could run in the middle of the crank, we'd have o-4 in our deadSet when the D() return value was unserialized and convertValToSlot('o-4') creates a new Presence, but o-4 is still in the deadSet. (Now that I think about it, we don't strictly need to delete it from deadSet.. processDeadSet will check the weakref anyways. But I think you and I decided that it would be tidier to delete it as soon as we know it's been re-introduced).

So in that case, deadSet.delete('o-4') would sometimes be deleting something, and sometimes not, depending upon this hypothetical non-deterministic finalizer callback having already run or not. If .delete's metering usage were different in these two situations, then now the metering of convertValToSlot (i.e. m.serialize()) would become sensitive to GC state, which we don't want, because we want serialization to be metered, because (at least) it could invoke user-provided getters, which would allow user code to escape metering.

If that deadSet.delete(slot) were replaced with if (deadSet.has(slot)) { deadSet.delete(slot); }, that'd be obviously metering-affecting. But I wasn't sure about the built-in deadSet.delete().. does it follow differently-metered code paths internally? I did an experiment and convinced myself that, at least right now, it does not: deleting an existing key and deleting a non-existing key both use the same metering.

So we're protected in two ways. 1: deadSet.delete()'s metering doesn't depend upon the presence/absence of the key, and 2: the key won't get added in the middle of a crank (the metered region) because that only happens in the finalizer callback, and our xsnap.c ensures that finalizers only happen at end-of-crank (when we call gcAndFinalize).

I'll update the comment to make it clear that the reason our finalizers run deterministically is because we wrote xsnap.c to make it that way, and that JS in general does not guarantee that in the slightest.

The `meterControl` tool will give liveslots a way to temporarily disable metering, so it can perform GC-sensitive operations without affecting the vat's metering results. Each supervisor provides their own version: all are dummy facades except for XS. refs #3458

We're trying to hedge against XS not necessarily performing "organic" (non-forced) GC at exactly the same across all members (validators) of a consensus machine, especially when some members have reloaded a vat from snapshot (i.e. restarted) recently and others have not. We rely upon finalizer callbacks only running at explicitly-deterministic times, but we must still guard against objects becoming collected spontaneously, which will cause a WeakRef to become "dead" (`wr.deref() === undefined`) at a random point in the middle of a turn. Any code which calls `wr.deref`, or is conditionally executed/skipped according to the results, is "GC-sensitive". This includes `convertSlotToVal`, and therefore `m.unserialize`. We cannot allow metering results to diverge between validators, because: * 1: it might make the difference between the crank completing successfully, and the vat being terminated for a per-crank metering fault * 2: it will change the large-scale Meter value, which is reported to userspace * 3: it might cause the runPolicy to finish the block earlier on one validator than on others all of which would cause a consensus failure. To prevent this, we run most of the "inbound" side of liveslots without metering. This includes the first turn of all `dispatch.*` methods, which runs entirely within liveslots: * `dispatch.deliver` performs argument deserialization in the first turn, then executes user code in the second and subsequent turns * `dispatch.notify` does the same * the GC deliveries (`dispatch.dropExport`, etc) only use one turn We also disable metering when deserializing the return value from a (synchronous) device call, and when retiring a promise ID (which touches `slotToVal`). Finally, we disable metering for all turns of the post-crank GC `finish()` call. This excludes all invocations of the finalizer callbacks, as well as all the `processDeadSet` code which is highly sensitive to the results. closes #3458

For some reason, most of the deserialization tests in test-marshal.js have been creating a full SwingSet controller and then throwing it away. This is an ancient leftover from the PlaygroundVat days, and removing it reduces the test's runtime from about 20 seconds to just 1.

The use of `await` inside a try/finally block would expose a subtle timing issue (if `thunk()` throws synchronously, the finalizer would run in the same turn as the thunk, rather than in a subsequent turn). Our recommended style avoids conditional `await` for just that reason, so this updates `runWithoutMeteringAsync` to use a better approach.

warner added the SwingSet package: SwingSet label Aug 12, 2021

warner added this to the Testnet: Metering Phase milestone Aug 12, 2021

warner requested a review from FUDCo August 12, 2021 19:40

warner self-assigned this Aug 12, 2021

warner mentioned this pull request Aug 12, 2021

fix(swingset): disable metering around GC-sensitive operations #3583

Closed

warner force-pushed the 3458-unmetered-liveslots branch 3 times, most recently from c621af7 to dfeff9e Compare August 13, 2021 01:34

FUDCo reviewed Aug 13, 2021

View reviewed changes

FUDCo approved these changes Aug 13, 2021

View reviewed changes

warner added 4 commits August 13, 2021 14:29

warner force-pushed the 3458-unmetered-liveslots branch from dfeff9e to 64e4f2f Compare August 13, 2021 21:29

warner merged commit 3335547 into master Aug 13, 2021

warner deleted the 3458-unmetered-liveslots branch August 13, 2021 21:52

This was referenced Feb 11, 2024

[Snyk] Fix for 1 vulnerabilities Kashuhackerone/agoric-sdk#269

Open

[Snyk] Fix for 1 vulnerabilities Kashuhackerone/agoric-sdk#270

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exclude liveslots deserialization code from metering #3667

exclude liveslots deserialization code from metering #3667

warner commented Aug 12, 2021

FUDCo Aug 13, 2021

warner Aug 13, 2021

erights Aug 13, 2021

warner Aug 13, 2021

FUDCo left a comment

FUDCo Aug 13, 2021

warner Aug 13, 2021

exclude liveslots deserialization code from metering #3667

exclude liveslots deserialization code from metering #3667

Conversation

warner commented Aug 12, 2021

FUDCo Aug 13, 2021

Choose a reason for hiding this comment

warner Aug 13, 2021

Choose a reason for hiding this comment

erights Aug 13, 2021

Choose a reason for hiding this comment

warner Aug 13, 2021

Choose a reason for hiding this comment

FUDCo left a comment

Choose a reason for hiding this comment

FUDCo Aug 13, 2021

Choose a reason for hiding this comment

warner Aug 13, 2021

Choose a reason for hiding this comment