Discussion of garbage collection of resolved kernel promises #1049

Closed
FUDCo opened this issue May 4, 2020 · 6 comments

Labels: bug (Something isn't working), SwingSet (package: SwingSet)

Comments

FUDCo (Contributor) commented May 4, 2020

Vat Bob:

function build(E, log) {
  let p1, r1;
  let p2, r2;
  return {
    genPromise1() {
      [p1, r1] = makePR();
      return p1;
    },
    genPromise2() {
      [p2, r2] = makePR();
      return p2;
    },
    usePromises(pa, pb) {
      r1(pb);
      r2(pa);
    },
  };
}

Vat Alice:

function build(E, log) {
  return {
    bootstrap(argv, vats) {
      const pa = E(vats.bob).genPromise1();
      const pb = E(vats.bob).genPromise2();
      const pc = E(vats.bob).usePromises([pa], [pb]);
    },
  };
}
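
The makePR helper isn't shown above; presumably it's a small utility that returns a promise together with its resolver, along these lines (a sketch for readers following along, not the actual test fixture):

function makePR() {
  // Capture the resolver so the caller can settle the promise later.
  let r;
  const p = new Promise((resolve) => {
    r = resolve;
  });
  return [p, r];
}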

The fourth crank, where the usePromises message is delivered, yields this in the console log:

processQ {"type":"send","target":"ko21","msg":{"method":"usePromises","args":{"body":"[[{\"@qclass\":\"slot\",\"index\":0}],[{\"@qclass\":\"slot\",\"index\":1}]]","slots":["kp40","kp41"]},"result":"kp42"}}
@ko21 <- usePromises([@kp40], [@kp41]) : @kp42
Add mapping k->v v1.c.kp42<=>v1.c.p-62
syscall[v1].subscribe(vat:p-60=ker:kp40)
syscall[v1].subscribe(vat:p-61=ker:kp41)
syscall[v1].fulfillData(p-62/kp42) = {"@qclass":"undefined"} []/[]
Delete mapping v1.c.kp42<=>v1.c.p-62
syscall[v1].fulfillData(p-60/kp40) = [{"@qclass":"slot","index":0}] ["p-61"]/["kp41"]
Delete mapping v1.c.kp40<=>v1.c.p-60
Adding kernel promise kp43 for v1
Add mapping v->k v1.c.kp43<=>v1.c.p+5
syscall[v1].fulfillData(p-61/kp41) = [{"@qclass":"slot","index":0}] ["p+5"]/["kp43"]
Delete mapping v1.c.kp41<=>v1.c.p-61
Adding kernel promise kp44 for v1
Add mapping v->k v1.c.kp44<=>v1.c.p+6
syscall[v1].fulfillData(p+5/kp43) = [{"@qclass":"slot","index":0}] ["p+6"]/["kp44"]
Delete mapping v1.c.kp43<=>v1.c.p+5
Adding kernel promise kp45 for v1
Add mapping v->k v1.c.kp45<=>v1.c.p+7
syscall[v1].fulfillData(p+6/kp44) = [{"@qclass":"slot","index":0}] ["p+7"]/["kp45"]
Delete mapping v1.c.kp44<=>v1.c.p+6
Adding kernel promise kp46 for v1
Add mapping v->k v1.c.kp46<=>v1.c.p+8
syscall[v1].fulfillData(p+7/kp45) = [{"@qclass":"slot","index":0}] ["p+8"]/["kp46"]
Delete mapping v1.c.kp45<=>v1.c.p+7
Adding kernel promise kp47 for v1
Add mapping v->k v1.c.kp47<=>v1.c.p+9
syscall[v1].fulfillData(p+8/kp46) = [{"@qclass":"slot","index":0}] ["p+9"]/["kp47"]
Delete mapping v1.c.kp46<=>v1.c.p+8
Adding kernel promise kp48 for v1
Add mapping v->k v1.c.kp48<=>v1.c.p+10
syscall[v1].fulfillData(p+9/kp47) = [{"@qclass":"slot","index":0}] ["p+10"]/["kp48"]
Delete mapping v1.c.kp47<=>v1.c.p+9
Adding kernel promise kp49 for v1
Add mapping v->k v1.c.kp49<=>v1.c.p+11
syscall[v1].fulfillData(p+10/kp48) = [{"@qclass":"slot","index":0}] ["p+11"]/["kp49"]
Delete mapping v1.c.kp48<=>v1.c.p+10
Adding kernel promise kp50 for v1
Add mapping v->k v1.c.kp50<=>v1.c.p+12
syscall[v1].fulfillData(p+11/kp49) = [{"@qclass":"slot","index":0}] ["p+12"]/["kp50"]
Delete mapping v1.c.kp49<=>v1.c.p+11
Adding kernel promise kp51 for v1
Add mapping v->k v1.c.kp51<=>v1.c.p+13
syscall[v1].fulfillData(p+12/kp50) = [{"@qclass":"slot","index":0}] ["p+13"]/["kp51"]
Delete mapping v1.c.kp50<=>v1.c.p+12
Adding kernel promise kp52 for v1
Add mapping v->k v1.c.kp52<=>v1.c.p+14
syscall[v1].fulfillData(p+13/kp51) = [{"@qclass":"slot","index":0}] ["p+14"]/["kp52"]
Delete mapping v1.c.kp51<=>v1.c.p+13
Adding kernel promise kp53 for v1
Add mapping v->k v1.c.kp53<=>v1.c.p+15
syscall[v1].fulfillData(p+14/kp52) = [{"@qclass":"slot","index":0}] ["p+15"]/["kp53"]
Delete mapping v1.c.kp52<=>v1.c.p+14
Adding kernel promise kp54 for v1

... repeats until interrupted from the outside, with the promise IDs incrementing the whole time.

This appears to be a liveslots issue. Aside from whatever the underlying bug is, it points out the need for the liveslots code to be subject to the same metering as everything else that runs in a vat.

Note that the single-promise analog of this, where Bob is:

function build(E, log) {
  let p, r;
  return {
    genPromise() {
      [p, r] = makePR();
      return p;
    },
    usePromise(pa) {
      r(pa);
    },
  };
}

and Alice is:

function build(E, log) {
  return {
    bootstrap(argv, vats) {
      const pa = E(vats.bob).genPromise();
      const pb = E(vats.bob).usePromise([pa]);
    },
  };
}

terminates successfully after 5 cranks, leaving a single kernel promise whose value refers to itself in an array:

kp40.data.body :: [{"@qclass":"slot","index":0}]
kp40.data.slots :: kp40
kp40.refCount :: 1
kp40.state :: fulfilledToData

This is the kind of unreferenced loop that we don't expect reference counts to catch.

FUDCo added the bug and SwingSet labels on May 4, 2020
warner (Member) commented May 5, 2020

Chip and I talked this through today. Some notes:

The fundamental difficulty is that we're trying to make two claims at the same time:

  • 1: userspace can create arbitrary object graphs, with cycles(!), and transmit them (pass-by-copy), and expect that the remote side gets something functionally equivalent (for a particular value of "equivalent" that we can nail down more precisely)
  • 2: we do not promise (ahem) to retain the identity of resolved Promise objects

We think the problem Chip found arose from liveslots trying to serialize a cycle without being allowed identities that remained stable for the entire duration of the serialization. It was asked to resolve p1 to [p2], which it did just fine, retiring the current VPID of p1 in the process. Then it was asked to resolve p2 to [p1], for which it had to create a new VPID (call it "p1new") for p1, since it dropped the old one when p1 was resolved. Creating p1new requires attaching a p1.then callback so liveslots can tell the kernel when p1 resolves: this is a new VPID, and liveslots doesn't know that it has already resolved. When that then callback fires, it sees that p1 resolved to [p2], which has to be serialized, but the VPID for p2 has been retired by this point. And the cycle repeats, forever.

Serializing any cyclic data structure requires some notion of pointers or identities that can be compared to break the cycles. These identities must be stable until we've finished traversing the graph. In liveslots terms, this means we can't retire the VPIDs until we've finished all the resolves we're doing in a given crank.
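
To make that concrete, here is a minimal, hypothetical sketch (not liveslots code) of why a cycle-breaking serializer needs identities that stay stable for the whole traversal:

// Hypothetical cycle-aware serializer. The `ids` map must stay intact until
// the traversal finishes: if an entry were retired partway through, revisiting
// that node would allocate a fresh id and the cycle would never be detected.
function serializeGraph(root) {
  const ids = new Map(); // node -> id, stable for the duration of the traversal
  let nextId = 0;
  function walk(node) {
    if (ids.has(node)) {
      return { ref: ids.get(node) }; // cycle broken by reusing the stable id
    }
    const id = nextId;
    nextId += 1;
    ids.set(node, id);
    return { id, children: node.children.map(walk) };
  }
  return walk(root);
}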

Liveslots doesn't currently get to know when the crank is done. We deny it access to setImmediate and the IO/timer queue, specifically so that the kernel can know, for sure, when the vat code has lost agency. The kernel must not proceed to other operations until it knows this vat can no longer get runtime without receiving another dispatch call. "don't speak until spoken to", or "hold still while I'm looking elsewhere".

As a result, liveslots doesn't know when it becomes safe to retire the VPIDs.

Our tentative plan (sketched in code after this list) is:

  • have liveslots accumulate these ready-to-be-retired VPIDs in a list somewhere
    • each time liveslots calls syscall.fulfill*, it adds the newly-resolved promise to the list
    • but it leaves them in the table (valToSlot and slotToVal) in case subsequent serializations in the same turn, including resolutions, might reference them
  • after the kernel sees its setImmediate fire, marking the end of the main portion of the crank, the kernel makes a second dispatch into the vat, maybe dispatch.endOfCrank
    • the kernel uses setImmediate again, to know when the vat has lost agency
  • liveslots uses this endOfCrank to invoke a new syscall.dropSlots([]), passing all the VPIDs (and potentially object ids) that it accumulated
    • at the same time, liveslots removes those entries from its tables
    • the kernel removes those entries from the c-lists upon receipt
  • liveslots is not particularly expected to make other syscalls during this second phase, but we didn't think of a reason to prohibit it
  • liveslots is not expected to give userspace code a chance to run here, which reduces the set of things liveslots might have to deal with
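
Here is a rough sketch of the vat (liveslots) side of that plan. The names dispatch.endOfCrank and syscall.dropSlots are the tentative ones from this discussion; the table handling is a simplified assumption, not the real liveslots implementation:

// Simplified, hypothetical sketch of the retirement bookkeeping in liveslots.
// `syscall`, `valToSlot`, and `slotToVal` stand in for the real liveslots state.
function makeRetirementTracker(syscall, valToSlot, slotToVal) {
  const pendingRetirements = []; // VPIDs resolved during the current crank

  function noteResolved(vpid) {
    // Leave the vpid in the tables for now, so later resolutions in this same
    // crank can still refer to it; just remember to retire it afterwards.
    pendingRetirements.push(vpid);
  }

  // Invoked via the second dispatch (dispatch.endOfCrank), after the kernel's
  // setImmediate has fired and the main portion of the crank is over.
  function endOfCrank() {
    if (pendingRetirements.length === 0) {
      return;
    }
    syscall.dropSlots(pendingRetirements.slice());
    for (const vpid of pendingRetirements) {
      const val = slotToVal.get(vpid);
      slotToVal.delete(vpid);
      valToSlot.delete(val);
    }
    pendingRetirements.length = 0;
  }

  return { noteResolved, endOfCrank };
}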

We might need some new terminology. Each "crank" is now followed by a second phase, maybe the "post-crank cleanup" or something. This second phase has the same kind of timing as the normal crank (it consists of one or more turns, and ends when the promise queue is empty), but it begins with a second dispatch call. The first dispatch call is always a dispatch.send or one of the dispatch.notifyFulfill* flavors; the second is always dispatch.endOfCrank.

Some alternatives/variants we discussed:

  • I (warner) kind of like the idea of the kernel doing deadSlots = dispatch.bringOutYourDead() for the second phase: return a list of dead slots, rather than making syscalls to itemize them. Partly for the pun, partly because I think it might tie in to a future GC point based upon serializing the VM and restoring it, or on some funky WeakRef-based GC opportunity. This post-crank second phase is a chance for liveslots to get control without the user-level code having control, which might be a good time to allow finalizers to run, or reveal the results of their previous execution. Or we might periodically save and reload the entire vat, and learn what slots are unreachable during the reload process, so the first thing we call after reload is bringOutYourDead() to see what we can clean up.
    • but Chip preferred the dispatch+syscall approach
  • Chip felt it was mostly equivalent to do this second phase either at the end of the crank, or at the beginning of the next crank, and he did a test where liveslots just did the appropriate cleanup at the beginning of the next crank without an explicit dispatch operation (i.e. at the beginning of every dispatch, check for accumulated resolved VPIDs and remove them from the table). In this approach, the kernel removes those entries immediately, and the vat removes them at the end of the next crank.
    • I felt that it's better to do it at the end of the crank, rather than at the beginning of the next crank. I want to leave the vat in a clean state. "deal with your problems before going to sleep, rather than waking up into an immediate nightmare". But it's not a huge difference either way.
  • the synchronous return of a dispatch.bringOutYourDead() would be a bit new/unusual/gross; we don't currently pay attention to the result of a dispatch (other than logging any error thrown by the first turn, which is handy for a lot of common errors). But it's not as significant as synchronous returns from syscalls.

We'll talk this through with @dtribble tomorrow. I suspect his intuitions on easier ways to solve this might depend upon properties of Midori or E in which resolved promises are effectively identical to the thing they resolve to, which (for various reasons) is not what we're currently doing in swingset.

FUDCo (Contributor, Author) commented May 5, 2020

A couple of additional points on this:

  • While it seems satisfactory to me for liveslots to do its cleanup at the start of the next crank, this won't work for cleaning up the c-lists on the kernel side of the vat/kernel boundary. The reason is that by the time the vat gets control at the start of the crank, it is already too late from the kernel's perspective: the kernel has already deserialized the message it is delivering into the vat, and that deserialization might reference what should be a dead c-list entry. The c-lists must be cleaned up between cranks in order to ensure that any promise reference entering the vat is either a reference that is already live inside the vat or a new import into the vat.

  • I'm not actually wild about the dispatch+syscall approach. The variant I was proposing was for liveslots and the kernel to each take responsibility for cleaning up their own stuff (which is pretty much how things worked before we ran into the problem discussed here), with the added complication that each has to hold off on any promise-retirement activity until the vat has quiesced at the end of the crank. That is what doing the vat-side cleanup at the start of the next crank accomplished, albeit at the cost of having to invoke the cleanup at the front of each dispatch entry point (deliver, fulfillToData, fulfillToPresence, and reject), which is messy from a separation-of-concerns perspective and also an opportunity to make a mistake if we add new dispatch entry points in the future; however, it's possible we could augment the dispatch mechanism itself to take care of this. On the kernel side, we're already looking at having to do some kind of "save up changes until the end of the crank, then do them in a batch" thing for kernel promise reference counting, so I'm less worried about that. There is a concern, though, that having the kernel infer when it's safe to throw away what it believes about the vat couples the vat and kernel more tightly than perhaps they should be.

  • All that said, since I was already 90% of the way to having most of the machinery for a "save changes, clean up at crank end" scheme on the kernel side for other reasons, just doing that as a quick-and-dirty check on our intuitions seems like a worthwhile expenditure of effort (it's only a few minutes of coding, I think). If our intuitions are correct then we can go ahead and implement a cleaner solution (one of the approaches Brian outlined above), but if our intuitions are wrong we'll fail fast.

dtribble (Member) commented May 5, 2020

Three options to generally resolve this:

  1. drop references that have no embedded promises
  2. drop references that have no already-exported promises
  3. add support for explicit dropping of references, possibly as part of the deterministic GC scheme

warner (Member) commented May 6, 2020

Dean's note was recording three phases of our planned solution:

phase 1: drop references that have no embedded promises

In both directions (liveslots resolves the promise and tells the kernel via syscall.fulfill*, and also kernel notifies some other vat of the resolution via dispatch.notifyFulfill*), both sides use the following algorithm to decide whether the promise-id can be dropped:

  • if the resolution was with fulfillToPresence, the answer is yes
  • if the resolution is fulfillToData or reject, look at the list of slots. If there are any promises in these slots, the answer is no. If there are no promises, the answer is yes.

We think of this as a "90% solution" (ok maybe 80%). It doesn't require a lot of deep thinking and should still get us a useful performance win (in the form of reduced RAM and state-vector usage). It retires any result promise that's resolved to a single Presence, or to records with plain data and/or Presences. A lot of simple functions do this.
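
As a sketch, the retirement test might look like this, assuming resolutions carry a type tag plus a capdata-style { body, slots } record, with slot strings shaped like the ones in the log above (illustrative only, not the actual implementation):

// Hypothetical phase-1 check: can the resolved promise-id be retired right away?
function canRetireImmediately(resolutionType, data) {
  if (resolutionType === 'fulfillToPresence') {
    return true; // resolved to a single Presence: always safe to retire
  }
  // fulfillToData or reject: safe only if no promises appear among the slots.
  const hasPromise = data.slots.some(
    (slot) => slot.startsWith('p') || slot.startsWith('kp'),
  );
  return !hasPromise;
}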

We defer implementing the new dispatch.notifyEndOfCrank and syscall.dropSlots for now.

phase 2: drop references that have no already exported promises

This is the "95% solution". The general idea is that resolutions which only reference brand new promises cannot introduce cycles that involve previously-existing promises, and wouldn't cause the recursion problem. Dean said a lot of ERTP/Zoe functions will return a record full of (new) Promises, and this would let their result promises be retired.

For the sending side (vat does syscall.fulfill*), the vat looks at the liveslots tables to see whether the referenced promises are already present or not. I'm not sure how we'll implement this: Dean suggested measuring the size of the tables before+after serialization and seeing if it changed, but we use the same table for both objects and promises. Perhaps we modify the tables to maintain a count of just promise allocation. Or we introduce a generation number and a WeakMap that maps each Promise to an integer, and then after serialization we look at generationMap.get(slotToVal.get(promiseSlot)) > previousGeneration to see if anything is new.
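
A sketch of the generation-number variant (hypothetical; generationMap and the surrounding flow are assumptions, not existing liveslots code):

// Hypothetical generation tracking for the sending side.
let currentGeneration = 0;
const generationMap = new WeakMap(); // Promise -> generation when first seen

// Record the generation the first time a promise is exported/serialized.
function notePromise(promise) {
  if (!generationMap.has(promise)) {
    generationMap.set(promise, currentGeneration);
  }
}

// After serializing a resolution: did it reference any promise that already
// existed before this serialization began? If so, do not retire the vpid.
function referencesPreexistingPromise(promiseSlots, slotToVal, previousGeneration) {
  return promiseSlots.some(
    (slot) => generationMap.get(slotToVal.get(slot)) <= previousGeneration,
  );
}

// Usage sketch: remember previousGeneration, bump currentGeneration, serialize,
// then retire the resolved vpid only if referencesPreexistingPromise(...) is false.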

On the receiving side (kernel does dispatch.notifyFulfill*), it's a bit easier, because all the kernel is doing is a linear mapping of slot IDs (kpid and koid) into vat-side references (vpid and void), and it can just ask the c-list if the kernel-id is present before asking it to provide a new one.
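
That check might look something like the following (hypothetical; `clist` stands in for the kernel's per-vat c-list, with has/get/allocate as assumed helpers):

// Hypothetical kernel-side mapping during dispatch.notifyFulfill*: a slot only
// counts as "already exported" if the c-list contains it before this delivery.
function mapSlotsCheckingForOldPromises(clist, kernelSlots) {
  let sawPreexistingPromise = false;
  const vatSlots = kernelSlots.map((kslot) => {
    if (kslot.startsWith('kp') && clist.has(kslot)) {
      sawPreexistingPromise = true; // previously-known promise: don't retire
    }
    return clist.has(kslot) ? clist.get(kslot) : clist.allocate(kslot);
  });
  return { vatSlots, sawPreexistingPromise };
}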

phase 3: something more clever

This would be the part where we add a new syscall and/or dispatch method, and have liveslots accumulate the slots to drop at the end of the crank. This would be maybe a 98% case: it covers more situations, but we still need improved GC to release everything that's no longer reachable.

I (warner) am keen on making the drops explicit. My thought is that which slots to drop is a policy decision, made based upon various heuristics about utility/performance savings and soundness, and that explicit syscalls would make the invariants easier to maintain (if the kernel only drops entries in response to a syscall.dropSlots, the kernel and vat tables should always be aligned, even if the vat makes weird choices as to when to exercise its slot-dropping power). It'd be a bit weird if/because both sides get to drop things: the vat would call syscall.dropSlots after syscall.fulfill* (and the kernel should modify the c-lists to match), but on the receiving side maybe the kernel would direct the vat to delete entries after/during dispatch.notifyFulfill*. Maybe we pick one side or the other, and both sending and receiving result in e.g. the vat calling syscall.dropSlots. This will probably be easier to think about once we know how GC is going to work, where we don't want strong references pointing in both directions all the time.

I think Dean felt it was better to use implicit/inferred drops for the easy cases of resolved promises, and explicit calls for the more complicated cases we figure out later. He was concerned about the performance costs of having more syscalls, as well as what I think of as "debuggor ergonomics": how hard is it for the human (the "debuggor", you know, like "operator") performing the task of debugging (using a "debugger", the assistant software) to focus on the data they need. More syscalls means more noise in the logs, more things to ignore, more single-step iterations to get through to the important parts, etc.

FUDCo changed the title from "Mutually referential promises in arrays causes infinite kernel recursion" to "Garbage collection of resolved kernel promises" on May 20, 2020
FUDCo (Contributor, Author) commented May 20, 2020

I retitled this issue from the bug that it originated with to reflect the more general discussion of the kernel promise GC problem that it morphed into. We can use this issue to track work on that.

FUDCo (Contributor, Author) commented May 21, 2020

Superseded by #1124. Closing.
