more efficient/transactional state storage #54

Closed
warner opened this issue Sep 10, 2019 · 9 comments

warner commented Sep 10, 2019

Over in cosmic-swingset, https://github.com/Agoric/cosmic-swingset/issues/65 has been collecting ideas to improve the way we store the kernel state. Most of that work needs to happen here, in SwingSet, with an eye towards providing an API that cosmic-swingset can connect to.

warner commented Sep 14, 2019

Background

  • A "Vat Turn" executes a single cycle of the Javascript event loop. At the beginning and end of the Turn, the JS stack is empty, but there may be callbacks ready to run on the Promise queue. One Turn can cause a second Turn by using e.g. Promise.resolve().then(turn2).
  • A "Crank" executes a single SwingSet delivery (created by a syscall.send) or a notify (created by syscall.resolve) to a single Vat. We remove a delivery from the front of the kernel's run-queue, hand it to a Vat, then "turn the crank" on that vat until there's no work left to do. At the beginning and end of the Crank, both the JS stack and the Promise ready-to-run queue are empty (there may be unresolved Promises with .then callbacks attached, but there will be no callbacks attached to resolved Promises). The kernel uses setImmediate to wait for the end of the Crank, since the JS promise queue is higher priority than the IO/Timer queue. One Crank can cause a second Crank by using e.g. syscall.send().
  • A "Batch" is a set of Cranks that the kernel runs before committing state to durable storage. The Batch ends when the kernel's run-queue is empty, or when we've done too much work (i.e. in a blockchain-based machine, we've filled the block and need to commit it to start a new one).

The Batch starts when an inbound message (or ack) is delivered to the Mailbox device, and it enqueues a message for the comms vat to process. It can also start when a timer event is delivered to the Timer device, which behaves the same way.

The host loop is responsible for calling controller.run() whenever work needs to be done (i.e. the run-queue is non-empty). For now, controller.run() always runs to completion: it performs cranks until the run-queue is empty. In the future, when we have some gas mechanism, if the run-queue is very full, controller.run() may decide to stop before the queue is empty because we've done too much work for a single batch (and/or the block into which these transactions are being recorded has reached a size limit).

In a blockchain environment, there will probably be one Batch per block, performed at the end of the block after all DeliverTx transactions have been executed, filling the run-queue. Timed events may be triggered at the start of the block, and share the batch with any transactions that arrive at the same time.

Requirements

For SwingSet correctness, we must only commit the kernel state at the end of a crank. For blockchain correctness, we must only commit the kernel state at the same points that the underlying blockchain commits its state (at a block boundary). For performance, we might only commit kernel state at the end of a batch. There is a tradeoff between minimizing the number of database commit cycles (which is typically pretty expensive) and minimizing the progress that might be lost due to a crash.

There are a number of problems that could prevent the turn from completing, and we must be careful to not commit (partial) state from a failed turn:

  • If the Vat makes a syscall that references an invalid import, or attempts to resolve a Promise for which it is not the decider, the Vat must be destroyed
  • If the Vat uses more CPU or memory than the gas rules allow, the turn is abandoned, and the Keeper is asked about what to do. The Keeper might refresh the gas Meter and request the turn be re-tried, or it might abandon the message, or it might abandon the Vat entirely (we still have to figure out a reasonable set of policies and APIs for this). We might have a synchronous Keeper that can "feed the meter" and allow the turn to be resumed, rather than re-submitting it (which takes a lot more work, since we must discard the JS heap state and rebuild it to get back to the pre-delivery state). (TODO: can the Keeper get control between Turns, or only at the end of the Crank? If the latter, we might have trouble)
  • The entire kernel might be interrupted partway through the turn (the computer it runs on might be rebooted).

The kernel does not know that a batch has finished (it only sees the individual cranks), so it must be told explicitly when to save its state. The host does this by calling controller.saveState().

My Plan:

(which is subject to change, because halfway through writing this we had an epic meeting that resulted in Agoric/SwingSet#149 and now I don't know what to build anymore)

HostDB

The host is responsible for providing a "HostDB" object (endowment) to the kernel, with an API that takes string keys/values:

  • get(key) -> value or undefined
  • startTransaction()
  • set(key, value)
  • delete(key)
  • deletePrefix(prefix)
  • finishTransaction()

The get call is only legal outside of a transaction. Transactions cannot be nested. There is no "abortTransaction": the kernel is expected to finish what it starts. The behavior is semantically equivalent to an API that had get and an applyChanges() that would take an array of set/delete/deletePrefix operations, but does not require all the operations to be held in RAM at the same time (i.e. it could use an underlying database library to write the proposed changes to disk, then commit at finishTransaction).
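
To make the semantics concrete, here is a minimal sketch of a HostDB backed by a plain in-memory Map (purely illustrative; a real host would back this with LevelDB or similar, and the `makeMapHostDB` name is invented for this example):

```js
// Minimal in-memory sketch of the HostDB endowment described above.
function makeMapHostDB(backing = new Map()) {
  let inTransaction = false;
  return {
    get(key) {
      if (inTransaction) throw new Error('get() is only legal outside a transaction');
      return backing.get(key); // undefined if absent
    },
    startTransaction() {
      if (inTransaction) throw new Error('transactions cannot be nested');
      inTransaction = true;
      // a real implementation would open a DB write batch here
    },
    set(key, value) { backing.set(key, value); },
    delete(key) { backing.delete(key); },
    deletePrefix(prefix) {
      for (const key of [...backing.keys()]) {
        if (key.startsWith(prefix)) backing.delete(key);
      }
    },
    finishTransaction() {
      inTransaction = false;
      // a real implementation would atomically commit the write batch here
    },
  };
}
```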

We currently expect the host to use LevelDB or something similar to implement this API. We avoid requiring transactional semantics from the host's choice of database.

If the kernel is not given a HostDB (i.e. during unit tests that don't exercise state management), it will create a dummy one internally.

KernelDB

The kernel will implement a "KernelDB" layer which provides (in order of read priority):

  • a "crank buffer", which holds writes from a crank that might later be aborted
  • a "batch buffer", which holds writes from committed cranks until the entire batch is committed
  • a read cache, for performance and managing RAM footprint
  • a reference to the HostDB object for use as the ultimate backing store

The KernelDB exposes a key-value API which is somewhat similar to that of the HostDB, with get, set, delete, and deletePrefix. These are used by the kernel's state-managing objects (state/kernelKeeper, vatKeeper, deviceKeeper), which translate specific queries (like C-List lookups) into get() calls. Each get() first checks the crank buffer, then the batch buffer, then the read cache. If the key is not present in any of these places, it performs a hostdb.get() and adds the data to the read cache before returning it to the caller. Anything in the read cache can be evicted at any time.

Operations that modify state (like adding something to a C-List) do a set() into the crank buffer and do not touch anything else. They do not modify the batch buffer, because the crank might be aborted, nor do they write to the HostDB.

KernelDB also provides several control functions to the kernel itself (the "kernel" object, as constructed by buildKernel() in src/kernel/kernel.js). When the kernel finishes a crank, it will invoke kerneldb.finishCrank(), which pushes the contents of the crank buffer into the batch buffer, and erases the crank buffer.

Later, when the host does controller.saveState(), the kernel will tell KernelDB to write out its state by invoking kerneldb.finishBatch(). This causes KernelDB to send startTransaction() to HostDB, perform set() and other mutating calls for everything in the batch buffer, and then finally invoke finishTransaction() to commit the changes. It will copy all entries from the batch buffer into the read cache (or delete those read-cache entries), then erase the batch buffer.

The KernelDB API is thus:

  • get(key) -> value or undefined
  • set(key, value)
  • delete(key)
  • deletePrefix(prefix)
  • finishCrank()
  • finishBatch()
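
A rough sketch of how these pieces could fit together, assuming the HostDB shape described above; buffered deletions are modelled as `undefined` values, and deletePrefix and cache eviction are elided, so this is a simplification rather than the actual kernelKeeper code:

```js
// Sketch of the KernelDB layering: reads fall through crank buffer ->
// batch buffer -> read cache -> HostDB; writes land only in the crank
// buffer; finishCrank()/finishBatch() promote them outward.
function makeKernelDB(hostDB) {
  const crankBuffer = new Map();
  const batchBuffer = new Map();
  const readCache = new Map();
  const lookup = (buffer, key) =>
    buffer.has(key) ? { value: buffer.get(key) } : undefined;
  return {
    get(key) {
      const hit =
        lookup(crankBuffer, key) || lookup(batchBuffer, key) || lookup(readCache, key);
      if (hit) return hit.value;
      const value = hostDB.get(key); // fall through to the backing store
      readCache.set(key, value);
      return value;
    },
    set(key, value) { crankBuffer.set(key, value); },
    delete(key) { crankBuffer.set(key, undefined); },
    finishCrank() {
      for (const [key, value] of crankBuffer) batchBuffer.set(key, value);
      crankBuffer.clear();
    },
    finishBatch() {
      hostDB.startTransaction();
      for (const [key, value] of batchBuffer) {
        if (value === undefined) hostDB.delete(key);
        else hostDB.set(key, value);
        readCache.set(key, value); // keep the read cache coherent
      }
      hostDB.finishTransaction();
      batchBuffer.clear();
    },
  };
}
```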

warner referenced this issue in warner/SwingSet Sep 19, 2019
.. as the index for kernel state data structures, instead of user-provided
vatName/deviceName strings. This will make it easier to use key/value state
storage (#144), since the keys will be more uniform (and shorter, and not
vulnerable to people using the separator string in their vat name).

Also it will make adding dynamic Vats easier (#19), as we'll just increment a
counter instead of asking the user to discover a unique name.

closes #146
warner referenced this issue in warner/SwingSet Sep 19, 2019
.. as the index for kernel state data structures, instead of user-provided
vatName/deviceName strings. This will make it easier to use key/value state
storage (#144), since the keys will be more uniform (and shorter, and not
vulnerable to people using the separator string in their vat name).

Also it will make adding dynamic Vats easier (#19), as we'll just increment a
counter instead of asking the user to discover a unique name.

closes #146
warner referenced this issue in Agoric/SwingSet Sep 19, 2019
.. as the index for kernel state data structures, instead of user-provided
vatName/deviceName strings. This will make it easier to use key/value state
storage (#144), since the keys will be more uniform (and shorter, and not
vulnerable to people using the separator string in their vat name). They look
like "vNN" and "dNN", where NN is a positive integer.

This will also make adding dynamic Vats easier (#19), as we'll just increment
a counter instead of asking the user to invent a unique name.

closes #146
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This is the first phase of #144, to narrow the API we need from a backing
store. All state mutations go through the keepers, and the keepers deal
exclusively with an object that uses string keys and string values. We still
JSON-serialize this object on demand (at the end of the batch/block).

The next phase will replace the buildVatController/buildKernel APIs to accept
a Storage object, with get/set/getRange/deleteRange methods, and have the
JSON-serialized backing store object live outside the kernel (in the host). A
subsequent phase will introduce commitCrank/commitBlock methods on that
Storage object, then a final phase will replace the JSON backing store with a
proper database.
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This is the first phase of #144, to narrow the API we need from a backing
store. All state mutations go through the keepers, and the keepers deal
exclusively with an object that uses string keys and string values. We still
JSON-serialize this object on demand (at the end of the batch/block).

The next phase will replace the buildVatController/buildKernel APIs to accept
a Storage object, with get/set/getRange/deleteRange methods, and have the
JSON-serialized backing store object live outside the kernel (in the host). A
subsequent phase will introduce commitCrank/commitBlock methods on that
Storage object, then a final phase will replace the JSON backing store with a
proper database.
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This is the first phase of #144, to narrow the API we need from a backing
store. All state mutations go through the keepers, and the keepers deal
exclusively with an object that uses string keys and string values. We still
JSON-serialize this object on demand (at the end of the batch/block).

The next phase will replace the buildVatController/buildKernel APIs to accept
a Storage object, with get/set/getRange/deleteRange methods, and have the
JSON-serialized backing store object live outside the kernel (in the host). A
subsequent phase will introduce commitCrank/commitBlock methods on that
Storage object, then a final phase will replace the JSON backing store with a
proper database.
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This is the first phase of #144, to narrow the API we need from a backing
store. All state mutations go through the keepers, and the keepers deal
exclusively with an object that uses string keys and string values. We still
JSON-serialize this object on demand (at the end of the batch/block).

The next phase will replace the buildVatController/buildKernel APIs to accept
a Storage object, with get/set/getRange/deleteRange methods, and have the
JSON-serialized backing store object live outside the kernel (in the host). A
subsequent phase will introduce commitCrank/commitBlock methods on that
Storage object, then a final phase will replace the JSON backing store with a
proper database.
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This is the first phase of #144, to narrow the API we need from a backing
store. All state mutations go through the keepers, and the keepers deal
exclusively with an object that uses string keys and string values. We still
JSON-serialize this object on demand (at the end of the batch/block).

The next phase will replace the buildVatController/buildKernel APIs to accept
a Storage object, with get/set/getRange/deleteRange methods, and have the
JSON-serialized backing store object live outside the kernel (in the host). A
subsequent phase will introduce commitCrank/commitBlock methods on that
Storage object, then a final phase will replace the JSON backing store with a
proper database.
warner referenced this issue in Agoric/SwingSet Sep 27, 2019
This rewrites the kernel state management, in support of #144. Previously, kernel/vat state was stored in a composite object, with everything being JSON-serialized into a single big string at the end of the batch/block. In addition, the kernel promise-management code was a bit casual about mutating that state from outside the keepers.

In phase 1, we still use JSON-serialization, but the object is now reduced to a strict string/string key-value store, and the storage/keeper layer returns hardened objects (so all mutations must go through the storage API).

In phase 2 (also in this PR), the internal API is refactored to reflect the upcoming DB interface. The JSON-serialized KV object is wrapped in a has/get/set -style API.

This PR retains compatibility with the existing host API (i.e. `buildVatController()` still accepts an `initialState=` string, and still has a `.getState()` method that returns a string). The next phases will replace this with a host-provided DB object.
warner referenced this issue in Agoric/SwingSet Oct 8, 2019
This removes c.getState(). Instead, the host should retain control over the
hostDB object it provides to the controller, so the host can choose when the
hostDB should commit a block's worth of changes.

The kernel's Keepers use a read-cache to minimize cross-Realm calls and
hostDB operations. This read-cache does not yet evict any entries, so a
future task is to build some basic LRU-like policy for it. But the largest
performance problem right now, the vat transcript, is specifically *not* kept
in the cache, so memory usage should be reduced somewhat even without proper
eviction.

refs #144

Extra thanks to @gamedevsam in #164 for a recommendation in kernelKeeper, to
replace for-loop starting points with named constants.
warner referenced this issue in Agoric/SwingSet Oct 8, 2019
This removes c.getState(). Instead, the host should retain control over the
hostDB object it provides to the controller, so the host can choose when the
hostDB should commit a block's worth of changes.

The kernel's Keepers use a read-cache to minimize cross-Realm calls and
hostDB operations. This read-cache does not yet evict any entries, so a
future task is to build some basic LRU-like policy for it. But the largest
performance problem right now, the vat transcript, is specifically *not* kept
in the cache, so memory usage should be reduced somewhat even without proper
eviction.

refs #144

Extra thanks to @gamedevsam in #164 for a recommendation in kernelKeeper, to
replace for-loop starting points with named constants.
warner referenced this issue in Agoric/SwingSet Oct 8, 2019
This removes c.getState(). Instead, the host should retain control over the
hostDB object it provides to the controller, so the host can choose when the
hostDB should commit a block's worth of changes.

The kernel's Keepers use a read-cache to minimize cross-Realm calls and
hostDB operations. This read-cache does not yet evict any entries, so a
future task is to build some basic LRU-like policy for it. But the largest
performance problem right now, the vat transcript, is specifically *not* kept
in the cache, so memory usage should be reduced somewhat even without proper
eviction.

refs #144

Extra thanks to @gamedevsam in #164 for a recommendation in kernelKeeper, to
replace for-loop starting points with named constants.
warner commented Oct 15, 2019

https://github.com/Level/awesome is a pretty amazing list of NPM modules in the LevelDB-ish ecosystem. I'm still trying to figure out how to deal with the asynchronous nature of the abstract-leveldown API, but if we can overcome that, the awesome list includes all sorts of LRU caches, copy-on-write layers, sub-tree management, merkle-dag structures, and transaction management.

warner commented Oct 15, 2019

Using Level-ecosystem modules in the kernel (e.g. for the CrankBuffer or the Read Cache) would require these modules to work under SES, which might be easy or might be difficult, depending upon how enthusiastic they are for subdependencies.

The whole abstract-leveldown API imposes an async boundary on all operations (including get(), which the actual LevelDB does synchronously, and put() which synchronously pushes the update into the kernel). I understand the conservative/contagious nature of async APIs all too well, and there are certainly Level-shaped backends that require it, but it's unfortunate that I can't use this ecosystem in a synchronous way.

The SwingSet kernel has a narrow interface with the host Realm, to prevent leaking objects across the Realm boundary (which could enable a sandbox breach). While we could build a Membrane that wraps Promises, it would be a hassle.

The more difficult aspect is that the kernel really wants to be able to handle syscalls synchronously. A syscall.send() requires a number of c-list entries to be retrieved, to translate object/promise refs into kernel-table indexes, which might update the c-list tables too. Then it must modify the run-queue or the promise table with the new message to be delivered. If the c-list lookups require an async Promise turn, the syscall.send() must return a Promise, and a subsequent syscall.send() cannot be allowed to run until the first one has finished modifying the tables (e.g. if the first send exports a Vat's object for the first time, it will allocate a new kernel object id, and the second send must get the same ID).

Our Vat execution model allows Vat Code (the ocap-layer) to use Promises internally; in particular, inbound method invocations can return a Promise and the liveSlots layer does the right thing. But method sends (x~.foo()) should not require the liveSlots layer (and thus the ocap layer) to stall waiting for the syscall.send() to complete. Since most syscalls don't return anything, we could work around this by queueing all syscalls, not really executing any of them until control had safely returned to the kernel. Vats wouldn't observe any async behavior as they run, but the kernel would spend some time stalling on state lookups and sets after each Crank, before proceeding to pull the next item off the run-queue. However, we anticipate device nodes providing synchronous returns (e.g. a Balance module for manipulating gas tokens with ERTP Purses), and that wouldn't work if syscalls must all be queued.

Dean and I talked through some ideas, and didn't find anything really satisfactory. For now, we're going to use the "really aggressive read cache" approach. We require that hosts provide synchronous DB access as defined by the current storageAPI, and to do this with LevelDB, the host must read the entire state vector into memory at startup, so it can satisfy synchronous reads as the kernel operates. Writes cause the in-memory cache to be updated, as well as getting flushed out to disk for next time.

This isn't great, but it gets me unblocked, and still reduces our memory footprint somewhat: we don't need to serialize a full copy of the state for each write. The memory footprint should be equal to one full copy of the state vector, plus a small epsilon during each write.

The next-better approach is to recognize that the transcripts could be fetched asynchronously, since we only need them during startup, and can take as much time as we want. We append to the transcript during operation, but never need to read from it. And transcripts are the most annoying part of the state vector, since they grow forever. To benefit from this distinction, we might want a getAsync() method in storageAPI.

warner transferred this issue from Agoric/SwingSet Dec 1, 2019
warner added the SwingSet package: SwingSet label Dec 1, 2019
dckc pushed a commit to dckc/agoric-sdk that referenced this issue Dec 5, 2019
warner commented Dec 18, 2019

Notes from today's conversation with Chris:

We're trying to address a use case where external users (in aggregate) have access to a "large number" (one zillion, in metric units, written 1Z) of Purses and Payments in some Issuer Vat. We have a lot of tables whose cardinality will be 1Z:

  • the Comms Vat has a bunch of c-lists (one per remote machine), and the union of all their entries points to 1Z object imports (o-1234)
  • the Comms Vat's transcript contains 1Z messages, whose earlier delivery is what created those entries
  • the kernel-side c-list for the Comms Vat has 1Z entries mapping o-1234 to kernel objects like ko5678
  • the kernel object table tracks the owner of 1Z objects: ko5678 -> vat[issuer]
  • the kernel-side c-list for the Issuer Vat has 1Z entries mapping ko5678 to vat exports like o+9876
  • the Issuer Vat, in its current transcript-based orthogonal-persistent form, has a javascript Map with 1Z entries that map vat exports like o+9876 to javascript Purse/Payment objects with various methods on them
  • the Issuer Vat has 1Z Purse/Payment objects
  • the Issuer Vat's transcript contains the 1Z messages that originally created and retrieved Purses/Payments

There are smaller tables too: the Comms Vat is talking to lots of remote machines, but not 1Z of them.

And there are other use cases that are slightly easier to deal with: 1Z messages to/from a Vat which doesn't create separate objects for each one (e.g. a simple counter: the only state is a single integer, but the transcript is very long). We could solve this case with specialized state management for the transcript, but it wouldn't help the more general "1Z objects" case.

[whiteboard photo: IMG_6048]

(the half-box on the left is the Comms Vat, the half-box on the right is the Issuer Vat, and the kernel is in the middle)

Our job is to move all of these zillion-sized tables off to secondary storage: on disk, not in RAM.

Synchronous Host/Kernel HostDB API

Our plan is to stick with a synchronous HostDB API:

  • bool = has(key)
  • value = get(key) (or undefined)
  • set(key, value)
  • del(key)

(our HostDB API also defines an iter = getKeys(start, end), which is currently only used for debugging; we may or may not keep this around)

To implement this in Node, we will need to write an FFI extension to provide bindings to the LevelDB C++ interface, as the abstract-leveldown ecosystem is entirely async. An alternative is to write low-level C++ code that splits a Node process into two threads, with the SwingSet kernel+vats in one (where HostDB calls block waiting for a queue/semaphore), and the leveldown code in the other (where each Promise triggers a semaphore write when it resolves). This alternative might give us access to more of the ecosystem, but would involve more surgery. I think we should start with the basic blocking LevelDB bindings.

We then extend the syscall interface (available to vats) with the same API. The low-level ("liveslots") code in each Vat gets access to a key-value store that is scoped to that one Vat, which obeys the usual Crank/Block transaction rules. This does not increase the authority of the Vat: it is not (directly) available to higher-level objects within the Vat, and cannot be used to communicate with other Vats.

Notably, the syscall.get API returns synchronously: the kernel routes this call all the way through to the host right away (if necessary, as there are a Crank Buffer and a Block Buffer in the way, which might satisfy the read first). This allows the Vat to use secondary storage as if it were RAM, without additional turns or ordering concerns, and without obligating it to hold all of its state in RAM at all times.
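
One plausible way to provide that per-Vat scoping (an assumption on my part, not a settled key format) is for the kernel to prefix every key with the vatID before consulting the Crank Buffer / Block Buffer / HostDB stack:

```js
// Hypothetical per-Vat scoping of the storage syscalls by key prefix.
// `kernelDB` is the buffered get/set/delete stack described earlier.
function makeVatStorageSyscalls(vatID, kernelDB) {
  const prefix = `v${vatID}.vs.`; // illustrative prefix, not a committed format
  return {
    has: key => kernelDB.get(prefix + key) !== undefined, // approximation of has()
    get: key => kernelDB.get(prefix + key),
    set: (key, value) => kernelDB.set(prefix + key, value),
    del: key => kernelDB.delete(prefix + key),
  };
}
```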

Delivery Processing

At the start of each "Crank", the kernel pulls the next message off the run-queue by reading a key from the HostDB. (The entire run-queue is currently stored in a single key, because it has the highest churn rate, and because the quiescent state is empty, but we might want to revisit this some day). This kernel message is translated into a Vat message:

  • Look up the target in the kernel object/promise table, to find out which Vat is responsible for it (the exporter of an object, or the decider of a promise), or if the message should be appended to an in-kernel promise queue. This should always succeed.
  • If the target is an unresolved Promise that does not have a Decider registered (i.e. the owner of the Promise is willing to participate in promise pipelining), the message is added to the promise table's queue without further processing. This causes writes to the promise table.
  • If the target is a resolved Promise that is not a legal delivery target (resolved-to-data, or rejected), an error resolution is constructed and queued to resolve the result promise, if any. This causes writes to the run-queue.
  • Otherwise the message will be delivered to a Vat, so we look up the target in the target Vat's c-list. This should also succeed, because a kernel object reference wouldn't be added to the kernel object table without first being exported by some Vat.
  • Look up all the message slots in the target Vat's c-list, adding new imports as necessary. Each slot which the kernel object table records as being owned by the target Vat should also be present in the Vat's c-list (this should be true for the same reason as above). This step will cause c-list writes for newly-imported object/promise references.

These steps take place before the Vat is invoked, so the kernel could take as long as it wants to perform the HostDB lookups and writes. The host does not know ahead of time what keys will be read, so for this phase the kernel is in control.
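
Sketched as code, this routing phase might look roughly like the following (the helper names are invented for illustration; each of them reads and writes through the buffered HostDB rather than in-RAM tables):

```js
// Rough sketch of the per-delivery routing steps listed above.
function routeMessage(db, message) {
  const { target } = message; // e.g. 'ko5678' or 'kp42'
  if (target.startsWith('kp')) {
    const p = getKernelPromise(db, target);
    if (p.state === 'unresolved' && !p.decider) {
      appendToPromiseQueue(db, target, message); // pipelining: park it on the promise
      return undefined; // nothing to deliver yet
    }
    if (p.state === 'fulfilledToData' || p.state === 'rejected') {
      enqueueErrorResolution(db, message.result, p); // reject the result promise instead
      return undefined;
    }
  }
  // Deliver to the Vat responsible for the target (exporter or decider).
  const vatID = getOwnerVat(db, target);
  const vk = vatKeeper(db, vatID);
  const vatTarget = vk.mapKernelSlotToVatSlot(target); // must already exist
  const slots = message.slots.map(s =>
    vk.mapKernelSlotToVatSlot(s, { addIfMissing: true }), // c-list writes for new imports
  );
  return { vatID, delivery: { ...message, target: vatTarget, slots } };
}
```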

Syscall Processing

The kernel then submits the Vat message for delivery to a Vat, by calling dispatch.deliver() or some flavor of dispatch.notify, and waits for it to complete (by waiting for a setImmediate callback to fire, which ensures the Promise queue has been drained, and the Vat can no longer get control until the next dispatch). During this time, the Vat can make syscalls by invoking syscall.send, syscall.resolve, or one of the new storage APIs (has/get/set/del).

The syscall.send and syscall.resolve calls are queued for later. The storage API calls are handled by the Crank Buffer, which keeps a local copy of any state modification until the end of the Crank (whereupon they are either flushed to the Block Buffer, if the delivery succeeded, or discarded, if the vat was terminated). Storage APIs get an immediate answer, which may require the kernel to invoke the HostDB to get answers from disk.

When the Crank is finished and the Vat gives up control to the kernel, we examine the queued send/resolve calls for legality. Each call has a slots vector that can name imported (o-12 or p-12) and exported (o+13 or p+13) objects and promises. send calls have a target that must name an import, and resolve calls have a target that must name an unresolved Promise for which this Vat is the Decider. The kernel looks up the imports and targets in the Vat's c-list, which queries the HostDB (through the buffers). If any are missing, or if resolve references the wrong kind of Promise, the Vat is terminated (for illegal access). The Crank Buffer is discarded (including all changes made via the Storage API), the Vat itself is deleted, we (somehow) cauterize all exports, and we (somehow) decref all imports to allow the owning Vats to GC their objects. We must still remove the delivery from the run-queue, so this run-queue write should happen after the crank buffer is abandoned.

If all imports are present and all targets are legal, the Crank will succeed, so the kernel proceeds to execute each queued send/resolve syscall in the same order they were made. The c-list is queried for each target and slot (using the HostDB as before), and any new exports are added (causing HostDB writes into the Crank Buffer). The translated message is appended to the run-queue. At this point, the Crank is complete, and the crank buffer (including all Storage API changes) gets written out to the Block Buffer.

Neither the kernel object/promise tables, nor any c-lists, are kept in memory: all entries are fetched from the HostDB on demand. This removes three of the 1Z-sized tables.

Transcript Processing

While the Vat runs, all the syscalls it makes (both send/resolve and Storage API) are logged by writing keys into the crank buffer. This "transcript entry" contains one delivery and zero or more syscalls. If the crank succeeds, these writes will be flushed to the Block Buffer, and will eventually land on disk when the block is complete.

When the kernel starts, it will rebuild each Vat, one at a time. After loading the Vat's initial code, the kernel will start fetching transcript entries, one at a time. It must have all the syscalls (specifically the return values of the StorageAPI calls) in memory before it can invoke the Vat's dispatch method. The method is invoked, and the syscalls it makes are compared against the transcript. If any mismatch is detected, the Vat is terminated (for anachrophobic behavior). No syscalls are actually executed, so the replay does not require access to c-lists or Storage API keys.

Once the delivery is complete, the transcript entry is forgotten, and the next one is fetched from the HostDB. No HostDB writes take place during the replay phase.
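
As a small illustration of that replay check (the transcript-entry shape and the way the checking syscall handler is installed are assumptions, not the real SwingSet data structures):

```js
// Replay one recorded delivery. The vat's syscalls are compared against the
// recorded ones and their recorded results are returned; nothing is actually
// executed, so no c-list or Storage API access is needed during replay.
function replayDelivery(dispatch, entry, terminateVat) {
  let index = 0;
  const replaySyscall = vatSyscallObject => {
    const expected = entry.syscalls[index];
    index += 1;
    if (!expected || JSON.stringify(vatSyscallObject) !== JSON.stringify(expected.call)) {
      terminateVat('anachrophobia: syscall divergence during replay');
      throw new Error('replay mismatch');
    }
    return expected.response; // e.g. the recorded StorageAPI get() result
  };
  // `entry.d` is the recorded delivery; the checking syscall handler stands in
  // for the real one for the duration of this dispatch.
  return dispatch(entry.d, replaySyscall);
}
```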

In the future, we may implement "demand paging" of Vats, where we defer loading a Vat (and replaying its transcript) into memory until someone sends it a message. We might also "page out" idle Vats, by dropping their dispatch object, allowing GC to free all their in-memory objects. This will result in transcripts being read after kernel startup.

This removes two of the 1Z-sized tables (the transcripts on both comms and issuer).

Large-Scale Vats

To complete the RAM scaling story, we must also avoid keeping 1Z objects in any of the Vats' heaps.

The Comms Vat is responsible for looking up incoming "remote object reference" identifiers (like ro+13) in the per-remote-machine c-list table. Legal remote references are translated into vat import identifiers (like o-14), which can be put into a syscall. Rather than keeping a Map in RAM, we have the Comms Vat do a get() syscall for each of these lookups. This avoids keeping 1Z entries in a Javascript Map in RAM, or 1Z strings pointed to by the Map.

The Comms Vat might also avoid using its heap for any state, keeping absolutely everything in kernel-provided secondary storage. If done properly, this would remove the need for orthogonal persistence, and the Vat could elect to turn off transcript creation and replay. This would save 1Z deliveries at restart time, at least for the Comms Vat.

The Issuer Vat keeps track of a balance and Assay for each Purse and Payment referenced by incoming object identifiers. This is currently managed by a Map in the "liveslots" layer, which maps from exported object reference (o+13) to a pass-by-presence Javascript object (e.g. the Purse). This object has some methods like deposit and withdraw and getBalance. It may close over access to a larger table, and/or it may be a key in a WeakMap, to support things like confirming the Purse and Payment come from the same Issuer.

To avoid 1Z entries in RAM, neither of these maps can be used. Instead, we must rewrite the Issuer to perform get() lookups on demand, and set() calls to record the consequences.

This means a big change to the programming model for Vats that need to handle large numbers of objects. We've brainstormed about ways to avoid the change. One possibility is a magic WeakMap-like object that responds to get() and set() by issuing storage-API syscalls. We would need some way for the high-level Vat code to build and configure this object, which would be in cahoots with the low-level Liveslots layer. Certain incoming method arguments would be short-lived marker objects, recognized by the WeakMap and used to determine the o+14-style key that should be passed to get().
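
For example, such a store-backed map might look roughly like this (the key scheme and the `keyFor` helper are hypothetical; the point is only that get/set become storage syscalls instead of heap lookups):

```js
// Hand-wavy sketch of a "magic WeakMap-like" store whose entries live in
// secondary storage. `syscall` is the vat-scoped has/get/set/del interface;
// keyFor() maps a short-lived marker object to its 'o+14'-style identity.
function makeVirtualPurseStore(syscall, keyFor) {
  return {
    has(marker) {
      return syscall.has(`purse.${keyFor(marker)}`);
    },
    get(marker) {
      const raw = syscall.get(`purse.${keyFor(marker)}`);
      return raw === undefined ? undefined : JSON.parse(raw);
    },
    set(marker, state) {
      syscall.set(`purse.${keyFor(marker)}`, JSON.stringify(state));
    },
  };
}
```

A deposit() would then read the balance record, mutate it, and write it back, so only the purses actually touched in a given crank are materialized in RAM.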

We will also need changes like this to support schema/code upgrade.

Browser-Hosted Storage

Web browsers offer somewhat-durable storage in the form of APIs like IndexedDB, which are strictly asynchronous. However all secondary storage for Vats must be kept in RAM, because the kernel must respond to get() syscalls synchronously, and we have no means to extend the browser with synchronous APIs.

The simplest design is thus to keep all HostDB data in a plain Map, in RAM. At startup, all keys are read from IndexedDB (asynchronously) to populate this Map, after which the kernel is started and can read from the Map synchronously. At the end of each Block, the modified/deleted keys are written to IndexedDB (for the next reboot or page-load), then used to update the Map (for reads that happen before the next reload).
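
A sketch of that arrangement, assuming a single IndexedDB object store named 'kv' and the has/get/set/del HostDB shape from above (illustrative only, not a worked-out host):

```js
// Load the entire state vector from IndexedDB into a Map at startup.
function loadAllFromIndexedDB() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('swingset-host', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('kv');
    req.onerror = () => reject(req.error);
    req.onsuccess = () => {
      const db = req.result;
      const map = new Map();
      const tx = db.transaction('kv', 'readonly');
      const cursorReq = tx.objectStore('kv').openCursor();
      cursorReq.onsuccess = () => {
        const cursor = cursorReq.result;
        if (cursor) {
          map.set(cursor.key, cursor.value);
          cursor.continue();
        }
      };
      tx.oncomplete = () => resolve({ db, map });
      tx.onerror = () => reject(tx.error);
    };
  });
}

// Synchronous HostDB facade over the in-RAM Map; modified keys are mirrored
// back to IndexedDB at the end of each Block.
function makeBrowserHostDB(db, map) {
  const dirty = new Map(); // key -> value, or undefined to mark a deletion
  return {
    has: key => map.has(key),
    get: key => map.get(key),
    set: (key, value) => { map.set(key, value); dirty.set(key, value); },
    del: key => { map.delete(key); dirty.set(key, undefined); },
    commitBlock() {
      const tx = db.transaction('kv', 'readwrite');
      const store = tx.objectStore('kv');
      for (const [key, value] of dirty) {
        if (value === undefined) store.delete(key);
        else store.put(value, key);
      }
      dirty.clear();
      return new Promise((res, rej) => {
        tx.oncomplete = res;
        tx.onerror = () => rej(tx.error);
      });
    },
  };
}
```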

As noted below, many items (kernel c-lists, object tables, and transcripts) could be handled better, because our programming model doesn't obligate us to provide synchronous access to them. If we allow the host to distinguish the keys used to store transcript items, we could have it drop all transcript entries from the Map just after replay has finished, and write new transcript items only to IndexedDB (not the Map). If we extended the HostDB interface to allow async access (either returning a Promise, or arranging for later delivery of the requested keys), then we could omit all these items from the Map, and have the kernel perform these lookups between Vat invocations, where they can take as long as they need.

But the basic rule will probably be that Browser-hosted SwingSet machines cannot scale to anything involving a zillion objects. It may be possible for them to handle 1Z transcript entries without a large RAM footprint, but not large numbers of concurrent objects.

Rejected/Future Alternatives

Transcript replay is controlled by the kernel, not the Vat, so we could afford to have an async HostDB interface for the subset of keys that hold the transcript. One possibility would be a HostDB API that returns a Promise (although the SES split between kernel realm and primal realm makes this worrisome). Another would be a callback-based interface (slightly easier to membrane). A third would be for the kernel to request access to a range of keys, and then (on a later turn) the host could call a kernel API to supply the values (in order). The kernel should prevent reentrant calls while waiting for this response, just like it should prohibit a kernel.run() while waiting for the setImmediate to fire.

send/resolve syscalls are processed after the Vat yields control back to the kernel, so we could afford an async HostDB interface for the c-list and kernel-object/promise tables. We have the same API choices as for the transcript replay phase.

However, neither helps with Vat access to secondary storage to avoid keeping large tables in RAM. This must either be synchronous (necessitating a synchronous HostDB API), or we would have to change the Vat dispatch API to have a two-phase prepare/commit scheme. In this scheme, the prepare() function gets the message and a table of key/value pairs, and it returns additional keys that will be needed. Once it stops asking for more, the commit() function is invoked with the same values. prepare doesn't get access to mutable state, and may be invoked multiple times. commit() gets run exactly once, and must succeed, with only the keys/values that prepare said it needs. All syscalls would have empty return values, and thus could be queued for processing after the delivery completes. This seems like a big change to the programming model, and maybe not a good fit for the kind of code we want to host. On the other hand, our planned approach for large-scale vats is already a change to the model.

Dean had an idea about using composite identifiers like ko12.abc, where the kernel treats ko12.abc and ko12.xyz the same way, but delivers the .abc all the way through to the target Vat. The idea was to avoid the 3Z table entries (1Z in the comms-vat clist, 1Z in the kernel object table, and the third 1Z in the Issuer-vat clist). Imagine taking the set of "small" objects being granted and drawing a big circle around them (named "ko12"): the kernel only knows about the big circle, leaving the 1Z details to the vats on either side. I think this effectively gives the comms-vat full access to the big circle, possibly everything ever exported by the issuer vat. If we're comfortable doing that, then a simpler approach is to just let ko12 point to an Issuer object with an API that takes "abc" or "xyz" as an argument: hoist the object representation out of the kernel and into the Vats, letting them deal with the access control issues. So I'm not sure that composite identifiers solve or simplify anything.

Dean seemed to feel that the Storage API should be done through Devices, rather than being provided directly by the kernel. I want to explore this more, but currently I think:

  • Secondary storage is a pretty fundamental thing to provide to computation: Linux provides sbrk() as a system call, not a device node.
  • Letting the kernel observe get() and set() might make it easier to meter storage usage and deduct gas
  • The kernel knows when a Crank is finished, and Devices do not, so it may be easier to get transactionality right by having the kernel's Crank Buffer hold Storage writes
  • Related to that, secondary storage for a Vat evolves at the same pace as the kernel's c-lists for that Vat, so storing them both in the same place should make it easier to keep them in sync
  • A possible counter-argument is if we think platform-level tables, like the cosmos-sdk "Bank" module and its ledger, should be managed through this Storage API. I'm inclined to say no, and use a Device to manage them.

The HostDB API might benefit from an iterator-based range query. We currently walk e.g. transcripts by incrementing a counter and querying has() to see if that key is present. If so, we do a get() and use the resulting value. When we implement Vat termination, we will need a similar mechanism to delete all keys associated with a Vat. This approach imposes some requirements on our choice of keys, to enable this kind of manual enumeration. A fancier API could remove this requirement and make this cheaper, enabling the host side to pre-fetch multiple keys (and LevelDB's on-disk structure supports very efficient range queries). However, the SES primal/kernel-realm split requires us to wrap iterators carefully, and this may preclude some other HostDB API approaches in the future (WASM import functions, etc).

The Storage API syscalls we offer to Vats might benefit from a range query as well, although this interface is even harder to augment, since we need to record everything for the transcript, and the syscall boundary is even more likely to wind up crossing languages eventually.

To benefit from the leveldown ecosystem, and also enable large-scale Vats in a browser environment (with IndexedDB), we would need to change our execution model to tolerate async retrieval of Vat state.

@michaelfig

I think the "split into two threads and use a semaphore to block one" alternative is actually not that hard. Node.js already has the notion of worker threads (https://nodejs.org/api/worker_threads.html), which takes care of the threading and communication. For synchronization, I'd turn to the power of *nix, and use a filesystem FIFO as the semaphore. No need for FFI.

Fundamentally, for communication, our main thread just needs to call receiveMessageOnPort at the right time to poll the message port, and block if the message is not available.

I have a sample of how all this works together: https://gist.github.com/michaelfig/6d199ea95eab3ebdb918359612123a3e
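
Roughly, the shape of that pattern looks like the sketch below; note that it blocks on Atomics.wait over a SharedArrayBuffer rather than a filesystem FIFO, which is one possible blocking primitive and not necessarily what the gist does:

```js
// Sketch: the worker owns the (async) database; the main thread posts a
// request, blocks until the worker signals, then drains the reply
// synchronously with receiveMessageOnPort.
const {
  Worker, MessageChannel, receiveMessageOnPort, isMainThread, workerData,
} = require('worker_threads');

if (isMainThread) {
  const { port1, port2 } = new MessageChannel();
  const signal = new Int32Array(new SharedArrayBuffer(4));
  new Worker(__filename, { workerData: { port: port2, signal }, transferList: [port2] });

  // The synchronous get() that a HostDB endowment could expose to the kernel.
  const syncGet = key => {
    Atomics.store(signal, 0, 0);
    port1.postMessage({ method: 'get', key });
    Atomics.wait(signal, 0, 0); // block this thread until the worker notifies
    return receiveMessageOnPort(port1).message.value;
  };
  console.log(syncGet('demo-key'));
} else {
  const { port, signal } = workerData;
  // Stand-in for a real async leveldown get(); illustrative only.
  const asyncDbGet = async key => `stored-value-for-${key}`;
  port.on('message', async ({ key }) => {
    const value = await asyncDbGet(key);
    port.postMessage({ value });
    Atomics.store(signal, 0, 1);
    Atomics.notify(signal, 0); // wake the blocked main thread
  });
}
```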

warner commented Dec 20, 2019

Hmm. Ok, so in one sense, the host can do whatever it likes as long as it provides the right (sync) API. So the kernel would be unaware of the complexity of multiple threads. Host authors would be obligated to know about it (although for Node.js-specific environments we could hide that complexity in a library). The host is also responsible for managing the block-level transactionality: choosing when to update the durable state from which the next incarnation will restart, keeping it aligned with any other stored state (e.g. cosmos-sdk blocks), and withholding externally-visible messages until the state that generated them has been recorded (hangover inconsistency).

I'd be worried about performance... we're basically serializing all reads and writes and shipping them through the kernel to the other thread, deserializing them again, then delivering them to the leveldown storage backend. Maybe not a big deal, but I'd want to keep it in mind.

I suspect we could reduce the complexity a little bit by using pipe(2) rather than a FIFO, and not involve the filesystem at all. I also suspect we could fork off a separate process to manage the storage, instead of using Node's worker APIs. That might also make it more obvious for a non-Node host author to figure out how to satisfy the synchronous HostDB API.

Do you think the worker/multiprocess approach is feasible to integrate into the golang/cosmic-swingset world? There are even more threads going on there. An FFI interface might be a simpler component to use.

And we need to figure out a clean way to do this in an XS world. For that, I think we have to do an FFI equivalent anyway, since all host authorities are coming from C code.

On the whole I'm more inclined to do an FFI thing, but maybe this trick would let us put that off for a while.

@michaelfig

> Do you think the worker/multiprocess approach is feasible to integrate into the golang/cosmic-swingset world? There are even more threads going on there. An FFI interface might be a simpler component to use.

It's not really any more complex than adding a single additional worker thread. Multiprocess would need the same thing, backed by a socket/IPC of some sort.
We'd still need synchronous access to it, and the point of this hack is to keep the logic for the database access in Javascript.

For XS, I'd suggest multiprocess, with a blocking interprocess message-passing C primitive rather than implementing a Worker interface for it or having to write everything in C. We can do all kinds of cheating (such as shared memory) if we find this is too slow.

> I suspect we could reduce the complexity a little bit by using pipe(2) rather than a FIFO, and not involve the filesystem at all.

I've only seen pipe(2) available as the node-pipe2 FFI extension, and couldn't get it to work. I think some magic with libuv is happening that prevents me from using fs.writeSync and fs.readSync successfully, whereas going all the way to the filesystem is reliable.

I'm thinking that writing FFI for a simple block/unblock mechanism (such as C++17 semaphore) that is processwide would be even better than trying to wrangle with FIFOs or file descriptors, especially to port to Windows.

Thanks for your comments.

warner commented Jan 29, 2020

We'll need to decide how to break down this task. The HostDB stuff is already in place, but it's backed by an all-RAM Map instance which gets written out to disk after each crank/block.

warner commented Mar 6, 2021

HostDB is now implemented (outside of tests) by a LevelDB backend, which offers a synchronous JS API.

The "reduce large tables" task is covered by a number of smaller tickets:

So I think we can close this now.

warner closed this as completed Mar 6, 2021