hard to correlate message sends and deliveries in slogfile #6501

warner · 2022-10-27T00:10:32Z

What is the Problem Being Solved?

@mhofman and @dckc have been writing tools to parse slogfiles, to build up Causeway-style visualizations (sometimes going through OTEL spans and tooling like Honeycomb/DataDog).

One problem they've run into is how to correlate a syscall.send() that emits a message (method invocation on some target object/promise) with the subsequent dispatch.deliver() that hands it into a vat. The easy/cheap approach is to assume that each syscall.send() uses a unique result promise, and that result promises are never forwarded to other promises. In that case, you can look for the dispatch.deliver() whose result= matches the sender's result=. Even if the target of the message is a promise (so delivery might need to wait until that promise gets resolved, adding a second cause to the delivery), the result will still show up only once, on the final delivery.
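A minimal sketch of that cheap approach, as an analysis tool might implement it in Python. The entry shapes are assumptions: the `ksc` send layout follows the example later in this issue, while the `deliver`/`kd` layout and the `crankNum` field are hypothetical stand-ins for whatever the real slog records.

```python
def correlate_by_result(entries):
    """Pair each syscall.send with the dispatch.deliver carrying the same result kpid.

    Relies on the two assumptions above: every send uses a unique result
    promise, and result promises are never forwarded.
    """
    sends = {}   # result kpid -> crank of the syscall.send
    pairs = []   # (result kpid, sending crank, delivering crank)
    for entry in entries:
        if entry["type"] == "syscall" and entry["ksc"][0] == "send":
            result = entry["ksc"][2].get("result")
            if result is not None:  # a send without a result cannot be matched this way
                sends[result] = entry["crankNum"]
        elif entry["type"] == "deliver" and entry["kd"][0] == "message":
            result = entry["kd"][2].get("result")
            if result in sends:
                pairs.append((result, sends[result], entry["crankNum"]))
    return pairs


# Hypothetical slog excerpt: two sends in crank 7, one delivery in crank 9.
sample_slog = [
    {"type": "syscall", "crankNum": 7, "ksc": ["send", "ko12", {"methargs": "...", "result": "kp34"}]},
    {"type": "syscall", "crankNum": 7, "ksc": ["send", "ko15", {"methargs": "...", "result": "kp35"}]},
    {"type": "deliver", "crankNum": 9, "kd": ["message", "ko12", {"methargs": "...", "result": "kp34"}]},
]
pairs = correlate_by_result(sample_slog)
```

The second send stays unmatched until a delivery with `kp35` appears, which is the behavior the uniqueness assumption implies.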

A similar assumption can be applied to trace the causal connection between syscall.resolve() and the subsequent dispatch.notify() for the zero or more subscribers of the promise. As long as the kpid doesn't get re-used, the analysis tool can build a table indexed by kpid, which lists both the resolving crank/syscall and the list of notification cranks (deliveries).
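The kpid-indexed table might be built like this. It is only a sketch: the `resolve` and `notify` entry layouts and the `crankNum` field are assumptions made for illustration, not a statement of the actual slog format.

```python
def index_by_kpid(entries):
    """Build {kpid: {"resolved_in": crank, "notified_in": [cranks]}}.

    Assumes a kpid is never re-used, so each row describes one promise.
    """
    table = {}

    def row(kpid):
        return table.setdefault(kpid, {"resolved_in": None, "notified_in": []})

    for entry in entries:
        if entry["type"] == "syscall" and entry["ksc"][0] == "resolve":
            # assumed layout: ["resolve", vatID, [[kpid, rejected, data], ...]]
            for kpid, _rejected, _data in entry["ksc"][2]:
                row(kpid)["resolved_in"] = entry["crankNum"]
        elif entry["type"] == "deliver" and entry["kd"][0] == "notify":
            # assumed layout: ["notify", [[kpid, resolution], ...]]
            for kpid, _resolution in entry["kd"][1]:
                row(kpid)["notified_in"].append(entry["crankNum"])
    return table


# Hypothetical excerpt: one resolution in crank 3, two subscribers notified later.
sample_slog = [
    {"type": "syscall", "crankNum": 3, "ksc": ["resolve", "v1", [["kp40", False, "data"]]]},
    {"type": "deliver", "crankNum": 5, "kd": ["notify", [["kp40", {"state": "fulfilled"}]]]},
    {"type": "deliver", "crankNum": 6, "kd": ["notify", [["kp40", {"state": "fulfilled"}]]]},
]
table = index_by_kpid(sample_slog)
```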

But, this only works within a single kernel, where we can use the kpid as the index. If we're looking at two different kernels (e.g. a chain and an ag-solo), then we don't have a shared reference namespace to do the correlation. Without something extra, the analysis tool would need to look at the internal states of the two comms vats (e.g. look for their syscalls and interpret the messages they exchange) to build up a table from kref to rref, which is a hassle, and only works if the tool starts looking early enough (at the first import/export, the one that establishes the comms c-list entries). We'd have the same problem within a single kernel if we were only looking at vrefs.

It might be handy if each message had an extra "message ID" that would enable correlation of sends and receives (and also of resolves/notifies) independent of the more functional properties of a syscall or message. This would also remove the assumptions about uniqueness of result kpids.

Description of the Design

The nominal solution is for each vat to add a unique identifier (the "message ID") to the arguments of its syscall.send and syscall.resolve. The kernel would copy this messageID into the resulting run-queue item(s), and then later into the VatDeliveryObject we send into a vat. The slogfile analysis tools could then index on the messageID instead of resultKPID. The comms vat would copy messageID into the remote messages, so they could be correlated across kernels.

The interesting constraints on this nominal solution are:

- global uniqueness, despite the lack of any other coordination between vats or between kernels
- privacy of the vat's internal state that shouldn't be visible to external parties
- deterministic execution of each vat

These constraints also affect things like how our censored assert.details Error-logging system assigns Error#1234 counters to each Error object during logging and serialization.

The simplest way to make the identifier unique is to use an incrementing counter within each vat, attach the vat's unique-within-the-kernel vatID (e.g. v123) to it, and then to attach a kernel's (globally) unique ID to that. We don't have a globally-unique kernel ID now, but we could imagine adding one to the swingset config file (obligating the host application to create a random UUID during initialization), or allowing controller.initializeSwingset() to allocate a random one. Then either the slogfile could record the kernel ID once (and tools remember it, and use it as a scope for the vatID+counter message-IDs they see elsewhere), or we change the slogfile to record the full kernelID+vatID+counter on each message (eww, but it would make the parsing tools simpler).

However, that causes a privacy problem: outside parties get to learn how many messages you have sent, which might reveal internal details that they shouldn't be able to observe. In a chain environment, we've already lost most confidentiality, but it would be unfortunate to bake anti-privacy features into SwingSet when it can still be used in more private modes (e.g. three solo machines talking to each other).

One expression of this problem would appear among three vats (vatA/B/C) in a single kernel. VatA sends msg1 to VatB, then msg2 to VatC, then msg3 to VatB. By comparing messageID counter values in the messages it receives, VatB can deduce that VatA sent an extra message to somebody else in between msg1 and msg3, even though VatB was not a party to that interaction. This could reveal internal state about VatA which it desired to conceal from VatB.
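The deduction VatB makes is trivial to automate, which is what makes a plain counter a leak. A small illustration, where the helper name and the use of bare integers as counters are hypothetical:

```python
def missing_counters(observed):
    """Return the counter values that fall in the gaps between observed ones."""
    seen = sorted(observed)
    return [n for lo, hi in zip(seen, seen[1:]) for n in range(lo + 1, hi)]


# VatA numbers its sends 1, 2, 3. VatB receives msg1 and msg3 only.
vat_b_observed = [1, 3]
leaked = missing_counters(vat_b_observed)  # counters VatA used for somebody else
```

Here `leaked` contains the counter of msg2, so VatB learns that one message went elsewhere without ever seeing it.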

Within a kernel we could e.g. slog the messageID in the ksc/kd kernel-flavored part of the slog entry, but strip it from the vsc/vd (VatDeliveryObject) vat-flavored/transformed part, thus hiding it from all vats but still exposing it in the slog for analysis tools. But that would mean stripping it from the comms vat, preventing its use in cross-kernel event correlation. And if we added a mode to privilege the comms vat (by retaining the messageID), we'd be expanding this attack across kernels: kernelB could learn about the activity of vats inside kernelA that it shouldn't.

The second simplest way to achieve uniqueness is for each vat to create a completely random identifier for each message. This won't expose any internal state, because nothing outside the vat could predict the messageID values and thus count the number of missing ones. But vats are supposed to be deterministic: making random numbers is well beyond their abilities.

One simplifying trick would be for the kernel to provide a unique seed during startVat, and then the vat can use a KDF to hash that seed plus a per-message counter, generating a secure pseudorandom ID for each message. The vat's behavior would remain a deterministic function of its inputs (including the secret seed), but would remain unpredictable to any party which doesn't know the seed (i.e. other vats or other kernels). The kernel could, in turn, build the per-vat seed by hashing a per-kernel secret with the counter-based vatID. The kernel would need to derive an independent value for use as its (public) random kernelID component. The whole stack would look like:

- config.kernelUniqueSecret = makeRandomBytes(32) (host provides once, during initialization)
  - chains must use a hard-coded value from genesis block, of course
- kernel: kernelUniqueID = HKDF(kernelUniqueSecret, 'kernelUniqueID')
- kernel: vats[vatID].vatUniqueSecret = HKDF(kernelUniqueSecret, 'vatUniqueSecret:'+vatID)
  - dispatch.startVat({ bundle, vatParameters, vatUniqueSecret })
- vat: vatMessageID = HKDF(vatUniqueSecret, 'messageID:' + counter); counter++
- kernel: kernelMessageID = kernelUniqueID + ':' + vatID + ':' + vatMessageID

HKDF-SHA256 is my favorite key derivation function, but a simple hash of concatenated values would probably be enough.
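The derivation stack can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not SwingSet code: the function names are hypothetical, HKDF is implemented per RFC 5869 with an empty salt, and IDs are hex-encoded at full length rather than trimmed.

```python
import hashlib
import hmac
import secrets


def hkdf_sha256(secret: bytes, info: str, length: int = 32) -> bytes:
    """HKDF-SHA256 (RFC 5869) with an empty salt."""
    # Extract: an absent salt is treated as HashLen zero bytes
    prk = hmac.new(b"\x00" * 32, secret, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i)
    okm, block, index = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info.encode() + bytes([index]), hashlib.sha256).digest()
        okm += block
        index += 1
    return okm[:length]


def derive_kernel_id(kernel_unique_secret: bytes) -> str:
    """kernel: public kernelUniqueID, independent of the per-vat secrets."""
    return hkdf_sha256(kernel_unique_secret, "kernelUniqueID").hex()


def derive_vat_secret(kernel_unique_secret: bytes, vat_id: str) -> bytes:
    """kernel: per-vat secret, handed to the vat at startVat."""
    return hkdf_sha256(kernel_unique_secret, "vatUniqueSecret:" + vat_id)


def message_ids(vat_unique_secret: bytes):
    """vat: deterministic but unpredictable stream of message IDs."""
    counter = 0
    while True:
        yield hkdf_sha256(vat_unique_secret, f"messageID:{counter}").hex()
        counter += 1


def kernel_message_id(kernel_id: str, vat_id: str, vat_message_id: str) -> str:
    """kernel: full identifier as recorded in the slog."""
    return f"{kernel_id}:{vat_id}:{vat_message_id}"


# Host provides the secret once; a chain would use a fixed genesis value instead.
kernel_unique_secret = secrets.token_bytes(32)
ids = message_ids(derive_vat_secret(kernel_unique_secret, "v12"))
example = kernel_message_id(derive_kernel_id(kernel_unique_secret), "v12", next(ids))
```

Replaying the vat with the same secret regenerates the same IDs, which is what keeps the vat deterministic.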

The message IDs could be trimmed to reduce visual bloat/noise in the slogfiles (to make things easier for humans to read), subject to an increased probability of collisions. The birthday bound approximation says n samples from a set of size H yield a collision probability of about n^2 / (2*H). I like to work with negative log2 probability units: 1 ubit (short for "unlikeliness in bits" or perhaps "unfortubits") is 50%, one-in-a-thousand is about 10 ubits (1-in-1024 would be exactly 10 ubits), one-in-a-million is about 20 ubits. If we measure our number of samples in sbits (1024 samples is 10 sbits), and our ID hashes in hbits (a 256-bit SHA256 output is 256 hbits), then the approximation is ubits = 1 + hbits - 2*sbits, or hbits = ubits + 2*sbits - 1.

If we take the conservative assumption that the analysis tool might observe 1M unique IDs per vat (sbits = 20), and we want accidental collisions to occur less than once in a million trials (ubits = 20), then we need 20+2*20-1 = 59-bit identifiers. This could be achieved with 12 base32 characters, or 10 base62 characters (avoiding the two visually-noisy and cut/paste-annoying punctuation marks).
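The sizing arithmetic can be checked with a few lines of Python. The function names are hypothetical; the formulas are the birthday approximation and the ubits/sbits/hbits relation stated above.

```python
import math


def collision_probability(n_samples: int, set_size: int) -> float:
    """Birthday-bound approximation: n^2 / (2*H)."""
    return n_samples * n_samples / (2 * set_size)


def id_bits(ubits: int, sbits: int) -> int:
    """hbits = ubits + 2*sbits - 1."""
    return ubits + 2 * sbits - 1


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Smallest number of characters that can carry `bits` bits."""
    return math.ceil(bits / math.log2(alphabet_size))


hbits = id_bits(ubits=20, sbits=20)      # 59
base32_chars = chars_needed(hbits, 32)   # 12
base62_chars = chars_needed(hbits, 62)   # 10
p = collision_probability(2**20, 2**hbits)  # 2**-20, about one in a million
```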

Adding in the vatIDs and kernelIDs (for which we might assume num(chains) + num(solos) is also about 1M, thus needing 59-bit kernel IDs too), and inserting k and m prefixes for the kernelID/messageID for good measure, the full identifier might look like:

{ type: 'syscall', ksc: ['send', 'ko12', { methargs, id: 'kN9knxZKJP0:v12:mCkQV8Uxz8v', result: 'kp34' } ] }

This reveals the vatID to other kernels, which reveals the number of vats that have been created, which is another thing a kernel might want to conceal. The kernel could use a similar technique to generate an opaque vat identifier string by hashing the per-kernel secret and the sequential v12 vatID counter. Local slog entries already have vatID fields, so there's no particular utility to making the vatID inside the messageID be human-readable (it merely needs to be sufficiently unique). Assuming 1M vats per kernel, the identifiers would then look something like kN9knxZKJP0:vjVzl1d8EGH:mCkQV8Uxz8v.
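One way to produce identifiers of that shape is to keep the top 59 bits of a keyed hash and encode them as 10 base62 characters. This is a sketch under assumptions: the helper names and the `opaqueVatID:` label are hypothetical, and a plain HMAC-SHA256 stands in for the KDF.

```python
import hashlib
import hmac
import string

BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def short_id(secret: bytes, label: str, bits: int = 59, width: int = 10) -> str:
    """Keyed hash of `label`, truncated to `bits` bits, as `width` base62 chars."""
    digest = hmac.new(secret, label.encode(), hashlib.sha256).digest()
    value = int.from_bytes(digest, "big") >> (256 - bits)  # keep the top bits
    chars = []
    for _ in range(width):  # 62**10 > 2**59, so 10 characters hold the value
        value, rem = divmod(value, 62)
        chars.append(BASE62[rem])
    return "".join(reversed(chars))


def full_message_id(kernel_secret: bytes, vat_id: str, vat_message_id: str) -> str:
    """Combine opaque kernel and vat parts with the vat's own message part."""
    kernel_part = "k" + short_id(kernel_secret, "kernelUniqueID")
    vat_part = "v" + short_id(kernel_secret, "opaqueVatID:" + vat_id)
    return f"{kernel_part}:{vat_part}:m{vat_message_id}"


example = full_message_id(b"\x07" * 32, "v12", "CkQV8Uxz8v")
```

The sequential `v12` never appears in the output, so an outside party cannot count vats from it.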

Security Considerations

Confidentiality of private vat state (message count).

Confidentiality of private kernel state (vat counts).

Spoofability of messages: to make correlation through the comms vat useful, the comms vat needs to be able to fully specify the messageID when it does a syscall.send (comms will extract this from the inbound from-some-other-kernel message it processes, so the receiving kernel's slog will observe events with a foreign kernel's kernelID). That implies that vats in general can choose their own messageID values, which raises the possibility of one vat pretending to be a different vat when it emits messages, just to confuse an analysis tool that might later look at the message sequence. I think the only way to prevent this is to have vats emit just a messageID (omitting the kernel/vat parts), have the kernel impose the extra components, and give the comms vat extra spoofing powers.
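The kernel-side rule suggested above might look like the following sketch. The function name, the `is_comms` flag, and the colon-counting check are all hypothetical, chosen only to illustrate the policy of stamping ordinary vats while letting comms relay a foreign ID.

```python
def stamp_message_id(kernel_id: str, vat_id: str, vat_supplied_id: str,
                     is_comms: bool = False) -> str:
    """Return the messageID the kernel records for a vat's syscall.

    Ordinary vats supply only the bare message part; the kernel imposes
    the kernel and vat components. Only comms may supply a complete
    three-part ID, copied from an inbound message from another kernel.
    """
    if is_comms and vat_supplied_id.count(":") == 2:
        return vat_supplied_id  # relayed foreign ID, kept intact for correlation
    if ":" in vat_supplied_id:
        raise ValueError("only the comms vat may supply a full messageID")
    return f"{kernel_id}:{vat_id}:{vat_supplied_id}"


local = stamp_message_id("kAAA", "v12", "mCkQV8Uxz8v")
relayed = stamp_message_id("kAAA", "v3", "kBBB:v7:mXYZ", is_comms=True)
```

With this rule an ordinary vat cannot claim another vat's prefix, while the comms vat remains trusted not to abuse its extra power.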

Test Plan

Unit tests on liveslots, the kernel run-queue, and comms.


mhofman · 2022-10-27T00:26:17Z

The biggest issues with matching on promise results are:

- missing results for sendOnly (made by devices, since liveslots is currently unable to perform sendOnly)
- reflected sends on pipelining vats, which will cause multiple syscalls and deliveries with the same promise result, but potentially different targets (and technically it could be different messages)

I think within a SwingSet it's totally fine for a unique send id to be generated by the kernel. Regular vats probably don't even need to be aware of this id. Pipelining vats would be made aware of them to cooperate on send reflection. A couple days ago, I believe you mentioned a clist-based approach for these send ids. Within a kernel it's probably fine to trust pipelining vats to handle the send id correctly. However, tracking reflected sends across the comms boundary is where it gets interesting, and where a clist approach for send ids may be necessary.

warner added enhancement New feature or request SwingSet package: SwingSet labels Oct 27, 2022

Tartuffo added migrate-new-issues and removed migrate-new-issues labels Nov 17, 2022

ivanlei added the vaults_triage DO NOT USE label Jan 17, 2023

dckc mentioned this issue Aug 23, 2023

feat(swingset): slogfile visualization: PlantUML, causeway (WIP) #3624

Draft

mhofman mentioned this issue Aug 24, 2023

Execution context to enable flow-like scheduler #7875

Open

mhofman added the telemetry label Sep 5, 2023
