Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mailbox device nondeterminism after kernel restart #3471

Closed
warner opened this issue Jul 13, 2021 · 0 comments · Fixed by #3492
Closed

mailbox device nondeterminism after kernel restart #3471

warner opened this issue Jul 13, 2021 · 0 comments · Fixed by #3492
Assignees
Labels
bug Something isn't working SwingSet package: SwingSet

Comments

@warner
Copy link
Member

warner commented Jul 13, 2021

Describe the bug

@dckc @mhofman and @warner have been analyzing the slogfiles collected from the agorictest-16 testnet (the "stress test" phase of the incentivized testnet). We found a divergence between two slogfiles, in which one validator observed an additional deliverInboundAck delivery (to vat-vattp), which was not in the record of a second validator. As a result, the global crankNums were forevermore off-by-one in that validator, and the vattp deliveryNum was also off-by-one.

I think this is a consequence of the mailbox device keeping the highestInboundAck state in RAM, rather than in durable storage.

SwingSet Mailbox Device Communication Protocol

The mailbox device follows the Waterken approach to durable delivery of messages: send forever until acked. When the mailbox has a connection-oriented link to some remote, we don't need to deliver a message or an ack more than once per connection, but we also can't afford to not send either when a new connection is established. So the process is:

  • upon process start, for each remote, try to establish a TCP connection
  • when a TCP connection is established, send all unacked messages as well as the current ack number (the highest seqnum of all processed/accepted inbound messages)
  • if an ack is received, remove those messages from our outbound list
  • if a message is received and accepted (durably), update our ack number and send it
  • if an outbound message is generated, add it to the unacked pool, and send it (along with the current ack number)
  • if the TCP connection drops, initiate a new connection attempt

This approach leads to the minimum traffic that still meets the goal of delivering and acking message as quickly as possible. If promptness weren't the goal, we could reduce the traffic slightly by not preemptively sending the ack number (and just waiting for the next outbound message to include it). The idea was that ack numbers are pretty small, so sending an extra copy at process startup (or TCP connection establishment) isn't expensive. A more drastic change would be to omit the preemptive message send too: don't send anything until the next outbound message is generated. This would run the risk of stalling messages for arbitrary amounts of time, which would be sad.

Mailbox device deduplication

This preemptive/precautionary re-transmission of messages+acks feels necessary but wasteful. I attempted to mitigate the overhead a little bit by having the mailbox device perform an initial deduplication of messages:

if (num > latestMsg) {
newMessages.push([num, `${hMsg}`]);
latestMsg = num;
highestInboundDelivered.set(peer, latestMsg);
}
});
if (newMessages.length) {
deliverInboundMessages(peer, harden(newMessages));
didSomething = true;
}

and the ack:

let latestAck = 0;
if (highestInboundAck.has(peer)) {
latestAck = highestInboundAck.get(peer);
}
if (ack > latestAck) {
highestInboundAck.set(peer, ack);
deliverInboundAck(peer, ack);

If the new message and/or ack passes this test, the mailbox device will enqueue a deliverInboundMessages() and/or deliverInboundAck() delivery to vat-vattp.

The mailbox state that remembers the inbound message and ack numbers is not durable: it is kept in a simple Map:

export function buildRootDeviceNode(tools) {
const { SO, getDeviceState, setDeviceState, endowments } = tools;
const highestInboundDelivered = harden(new Map());
const highestInboundAck = harden(new Map());

The VatTP vat also does a deduplication check, because the mailbox state (which remembers these sequence numbers) is not durable. So when the receiving swingset is restarted, the next arrival of a message from the outside world might not be deduplicated. So VatTP must always be prepared to dedup things itself. The comms vat receives a strictly-sequenced no-duplicates stream of messages (and therefore doesn't even get to see the vattp sequence numbers).

solo-to-chain protocol

The TCP approach only really makes sense for solo-to-solo connections, which are run over TCP of some sort. For the solo-to-chain pathway, we don't have connections, instead we have a series of signed messages (cosmos transactions) that contain the same { messages: [], ackNum } that the TCP payloads would have included.

However the delivery code was written with TCP in mind, so when the solo machine comes up, it sends all its unacked messages (and the inboundAckNum) before doing anything else, just in case the last time it sent these, the message didn't get through.

(TODO number 1: find a way to change the solo-to-chain delivery protocol to read the chain's outbound ack num first, from an RPC query, without requiring a transaction, instead of making an extra transaction per client restart. Or, switch to IBC, which does something very much like this already)

So when the client restarts, we'll get an extra signed transaction, which will probably contain zero messages, plus an inboundAckNum.

chain-side Mailbox device deduplication is not deterministic

Now imagine two validators which are started at time T=0, and a client which starts up some time later and sends a bunch of messages. The responses come back to the client, and the last thing the client does is to ack the last response, causing a { messages: [], ackNum: 5 } at time T=10.

Now say the client is restarted at T=20, so (as described above), the client will re-send this same message. Our cosmos signed transaction stream will contain a txn at T=10, and a second at T=20, both with no mailbox messages and with an ackNum of "5".

Let's say the "left" validator remains running the whole time. When the T=10 txn arrives, the Mailbox device will see ackNum: 5, compare it to its highestInboundAck (which would probably be "4"), conclude that vat-vattp needs to see this, and use SO(inboundHandler).deliverInboundAck(peer, ack) to enqueue a message to vattp. Then, later, when the T=20 txn arrives, the Mailbox device still remembers highestInboundAck=5, and ignores the message, not enqueueing anything.

But let's say the "right" validator was restarted at T=15. The T=10 txn was already delivered to vattp. But when the T=20 txn arrives, highestInboundAck is empty (because the Map that held it went away when the validator was restarted). So the mailbox device will enqueue a deliverInboundAck to vat-vattp.

This leads to a divergence between the two validators: the "right" validator will see an extra run-queue entry for the extra deliverInboundAck message. This will cause all subsequent crankNum values to be higher by one, as well as causing the vat-vattp deliveryNum values to be higher by one.

Since vat-vattp does its own deduplication, the deliverInboundAck delivery should not create any syscalls, and the effects will not propogate to other vats. Our current scheduling policy is strictly FIFO, and our current block-filling policy is effectively "drain the run-queue entirely" (there's a 1000-crank limit but we haven't observed that being reached yet), so the extra delivery is performed at the beginning of the block, and doesn't cause any other deliveries to be pushed out of the block. So the divergence is "hidden": it doesn't affect SwingSet's behavior enough for the chain to notice. We only discovered it by comparing slogfiles between validators that experienced different patterns of restarts, in particular a restart that occurred between two duplicate client-driven transactions.

The Fix

We have two approaches we could use.

The first, and simplest, is to just remove the deduplication code from the Mailbox device. The code should simply sort the inbound messages and deliver all of them to vattp, then deliver the ackNum to vattp as well. The two Maps should be deleted. The code will always return didSomething=true, which means the host will always be told to run the kernel after the potential delivery. This causes more work to be done in the case of duplicate deliveries, but the swingset work will be a deterministic function of the Mailbox input, which is important.

The second approach would be to track the Mailbox device's state durably. Devices get a getDeviceState/setDeviceState authority for this purpose, and other devices (e.g. the timer device) use them properly, but the Mailbox device is older and does not. We'd need to store the full contents of highestInboundDelivered and highestInboundAck in durable storage, instead of the Map. At device startup, we'd call getDeviceState() and populate an in-memory map. After every inbound message, we'd update the map with the new sequence/ack numbers, convert the Map back into something that marshal could serialize, and then write it all back to setDeviceState().

(This is messier than what the comms vat must do, because vats have syscall.vatStoreSet(key, value), which takes multiple keys. Devices are much more primitive. We could make the second approach easier and faster by introducing vatStore syscalls to devices. Then it could record each seqnum/acknum record in a separate key, instead of needing to smoosh everything together into the single setDeviceState() body).

I'm pretty sure we'll do the first approach: deleting code is always more reliable than adding.

@warner warner added bug Something isn't working SwingSet package: SwingSet labels Jul 13, 2021
@warner warner added this to the Testnet: Metering Phase milestone Jul 13, 2021
@warner warner self-assigned this Jul 13, 2021
warner added a commit that referenced this issue Jul 19, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
warner added a commit that referenced this issue Jul 22, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
warner added a commit that referenced this issue Jul 22, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
warner added a commit that referenced this issue Jul 22, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
warner added a commit that referenced this issue Jul 22, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
warner added a commit that referenced this issue Jul 22, 2021
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant