
vattp sometimes drops messages? #3039

Open
warner opened this issue May 5, 2021 · 11 comments
Assignees: warner
Labels: bug (Something isn't working), SwingSet (package: SwingSet), vaults_triage (DO NOT USE), xsnap (the XS execution tool)

Comments

warner commented May 5, 2021

Describe the bug

#3021 seems to be caused by vattp sometimes dropping some messages. We're investigating.

warner added the bug (Something isn't working) and SwingSet (package: SwingSet) labels May 5, 2021
warner self-assigned this May 5, 2021
warner added a commit that referenced this issue May 5, 2021
We're seeing something weird, where vattp is dropping messages randomly,
which could cause a consensus failure. So we're going to pin it to a local
worker for now, instead of letting it run on XS.

refs #3039
michaelfig pushed a commit that referenced this issue May 5, 2021
We're seeing something weird, where vattp is dropping messages randomly,
which could cause a consensus failure. So we're going to pin it to a local
worker for now, instead of letting it run on XS.

refs #3039

warner commented May 5, 2021

On branch warner-phase3-loadgen (ac4cb3983dbd38354e048ca500deeffd4ce85274), which is trunk at f052169 plus some logging fixes (and running all non-comms vats on XS), I saw the load generator halt when the chain stopped accepting messages. Looking further, the comms vat had rejected a message whose internal sequence number was not one greater than that of the previously-received message. By examining the slogfile, we found the comms vat was handed the following sequence of deliveries:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v13" and .kd[0] == "message" and .kd[2].method == "receive") | .kd[2].args.body|fromjson | .[0]' | cut -c-500

"717:644:resolve:fulfill:rp-752:ro+237;{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}"
"718:648:deliver:ro+24:getId:rp-757:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"719:648:deliver:ro+71:deposit:rp-758:ro+240;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}]"
"720:648:deliver:ro+70:deposit:rp-759:ro+239;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}]"
"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]"
"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]"
"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]"
"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]"
"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1858053\"}}]"
"730:657:deliver:ro+24:getId:rp-771:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"731:657:deliver:ro+24:getId:rp-772:ro+61;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0}]"
"732:657:deliver:ro+70:getCurrentAmount:rp-773;[]"
"733:657:deliver:ro+77:getUpdateSince:rp-774;[48]"
"734:657:resolve:fulfill:rp-768:ro+241;{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}"
"735:657:resolve:reject:rp-767;{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v2#70001\",\"message\":\"\\\"unknown\\\" connection closed\",\"name\":\"Error\"}"

The gap between 725 and 730 is the proximate cause: comms was expecting 726 but received 730. All further messages will be rejected until it sees 726.
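
For reference, a rough sketch of the kind of check comms performs (assumed shape and hypothetical helper name, not the actual comms-vat source): each inbound message body starts with its own sequence number, and any gap causes that message, and everything after it, to be rejected until the missing one arrives.

  // Hypothetical illustration of the comms-side sequence check.
  function checkInbound(expectedSeqNum, messageBody) {
    const seqNum = Number(messageBody.split(':')[0]); // e.g. "730:657:deliver:..."
    if (seqNum !== expectedSeqNum) {
      // here: comms expected 726 but got 730, so 730..735 are all rejected
      throw Error(`out of sequence: expected ${expectedSeqNum}, got ${seqNum}`);
    }
    return seqNum + 1; // next expected sequence number
  }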

Looking at the vat-tp deliveries in the same slogfile, we see it received deliverInboundMessages calls with batches of the following envelope (outer) sequence numbers:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v14" and .kd[0] == "message" and .kd[2].method == "deliverInboundMessages")' | \
 jq -c '.kd[2].args.body|fromjson | .[1] | map(.[0].digits)'

["709","710","711","712","713"]
["714","715","716","717"]
["718","719","720"]
["721","722","723","724","725","726","727","728","729"]
["730","731","732","733","734"]
["735"]

The crank which delivered messages 721-729 emitted the following syscalls:

$ cat comms-halt/chain-1.slog |jq -c 'select(.crankNum == 5863 and .type == "syscall") | .vsc'

["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"721\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"722\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"723\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"724\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"725\"}]","slots":[]}]
["send","o-50",{"method":"receive","args":{"body":"[\"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]\"]","slots":[]},"result":"p+725"}]
["subscribe","p+725"]
["send","o-50",{"method":"receive","args":{"body":"[\"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]\"]","slots":[]},"result":"p+726"}]
["subscribe","p+726"]
["send","o-50",{"method":"receive","args":{"body":"[\"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]\"]","slots":[]},"result":"p+727"}]
["subscribe","p+727"]
["send","o-50",{"method":"receive","args":{"body":"[\"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]\"]","slots":[]},"result":"p+728"}]
["subscribe","p+728"]
["send","o-50",{"method":"receive","args":{"body":"[\"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\\\"brand\\\":{\\\"@qclass\\\":\\\"slot\\\",\\\"iface\\\":\\\"Alleged: RUN brand\\\",\\\"index\\\":0},\\\"value\\\":{\\\"@qclass\\\":\\\"bigint\\\",\\\"digits\\\":\\\"1858053\\\"}}]\"]","slots":[]},"result":"p+729"}]
["subscribe","p+729"]

The code which executes this crank is:

// vat-tp: deliver a batch of inbound mailbox messages, acking each one
deliverInboundMessages(name, newMessages) {
  const i = getRemote(name).inbound;
  newMessages.forEach(m => {
    const [num, body] = m;
    if (num > i.highestDelivered) {
      // TODO: SO() / sendOnly()
      // console.debug(`receive from ${name}[${num}]: ${body}`);
      E(i.receiver).receive(body);
      i.highestDelivered = num;
      D(mailbox).ackInbound(name, num);
    }
  });
}
The syscalls suggest that the loop was interrupted somewhere after the D(mailbox).ackInbound on the loop pass for message number 725, and before the E(i.receiver) on the following pass for message number 726. If it had made it past that E(i.receiver) we'd expect to see a syscall.send() for 726 (although if the worker were somehow interrupted, or the crank ended early, that syscall might not make it out). If it had made it even further, to the D(mailbox) for 726, we'd expect to see the additional callNow, because these are executed right away (whereas E() schedules a syscall.send on some subsequent promise-queue turn).
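
To illustrate why all five ackInbound callNows appear before any of the receive sends in the trace above, here is a simplified sketch (assumed semantics and stand-in definitions, not the real liveslots internals): D() issues its syscall synchronously during the loop, while E() only enqueues the send for a later promise-queue turn.

  // Stand-ins for D() and E(); running this under Node logs all "callNow"
  // entries before any "send" entries, matching the syscall ordering above.
  const log = [];
  const D = _device => ({
    // immediate: the blocking syscall.callNow happens right away
    ackInbound: (_name, num) => log.push(`callNow ackInbound ${num}`),
  });
  const E = _target => ({
    // deferred: the syscall.send happens on a later promise-queue turn
    receive: body => Promise.resolve().then(() => log.push(`send receive ${body}`)),
  });

  for (const [num, body] of [[721, 'm721'], [722, 'm722']]) {
    E('o-50').receive(body);
    D('mailbox').ackInbound('remote1', num);
  }
  setImmediate(() => console.log(log)); // callNows first, then sends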

@FUDCo analyzed a similar trace whose deliverInboundMessages was given a batch of size 1, and it failed to emit any syscalls. This suggests the fault happened before the E() call, rather than after the D().

We're struggling to imagine something that could cause this. We've brainstormed:

  • If a batch of messages were not internally ordered, the (num > i.highestDelivered) check would bypass some of them (we rely upon the sender to send internally ordered batches, and to send those batches in order, but we tolerate replays); if that were the case, though, we would expect to see the syscalls happening in a different order, not missing entirely
  • Something clobbers i.highestDelivered during the loop, causing the later messages to be skipped. There are no awaits in this code, so there's no obvious reentrancy hazard: E() queues behavior for later, and D() causes an immediate blocking syscall
  • My wild idea is that the blocking syscall.callNow invoked by D(), as implemented by xsnap (it turns into a blocking read on the message pipe), is somehow interacting badly with the setImmediate we use inside liveslots to sense end-of-crank, making liveslots think the crank is done early, so the syscalls for the remaining four messages get lost (see the sketch after this list)
  • Some sort of metering fault might have interrupted the vat, but the deliver-results showed a normal crank result, not a meter-exhaustion error
  • If num and i.highestDelivered weren't both BigInts, I could imagine a lexicographic (string) comparison breaking things, but the comparison here was between 726 and 725, which orders the same way whether compared as strings or as numbers (unlike, say, 9 vs 10)
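
For context on that last wild idea about end-of-crank sensing, here is a highly simplified sketch (assumed shape and hypothetical helper name, not the actual supervisor/liveslots source): a setImmediate callback is treated as the signal that the promise queue has drained and the crank is finished. The concern is that a blocking pipe read inside the device callNow might somehow let that callback fire before the remaining forEach passes have queued their sends.

  // Sketch only: treat "the setImmediate fired" as "the crank is done".
  function waitUntilQuiescent() {
    return new Promise(resolve => {
      setImmediate(resolve); // fires once the promise/microtask queue has drained
    });
  }

  async function runCrank(deliver) {
    deliver(); // synchronous vat code, e.g. the forEach loop in vattp
    await waitUntilQuiescent();
    // past this point, any syscalls the vat still tries to make would be lost
  }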

warner commented May 6, 2021

@FUDCo and I are copying the kernelDB from a chain run that hit this failure, and are attempting to replay the swingset transcript for the vat in question, to see if it does the same thing upon replay. If so, it suggests that, whatever is going on, it's at least a function of the state of that vat.

The particular failure we're investigating is in a run we've named bug3021-05may. It occurs in a deliverInboundMessages call to vattp which receives a batch of seqnums 9239+9240+9241, but only emits syscalls for 9239 before finishing.

The "in-vivo" replace I did (by just restarting the chain, after adding a log message to show when syscalls are being made during replay, and replacing the kernelBundle key of the DB so the addition would take effect) shows that the missing syscalls are also missing during replay. Yay! This suggests that we might be able to reproduce it "in-vitro": @FUDCo is working on that with swingset-runner. That'd be great, because restarting the chain with it's current DB takes 18 minutes, and swingset-runner ought to be somewhat faster. Especially if we modify the DB a little more and make it forget about the vats we don't care about.

warner commented May 6, 2021

We're able to reproduce the problem in a somewhat-reduced test environment, where we copy the kernelDB from the failing chain, strip out all the vats except for vattp, and re-run it under swingset-runner. This takes about 2.5min to reproduce, whereas the original chain needed 12 hours of once-a-minute traffic to break.

The problem manifests in a simple forEach loop, working on an Array of either 1 or 3 entries (which came from marshal, so ultimately from JSON.parse), which sometimes exits early for no good reason.

The fun part is that the crank which exhibits the error changes as a function of the code that xsnap is given (supervisor, liveslots, and vat code). We updated our transcript replay code to sense not only when a vat makes a syscall which differs from the ones recorded in the transcript, but also when it fails to make one of the recorded ones. If we run the original code, we see no transcript mismatches, because the crank that showed the error manages to show exactly the same error in the replay. If we modify the code a bit, and the bug happens during some different (earlier) crank, then we see fewer syscalls than we expected, and the replay terminates early with an error.
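
A minimal sketch of that replay check (assumed data shapes and a hypothetical function name, not the actual replay code): compare each syscall the replaying vat makes against the transcript entry at the same position, and also complain if the replay stops short of the recorded count.

  function compareReplayToTranscript(recordedSyscalls, replayedSyscalls) {
    replayedSyscalls.forEach((vsc, i) => {
      if (JSON.stringify(vsc) !== JSON.stringify(recordedSyscalls[i])) {
        throw Error(`syscall ${i} diverged from the transcript`);
      }
    });
    if (replayedSyscalls.length < recordedSyscalls.length) {
      // e.g. only the seqnum-9239 syscalls were emitted out of 9239..9241
      throw Error(
        `replay made ${replayedSyscalls.length} syscalls, transcript has ${recordedSyscalls.length}`,
      );
    }
  }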

The good news is that, from what I can tell, it's a deterministic function of that code, so at least the bug is stable, but it tends to move or go away if you add debugging code. It seems to be insensitive to whitespace.

Our current theory is some sort of memory allocation bug, maybe during GC. Maybe something which is causing the short array to get truncated, or to corrupt the variable holding i.highestDelivered. It took a long time to manifest (12 hours, tens of thousands of messages), and happened in the one XS vat that performs the most work, which means there was more allocation (and more of everything) there.

dckc added the xsnap (the XS execution tool) label May 6, 2021

dckc commented May 6, 2021

The most concise way I have found for packaging up such things is a stand-alone Makefile that requires only that node and yarn are installed; for example:

https://gist.github.com/dckc/19451649697bb7cb231ab5a776fde0d1#file-makefile

p.s. This assumes the bug will show up on machines other than your own. We've seen bugs where the symptoms were hard to reproduce on another machine; the culprit seemed to be that paths were different between machines and the paths made their way into the runtime heap.

warner commented May 7, 2021

Yeah, that's where I'll aim. On @FUDCo's advice, I'm going to start big+fast (à la "I regret this letter is so long, I did not have the time to make it shorter"), and defer refinement/reduction until after we've delivered something sufficient for them to reproduce it.

dckc commented May 13, 2021

When I tried on Monday to reproduce this, I got "anachrophobia strikes vat v14"; I learned that this means I did not manage to reproduce it.

dckc commented May 17, 2021

@warner Moddable sent us a WeakMap GC fix; I checked that in as agoric-labs/moddable@c533fb7 (on the https://github.com/agoric-labs/moddable/tree/weak-gc-fix branch) and I created a https://github.com/Agoric/agoric-sdk/tree/3039-vattp-xs-gc branch that is master as of May 5, noted above (f052169), with the moddable submodule set to the GC fix.

Care to try it out? Or maybe I could figure out how...

erights commented May 27, 2021

Which weakmap gc problem is this supposed to fix?

dckc commented May 28, 2021

Which weakmap gc problem is this supposed to fix?

Er... The one where vattp sometimes drops messages, as detailed at length above.

rowgraus commented Jun 21, 2021

In today's testnet meeting:
BW: we have a change that's supposed to fix this but we haven't verified it
DC: right; I think it's fixed but I'm not certain.
BW: probably ok to release without further work on this

Deprioritizing for now

rowgraus removed this from the Testnet: Stress Test Phase milestone Jun 21, 2021

dckc commented Aug 2, 2021

@warner do I have access to the bug3021-05may logs? If not, please use the 1st key in https://github.com/dckc.keys to grant me access?

And please sketch how to use swingset-runner to reproduce the problem "in vitro"?
