
vattp sometimes drops messages? #3039

Open
warner opened this issue May 5, 2021 · 11 comments
Assignees: warner
Labels: bug (Something isn't working), SwingSet (package: SwingSet), vaults_triage (DO NOT USE), xsnap (the XS execution tool)

Comments

warner commented May 5, 2021

Describe the bug

#3021 seems to be caused by vattp sometimes dropping some messages. We're investigating.

warner added the bug (Something isn't working) and SwingSet (package: SwingSet) labels May 5, 2021
warner self-assigned this May 5, 2021
warner added a commit that referenced this issue May 5, 2021
We're seeing something weird, where vattp is dropping messages randomly,
which could cause a consensus failure. So we're going to pin it to a local
worker for now, instead of letting it run on XS.

refs #3039
michaelfig pushed a commit that referenced this issue May 5, 2021
We're seeing something weird, where vattp is dropping messages randomly,
which could cause a consensus failure. So we're going to pin it to a local
worker for now, instead of letting it run on XS.

refs #3039

warner commented May 5, 2021

On branch warner-phase3-loadgen (ac4cb3983dbd38354e048ca500deeffd4ce85274), which is trunk at f052169 plus some logging fixes (and running all non-comms vats on XS), I saw the load generator halt when the chain stopped accepting messages. Looking further, the comms vat had rejected a message whose internal sequence number was not one greater than that of the previously-received message. By examining the slogfile, we found the comms vat was handed the following sequence of deliveries:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v13" and .kd[0] == "message" and .kd[2].method == "receive") | .kd[2].args.body|fromjson | .[0]' | cut -c-500

"717:644:resolve:fulfill:rp-752:ro+237;{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}"
"718:648:deliver:ro+24:getId:rp-757:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"719:648:deliver:ro+71:deposit:rp-758:ro+240;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}]"
"720:648:deliver:ro+70:deposit:rp-759:ro+239;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}]"
"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]"
"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]"
"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]"
"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]"
"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\"brand\":{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0},\"value\":{\"@qclass\":\"bigint\",\"digits\":\"1858053\"}}]"
"730:657:deliver:ro+24:getId:rp-771:ro+63;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD brand\",\"index\":0}]"
"731:657:deliver:ro+24:getId:rp-772:ro+61;[{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN brand\",\"index\":0}]"
"732:657:deliver:ro+70:getCurrentAmount:rp-773;[]"
"733:657:deliver:ro+77:getUpdateSince:rp-774;[48]"
"734:657:resolve:fulfill:rp-768:ro+241;{\"@qclass\":\"slot\",\"iface\":\"Alleged: RUN payment\",\"index\":0}"
"735:657:resolve:reject:rp-767;{\"@qclass\":\"error\",\"errorId\":\"error:liveSlots:v2#70001\",\"message\":\"\\\"unknown\\\" connection closed\",\"name\":\"Error\"}"

The gap between 725 and 730 is the proximate cause: comms was expecting 726 but received 730. All further messages will be rejected until it sees 726.
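
For reference, a rough sketch of the kind of check comms performs (assumed shape and hypothetical helper name, not the actual comms-vat source): each inbound message body starts with its own sequence number, and any gap causes that message, and everything after it, to be rejected until the missing one arrives.

  // Hypothetical illustration of the comms-side sequence check.
  function checkInbound(expectedSeqNum, messageBody) {
    const seqNum = Number(messageBody.split(':')[0]); // e.g. "730:657:deliver:..."
    if (seqNum !== expectedSeqNum) {
      // here: comms expected 726 but got 730, so 730..735 are all rejected
      throw Error(`out of sequence: expected ${expectedSeqNum}, got ${seqNum}`);
    }
    return seqNum + 1; // next expected sequence number
  }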

Looking at the vat-tp deliveries in the same slogfile, we see it received deliverInboundMessages calls with batches of the following envelope (outer) sequence numbers:

$ cat comms-halt/chain-1.slog| \
 jq -c 'select(.type == "deliver" and .vatID == "v14" and .kd[0] == "message" and .kd[2].method == "deliverInboundMessages")' | \
 jq -c '.kd[2].args.body|fromjson | .[1] | map(.[0].digits)'

["709","710","711","712","713"]
["714","715","716","717"]
["718","719","720"]
["721","722","723","724","725","726","727","728","729"]
["730","731","732","733","734"]
["735"]

The crank which delivered messages 721-729 emitted the following syscalls:

$ cat comms-halt/chain-1.slog |jq -c 'select(.crankNum == 5863 and .type == "syscall") | .vsc'

["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"721\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"722\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"723\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"724\"}]","slots":[]}]
["callNow","d-70","ackInbound",{"body":"[\"agoric170hlq9fmccejshex7xc3uvmzxj6ue5hawldamf\",{\"@qclass\":\"bigint\",\"digits\":\"725\"}]","slots":[]}]
["send","o-50",{"method":"receive","args":{"body":"[\"721:653:deliver:ro+71:getCurrentAmount:rp-760;[]\"]","slots":[]},"result":"p+725"}]
["subscribe","p+725"]
["send","o-50",{"method":"receive","args":{"body":"[\"722:653:deliver:ro+78:getUpdateSince:rp-761;[48]\"]","slots":[]},"result":"p+726"}]
["subscribe","p+726"]
["send","o-50",{"method":"receive","args":{"body":"[\"723:653:deliver:ro+70:getCurrentAmount:rp-762;[]\"]","slots":[]},"result":"p+727"}]
["subscribe","p+727"]
["send","o-50",{"method":"receive","args":{"body":"[\"724:653:deliver:ro+77:getUpdateSince:rp-763;[47]\"]","slots":[]},"result":"p+728"}]
["subscribe","p+728"]
["send","o-50",{"method":"receive","args":{"body":"[\"725:653:deliver:ro+70:withdraw:rp-764:ro+61;[{\\\"brand\\\":{\\\"@qclass\\\":\\\"slot\\\",\\\"iface\\\":\\\"Alleged: RUN brand\\\",\\\"index\\\":0},\\\"value\\\":{\\\"@qclass\\\":\\\"bigint\\\",\\\"digits\\\":\\\"1858053\\\"}}]\"]","slots":[]},"result":"p+729"}]
["subscribe","p+729"]

The code which executes this crank is:

// vat-tp: deliver a batch of inbound mailbox messages, acking each one
deliverInboundMessages(name, newMessages) {
  const i = getRemote(name).inbound;
  newMessages.forEach(m => {
    const [num, body] = m;
    if (num > i.highestDelivered) {
      // TODO: SO() / sendOnly()
      // console.debug(`receive from ${name}[${num}]: ${body}`);
      E(i.receiver).receive(body);
      i.highestDelivered = num;
      D(mailbox).ackInbound(name, num);
    }
  });
}
The syscalls suggest that the loop was interrupted somewhere after the D(mailbox).ackInbound on the loop pass for message number 725, and before the E(i.receiver) on the following pass for message number 726. If it had made it past that E(i.receiver) we'd expect to see a syscall.send() for 726 (although if the worker were somehow interrupted, or the crank ended early, that syscall might not make it out). If it had made it even further, to the D(mailbox) for 726, we'd expect to see the additional callNow, because these are executed right away (whereas E() schedules a syscall.send on some subsequent promise-queue turn).
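
To illustrate why all five ackInbound callNows appear before any of the receive sends in the trace above, here is a simplified sketch (assumed semantics and stand-in definitions, not the real liveslots internals): D() issues its syscall synchronously during the loop, while E() only enqueues the send for a later promise-queue turn.

  // Stand-ins for D() and E(); running this under Node logs all "callNow"
  // entries before any "send" entries, matching the syscall ordering above.
  const log = [];
  const D = _device => ({
    // immediate: the blocking syscall.callNow happens right away
    ackInbound: (_name, num) => log.push(`callNow ackInbound ${num}`),
  });
  const E = _target => ({
    // deferred: the syscall.send happens on a later promise-queue turn
    receive: body => Promise.resolve().then(() => log.push(`send receive ${body}`)),
  });

  for (const [num, body] of [[721, 'm721'], [722, 'm722']]) {
    E('o-50').receive(body);
    D('mailbox').ackInbound('remote1', num);
  }
  setImmediate(() => console.log(log)); // callNows first, then sends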

@FUDCo analyzed a similar trace whose deliverInboundMessages was given a batch of size 1, and it failed to emit any syscalls. This suggests the fault happened before the E() call, rather than after the D().

We're struggling to imagine something that could cause this. We've brainstormed:

  • If a batch of messages were not internally ordered, the (num > i.highestDelivered) check would bypass some of them (we rely upon the sender to send internally ordered batches, and to send those batches in order, but we tolerate replays); if that were the case, though, we would expect to see the syscalls happening in a different order, not missing entirely
  • Something clobbers i.highestDelivered during the loop, causing the later messages to be skipped. There are no awaits in this code, so there's no obvious reentrancy hazard: E() queues behavior for later, and D() causes an immediate blocking syscall
  • My wild idea is that the blocking syscall.callNow invoked by D(), as implemented by xsnap (it turns into a blocking read on the message pipe), is somehow interacting badly with the setImmediate we use inside liveslots to sense end-of-crank, making liveslots think the crank is done early, so the syscalls for the remaining four messages get lost (see the sketch after this list)
  • Some sort of metering fault might have interrupted the vat, but the deliver-results showed a normal crank result, not a meter-exhaustion error
  • If num and i.highestDelivered weren't both BigInts, I could imagine a lexicographic (string) comparison breaking things, but the comparison here was between 726 and 725, which orders the same way whether compared as strings or as numbers (unlike, say, 9 vs 10)
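
For context on that last wild idea about end-of-crank sensing, here is a highly simplified sketch (assumed shape and hypothetical helper name, not the actual supervisor/liveslots source): a setImmediate callback is treated as the signal that the promise queue has drained and the crank is finished. The concern is that a blocking pipe read inside the device callNow might somehow let that callback fire before the remaining forEach passes have queued their sends.

  // Sketch only: treat "the setImmediate fired" as "the crank is done".
  function waitUntilQuiescent() {
    return new Promise(resolve => {
      setImmediate(resolve); // fires once the promise/microtask queue has drained
    });
  }

  async function runCrank(deliver) {
    deliver(); // synchronous vat code, e.g. the forEach loop in vattp
    await waitUntilQuiescent();
    // past this point, any syscalls the vat still tries to make would be lost
  }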

warner commented May 6, 2021

@FUDCo and I are copying the kernelDB from a chain run that hit this failure, and are attempting to replay the swingset transcript for the vat in question, to see if it does the same thing upon replay. If so, it suggests that, whatever is going on, it's at least a function of the state of that vat.

The particular failure we're investigating is in a run we've named bug3021-05may. It occurs in a deliverInboundMessages call to vattp which receives a batch of seqnums 9239+9240+9241, but only emits syscalls for 9239 before finishing.

The "in-vivo" replace I did (by just restarting the chain, after adding a log message to show when syscalls are being made during replay, and replacing the kernelBundle key of the DB so the addition would take effect) shows that the missing syscalls are also missing during replay. Yay! This suggests that we might be able to reproduce it "in-vitro": @FUDCo is working on that with swingset-runner. That'd be great, because restarting the chain with it's current DB takes 18 minutes, and swingset-runner ought to be somewhat faster. Especially if we modify the DB a little more and make it forget about the vats we don't care about.

warner commented May 6, 2021

We're able to reproduce the problem in a somewhat-reduced test environment, where we copy the kernelDB from the failing chain, strip out all the vats except for vattp, and re-run it under swingset-runner. This takes about 2.5min to reproduce, whereas the original chain needed 12 hours of once-a-minute traffic to break.

The problem manifests in a simple forEach loop, working on an Array of either 1 or 3 entries (which came from marshal, so ultimately from JSON.parse), which sometimes exits early for no good reason.

The fun part is that the crank which exhibits the error changes as a function of the code that xsnap is given (supervisor, liveslots, and vat code). We updated our transcript replay code to sense not only when a vat makes a syscall which differs from the ones recorded in the transcript, but also when it fails to make one of the recorded ones. If we run the original code, we see no transcript mismatches, because the crank that showed the error manages to show exactly the same error in the replay. If we modify the code a bit, and the bug happens during some different (earlier) crank, then we see fewer syscalls than we expected, and the replay terminates early with an error.
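
A minimal sketch of that replay check (assumed data shapes and a hypothetical function name, not the actual replay code): compare each syscall the replaying vat makes against the transcript entry at the same position, and also complain if the replay stops short of the recorded count.

  function compareReplayToTranscript(recordedSyscalls, replayedSyscalls) {
    replayedSyscalls.forEach((vsc, i) => {
      if (JSON.stringify(vsc) !== JSON.stringify(recordedSyscalls[i])) {
        throw Error(`syscall ${i} diverged from the transcript`);
      }
    });
    if (replayedSyscalls.length < recordedSyscalls.length) {
      // e.g. only the seqnum-9239 syscalls were emitted out of 9239..9241
      throw Error(
        `replay made ${replayedSyscalls.length} syscalls, transcript has ${recordedSyscalls.length}`,
      );
    }
  }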

The good news is that, from what I can tell, it's a deterministic function of that code, so at least the bug is stable, but it tends to move or go away if you add debugging code. It seems to be insensitive to whitespace.

Our current theory is some sort of memory allocation bug, maybe during GC. Maybe something which is causing the short array to get truncated, or to corrupt the variable holding i.highestDelivered. It took a long time to manifest (12 hours, tens of thousands of messages), and happened in the one XS vat that performs the most work, which means there was more allocation (and more of everything) there.

dckc added the xsnap (the XS execution tool) label May 6, 2021

dckc commented May 6, 2021

The most concise way I have found for packaging up such things is a stand-alone Makefile that requires only that node and yarn are installed; for example:

https://gist.github.com/dckc/19451649697bb7cb231ab5a776fde0d1#file-makefile

p.s. This assumes the bug will show up on machines other than your own. We've seen bugs where the symptoms were hard to reproduce on another machine; the culprit seemed to be that paths were different between machines and the paths made their way into the runtime heap.

warner commented May 7, 2021

Yeah, that's where I'll aim. On @FUDCo's advice, I'm going to start big+fast (à la "I regret this letter is so long, I did not have the time to make it shorter"), and defer refinement/reduction until after we've delivered something sufficient for them to reproduce it.

dckc commented May 13, 2021

When I tried on Monday to reproduce this, I got "anachrophobia strikes vat v14"; I learned that this means I did not manage to reproduce it.

dckc commented May 17, 2021

@warner Moddable sent us a WeakMap GC fix; I checked that in as agoric-labs/moddable@c533fb7 (on the https://github.com/agoric-labs/moddable/tree/weak-gc-fix branch) and I created a https://github.com/Agoric/agoric-sdk/tree/3039-vattp-xs-gc branch that is master as of May 5, noted above (f052169), with the moddable submodule set to the GC fix.

Care to try it out? Or maybe I could figure out how...

erights commented May 27, 2021

Which weakmap gc problem is this supposed to fix?

dckc commented May 28, 2021

Which weakmap gc problem is this supposed to fix?

Er... The one where vattp sometimes drops messages, as detailed at length above.

rowgraus commented Jun 21, 2021

In today's testnet meeting:
BW: we have a change that's supposed to fix this but we haven't verified it
DC: right; I think it's fixed but I'm not certain.
BW: probably ok to release without further work on this

Deprioritizing for now

rowgraus removed this from the Testnet: Stress Test Phase milestone Jun 21, 2021

dckc commented Aug 2, 2021

@warner do I have access to the bug3021-05may logs? If not, please use the 1st key in https://github.com/dckc.keys to grant me access?

And please sketch how to use swingset-runner to reproduce the problem "in vitro"?
