consensus failure in testnet4.5 on RPC node peer-2 at block 9292 #3739

Closed
warner opened this issue Aug 21, 2021 · 4 comments · Fixed by #3740
Labels
bug Something isn't working

Comments

@warner
Member

warner commented Aug 21, 2021

@michaelfig reported that one of our testnet4.5 ("metering phase") RPC nodes halted with a consensus error at block 9292. With the new #3442 kernel activityhash in place, the node stops just before it would commit the non-conforming swingset state changes.
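
(Roughly speaking, the activityhash folds every kernel action into a running hash that lives in consensus state, so any divergent delivery changes the hash and the mismatch is caught at commit time. An illustrative sketch of the idea, not the actual #3442 kernel code:)

import { createHash } from 'crypto';

// Illustrative only (not the actual #3442 implementation): fold each crank
// into a running hash that is stored as part of consensus state, so an extra
// or missing delivery changes the hash and the divergence is detected at
// block commit rather than drifting silently.
function updateActivityHash(prevActivityHash, crankDescription) {
  return createHash('sha256')
    .update(prevActivityHash)
    .update(JSON.stringify(crankDescription))
    .digest('hex');
}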

I looked at the slogfiles from the failed node and from one that continued to work. The difference is that the failed node was given an extra cosmic-swingset-deliver-inbound in block 9292. This is what swingset sees in response to a cosmos-level signed transaction coming from a solo node. The host calls mailbox.deliverInbound(peer, msg), and the mailbox device pushes a delivery to vattp onto the run-queue. Then, when controller.run() is executed, this delivery is made. The specific delivery (as reported in the slogfile) was:

{
  "type": "crank-start",
  "message": {
    "type": "send",
    "target": "ko33",
    "msg": {
      "method": "deliverInboundMessages",
      "args": {
        "body": "[\"agoric19ta405rhge63q4delmuy376azee3jsr4x85k67\",[[13,\"13:11:deliver:ro+51033:makeEmptyPurse:rp-50;[]\"]]]",
        "slots": []
      },
      "result": null
    }
  }
}
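
(For anyone unfamiliar with that host-side flow, here is a simplified sketch. The import paths, config shape, and device wiring are approximations of the SwingSet host API, not the actual cosmic-swingset code.)

import { buildVatController, loadBasedir } from '@agoric/swingset-vat';
// NOTE: the exact module path of the mailbox device is an assumption.
import { buildMailboxStateMap, buildMailbox } from '@agoric/swingset-vat/src/devices/mailbox.js';

async function makeHost(basedir) {
  const mailboxState = buildMailboxStateMap();
  const mb = buildMailbox(mailboxState);
  const config = loadBasedir(basedir);
  // Register the mailbox device with the kernel (config shape approximate).
  config.devices = { mailbox: { sourceSpec: mb.srcPath } };
  const controller = await buildVatController(config, [], {
    deviceEndowments: { mailbox: { ...mb.endowments } },
  });

  // Called when a cosmos-level signed transaction carries inbound mailbox
  // messages from a solo node.
  async function deliverInbound(peer, messages, ack) {
    // This only *enqueues* a delivery to vattp on the kernel run-queue...
    mb.deliverInbound(peer, messages, ack);
    // ...and nothing executes (no crank-start appears in the slog) until the
    // host cranks the kernel, normally once per block.
    await controller.run();
  }

  return { deliverInbound };
}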

In the non-failing ("good") node, that same delivery appears in block 9294. The cosmos txn which provides this delivery also appears in block 9294 (according to ag-chain-cosmos query block 9294 on a monitoring node that has not observed any problems).

I'm really puzzled as to how this message could be delivered two blocks early. If the failing node had seen the message arrive late, I might suspect a bug in the cosmic-swingset block-manager, causing a queued message to be missed and delivered in the next block that contained any messages (note that 9293 was empty).

But for the failing node to see the delivery earlier than the succeeding/consensus nodes... one weak hypothesis is that we have a block-manager bug that fails far more often than not, and the "failing" node is actually the one validator which didn't see the delay. That seems really unlikely.

An even weaker hypothesis is that some block proposer suggested a 9292-A which included the delivery, the vote failed, but the "bad" RPC node thought it succeeded and executed the block anyway.

The same thing seemed to happen to a second RPC node, at block 10200. In this case, the extra message was:

{
  "type": "crank-start",
  "message": {
    "type": "send",
    "target": "ko33",
    "msg": {
      "method": "deliverInboundMessages",
      "args": {
        "body": "[\"agoric14pz3f6ae0jgqtxt484z2s4ezwk4r8lupu9d2c5\",[[2646,\"2646:2632:deliver:ro+45626:getPayout:rp-2332;[\\\"Collateral\\\"]\"],[2647,\"2647:2632:deliver:ro+45626:getPayout:rp-2333;[\\\"RUN\\\"]\"]]]",
        "slots": []
      },
      "result": null
    }
  }
}

In this case the monitoring node saw the corresponding cosmos txn in block 10202.

cc @michaelfig

@warner added the bug (Something isn't working) label Aug 21, 2021
@warner
Member Author

warner commented Aug 21, 2021

Hm, it's likely that these RPC nodes were in use by @mhofman's load-generator, and these might be loadgen transactions, which means they might have been the very first nodes anywhere to see the txns, in case that provides any clues.

@mhofman
Member

mhofman commented Aug 21, 2021 via email

@michaelfig
Member

I have a hypothesis. It looks like the transaction simulation on the golang side may have been causing an actual SwingSet delivery. I'm working on this now to confirm.

@michaelfig
Member

> I have a hypothesis. It looks like the transaction simulation on the golang side may have been causing an actual SwingSet delivery. I'm working on this now to confirm.

I've confirmed this and have a PR in the works to make it extremely difficult to accidentally invoke the SwingSet controller during a transaction simulation.
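
(The hazard, roughly: transaction simulation runs the same message handlers as real execution, so an unconditional forward across the bridge makes the JS side perform a real delivery that was never part of a committed block. The actual fix in #3740 is on the Go side; purely as an illustration, a defensive check on the JS side of the bridge could look like the sketch below, where the action names and the inBlock flag are illustrative rather than the real bridge protocol.)

// Illustrative defense only, NOT the change in #3740 (which lives on the Go
// side): refuse to execute deliveries that do not arrive inside a real,
// committed block.
function makeBridgeHandler(performDeliverInbound) {
  let inBlock = false;
  return function handleBridgeAction(action) {
    switch (action.type) {
      case 'BEGIN_BLOCK':
        inBlock = true;
        break;
      case 'END_BLOCK':
        inBlock = false;
        break;
      case 'DELIVER_INBOUND':
        if (!inBlock) {
          // A delivery outside BEGIN_BLOCK/END_BLOCK can only come from
          // simulation or some other out-of-consensus path; executing it
          // would diverge from the validators, as happened at block 9292.
          throw Error('refusing out-of-block DELIVER_INBOUND');
        }
        performDeliverInbound(action);
        break;
      default:
        throw Error(`unknown action type ${action.type}`);
    }
  };
}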
