consensus failure in testnet4.5 on RPC node peer-2 at block 9292 #3739

Closed
warner opened this issue Aug 21, 2021 · 4 comments · Fixed by #3740
Labels
bug Something isn't working

Comments

@warner
Member

warner commented Aug 21, 2021

@michaelfig reported that one of our testnet4.5 ("metering phase") RPC nodes halted with a consensus error at block 9292. With the new #3442 kernel activityhash in place, the node stops just before it would commit the non-conforming swingset state changes.
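
(Roughly speaking, the activityhash folds every kernel action into a running hash that lives in consensus state, so any divergent delivery changes the hash and the mismatch is caught at commit time. An illustrative sketch of the idea, not the actual #3442 kernel code:)

import { createHash } from 'crypto';

// Illustrative only (not the actual #3442 implementation): fold each crank
// into a running hash that is stored as part of consensus state, so an extra
// or missing delivery changes the hash and the divergence is detected at
// block commit rather than drifting silently.
function updateActivityHash(prevActivityHash, crankDescription) {
  return createHash('sha256')
    .update(prevActivityHash)
    .update(JSON.stringify(crankDescription))
    .digest('hex');
}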

I looked at the slogfiles from the failed node and from one that continued to work. The difference is that the failed node was given an extra cosmic-swingset-deliver-inbound in block 9292. This is what swingset sees in response to a cosmos-level signed transaction coming from a solo node. The host calls mailbox.deliverInbound(peer, msg), and the mailbox device pushes a delivery to vattp onto the run-queue. Then, when controller.run() is executed, this delivery is made. The specific delivery (as reported in the slogfile) was:

{
  "type": "crank-start",
  "message": {
    "type": "send",
    "target": "ko33",
    "msg": {
      "method": "deliverInboundMessages",
      "args": {
        "body": "[\"agoric19ta405rhge63q4delmuy376azee3jsr4x85k67\",[[13,\"13:11:deliver:ro+51033:makeEmptyPurse:rp-50;[]\"]]]",
        "slots": []
      },
      "result": null
    }
  }
}
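
(For anyone unfamiliar with that host-side flow, here is a simplified sketch. The import paths, config shape, and device wiring are approximations of the SwingSet host API, not the actual cosmic-swingset code.)

import { buildVatController, loadBasedir } from '@agoric/swingset-vat';
// NOTE: the exact module path of the mailbox device is an assumption.
import { buildMailboxStateMap, buildMailbox } from '@agoric/swingset-vat/src/devices/mailbox.js';

async function makeHost(basedir) {
  const mailboxState = buildMailboxStateMap();
  const mb = buildMailbox(mailboxState);
  const config = loadBasedir(basedir);
  // Register the mailbox device with the kernel (config shape approximate).
  config.devices = { mailbox: { sourceSpec: mb.srcPath } };
  const controller = await buildVatController(config, [], {
    deviceEndowments: { mailbox: { ...mb.endowments } },
  });

  // Called when a cosmos-level signed transaction carries inbound mailbox
  // messages from a solo node.
  async function deliverInbound(peer, messages, ack) {
    // This only *enqueues* a delivery to vattp on the kernel run-queue...
    mb.deliverInbound(peer, messages, ack);
    // ...and nothing executes (no crank-start appears in the slog) until the
    // host cranks the kernel, normally once per block.
    await controller.run();
  }

  return { deliverInbound };
}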

In the non-failing ("good") node, that same delivery appears in block 9294. The cosmos txn which provides this delivery also appears in block 9294 (according to ag-chain-cosmos query block 9294 on a monitoring node that has not observed any problems).

I'm really puzzled as to how this message could be delivered two blocks early. If the failing node had seen the message arrive late, I might suspect a bug in the cosmic-swingset block-manager, causing a queued message to be missed and delivered in the next block that contained any messages (note that 9293 was empty).

But for the failing node to see the delivery earlier than the succeeding/consensus nodes... one weak hypothesis is that we have a block-manager bug that fails far more often than not, and the "failing" node is actually the one validator which didn't see the delay. That seems really unlikely.

An even weaker hypothesis is that some block proposer suggested a 9292-A which included the delivery, the vote failed, but the "bad" RPC node thought it succeeded and executed the block anyway.

The same thing seemed to happen to a second RPC node, at block 10200. In this case, the extra message was:

{
  "type": "crank-start",
  "message": {
    "type": "send",
    "target": "ko33",
    "msg": {
      "method": "deliverInboundMessages",
      "args": {
        "body": "[\"agoric14pz3f6ae0jgqtxt484z2s4ezwk4r8lupu9d2c5\",[[2646,\"2646:2632:deliver:ro+45626:getPayout:rp-2332;[\\\"Collateral\\\"]\"],[2647,\"2647:2632:deliver:ro+45626:getPayout:rp-2333;[\\\"RUN\\\"]\"]]]",
        "slots": []
      },
      "result": null
    }
  }
}

In this case the monitoring node saw the corresponding cosmos txn in block 10202.

cc @michaelfig

@warner added the bug (Something isn't working) label Aug 21, 2021
@warner
Member Author

warner commented Aug 21, 2021

Hm, it's likely that these RPC nodes were in use by @mhofman's load-generator, and these might be loadgen transactions, which means they might have been the very first nodes anywhere to see the txns, in case that provides any clues.

@mhofman
Member

mhofman commented Aug 21, 2021 via email

@michaelfig
Member

I have a hypothesis. It looks like the transaction simulation on the golang side may have been causing an actual SwingSet delivery. I'm working on this now to confirm.

@michaelfig
Member

> I have a hypothesis. It looks like the transaction simulation on the golang side may have been causing an actual SwingSet delivery. I'm working on this now to confirm.

I've confirmed this and have a PR in the works to make it extremely difficult to accidentally invoke the SwingSet controller during a transaction simulation.
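
(The hazard, roughly: transaction simulation runs the same message handlers as real execution, so an unconditional forward across the bridge makes the JS side perform a real delivery that was never part of a committed block. The actual fix in #3740 is on the Go side; purely as an illustration, a defensive check on the JS side of the bridge could look like the sketch below, where the action names and the inBlock flag are illustrative rather than the real bridge protocol.)

// Illustrative defense only, NOT the change in #3740 (which lives on the Go
// side): refuse to execute deliveries that do not arrive inside a real,
// committed block.
function makeBridgeHandler(performDeliverInbound) {
  let inBlock = false;
  return function handleBridgeAction(action) {
    switch (action.type) {
      case 'BEGIN_BLOCK':
        inBlock = true;
        break;
      case 'END_BLOCK':
        inBlock = false;
        break;
      case 'DELIVER_INBOUND':
        if (!inBlock) {
          // A delivery outside BEGIN_BLOCK/END_BLOCK can only come from
          // simulation or some other out-of-consensus path; executing it
          // would diverge from the validators, as happened at block 9292.
          throw Error('refusing out-of-block DELIVER_INBOUND');
        }
        performDeliverInbound(action);
        break;
      default:
        throw Error(`unknown action type ${action.type}`);
    }
  };
}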
