@michaelfig reported one of our testnet4.5 ("metering phase") RPC nodes halted with a consensus error at block 9292. With the new #3442 kernel activityhash in place, the node will stop just before it commits the non-conforming swingset state changes.
I looked at the slogfiles from the failed node and from one that continued to work. The difference is that the failed node was given an extra cosmic-swingset-deliver-inbound in block 9292. This is what swingset sees in response to a cosmos-level signed transaction coming from a solo node. The host calls mailbox.deliverInbound(peer, msg), and the mailbox device pushes a delivery to vattp onto the run-queue. Then, when controller.run() is executed, this delivery is made. The specific delivery (as reported in the slogfile) was:
In the non-failing ("good") node, this same delivery appears in block 9294. The cosmos txn which provides this delivery also appears in block 9294 (according to ag-chain-cosmos query block 9294 on a monitoring node that has not observed any problems).
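The slogfile comparison described above can be sketched roughly as follows. This is only an illustration: the `SlogEntry` shape, the `sender`/`count`/`blockHeight` field names, and the `cosmic-swingset-begin-block` marker are assumptions for the sketch, not a confirmed description of the slog format. Only the `cosmic-swingset-deliver-inbound` type name comes from the report itself.

```typescript
// Hypothetical slog entry shape; field names are assumptions for illustration.
interface SlogEntry {
  type: string;
  blockHeight?: number; // assumed to be present on block-boundary entries
  sender?: string; // assumed to identify the solo node on deliver-inbound entries
  count?: number; // assumed per-sender sequence number
}

interface Divergence {
  key: string;
  goodBlock: number | undefined;
  badBlock: number;
}

// Map each deliver-inbound (keyed by sender and count) to the block it ran in.
function deliveriesByBlock(entries: SlogEntry[]): Map<string, number> {
  const result = new Map<string, number>();
  let currentBlock: number | undefined;
  for (const e of entries) {
    if (e.type === 'cosmic-swingset-begin-block' && e.blockHeight !== undefined) {
      currentBlock = e.blockHeight;
    } else if (e.type === 'cosmic-swingset-deliver-inbound' && currentBlock !== undefined) {
      result.set(`${e.sender}#${e.count}`, currentBlock);
    }
  }
  return result;
}

// Report deliveries that the failing node placed in a different block
// than the good node did (or that the good node never saw at all).
function findDivergence(good: SlogEntry[], bad: SlogEntry[]): Divergence[] {
  const goodMap = deliveriesByBlock(good);
  const out: Divergence[] = [];
  deliveriesByBlock(bad).forEach((badBlock, key) => {
    const goodBlock = goodMap.get(key);
    if (goodBlock !== badBlock) {
      out.push({ key, goodBlock, badBlock });
    }
  });
  return out;
}
```

Fed the parsed slog lines of both nodes, a helper like this would flag a delivery that one node ran in block 9292 and the other in 9294.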
I'm really puzzled as to how this message could be delivered two blocks early. If the failing node saw the message arrive late, then I might suspect a bug in the cosmic-swingset block-manager, causing a queued message to be missed, and delivered on the next block with any messages (note that 9293 was empty).
But for the failing node to see the delivery earlier than the succeeding/consensus nodes... One weak hypothesis is that we've got a block-manager bug that fails far more often than not, and the "failing" node is actually the one validator which didn't see the delay. That seems really unlikely.
An even weaker hypothesis is that some block proposer suggested a 9292-A which included the delivery, the vote failed, but the "bad" RPC node thought it succeeded and executed the block anyway.
The same thing seemed to happen to a second RPC node, at block 10200. In this case, the extra message was:
Hm, it's likely that these RPC nodes were in use by @mhofman's load-generator, and these might be loadgen transactions, which means they might have been the very first nodes anywhere to see the txns, in case that provides any clues.
I have a hypothesis. It looks like the transaction simulation on the golang side may have been causing an actual SwingSet delivery. I'm working on this now to confirm.
I've confirmed this and have a PR in the works to make it extremely difficult to accidentally invoke the SwingSet controller during a transaction simulation.
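One way to picture that kind of guard is a wrapper that refuses to touch the controller unless the caller has explicitly entered a "real delivery" mode. The sketch below is purely illustrative and is not the code from the PR. The actual fix is on the golang side, and every name here (`GuardedController`, `ExecMode`, `withMode`) is hypothetical.

```typescript
// All names here are hypothetical; this is not the code from the actual PR.
type ExecMode = 'simulate' | 'deliver';

interface Controller {
  deliverInbound(peer: string, msg: string): void;
}

// Wraps a controller so that calls made outside an explicit 'deliver'
// scope throw instead of changing SwingSet state.
class GuardedController implements Controller {
  private mode: ExecMode = 'simulate'; // fail closed: simulation is the default
  private inner: Controller;

  constructor(inner: Controller) {
    this.inner = inner;
  }

  // Run fn with the given mode, restoring the previous mode afterwards.
  withMode<T>(mode: ExecMode, fn: () => T): T {
    const prev = this.mode;
    this.mode = mode;
    try {
      return fn();
    } finally {
      this.mode = prev;
    }
  }

  deliverInbound(peer: string, msg: string): void {
    if (this.mode !== 'deliver') {
      throw new Error('refusing to invoke the controller during simulation');
    }
    this.inner.deliverInbound(peer, msg);
  }
}
```

Defaulting to `'simulate'` means a code path that forgets to declare its mode fails loudly rather than silently performing a delivery that other nodes will not see until the transaction is actually included in a block.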
For the second RPC node's extra message (block 10200), the monitoring node saw the corresponding cosmos txn in block 10202.
cc @michaelfig