Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(swingset): don't deduplicate inbound mailbox messages #3492

Merged
merged 2 commits into from
Jul 22, 2021

Conversation

warner
Copy link
Member

@warner warner commented Jul 19, 2021

The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to deliverInbound would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive ack all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471

@warner warner added the SwingSet package: SwingSet label Jul 19, 2021
@warner warner added this to the Testnet: Metering Phase milestone Jul 19, 2021
@warner warner requested a review from FUDCo July 19, 2021 06:36
@warner warner self-assigned this Jul 19, 2021
@warner
Copy link
Member Author

warner commented Jul 19, 2021

@FUDCo I recommend reviewing the two commits separately, the first is just refactoring.

Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fewer bugs in less code. What's not to like?

@warner warner force-pushed the 3471-mailbox-no-dedup branch 3 times, most recently from 7b8886c to 950901e Compare July 22, 2021 06:55
@warner warner enabled auto-merge July 22, 2021 06:55
no behavioral changes, although the two tests now run in parallel whereas
before they were serialized (probably unnecessarily)
The mailbox device tracking the highest inbound message number and ack for
each peer, to de-duplicate repeated messages, so it could reduce the amount
of kernel activity. Each call to `deliverInbound` would return a boolean to
indicate whether the messages/ack were new, and thus the kernel needed to be
cycled.

However, the device was holding this tracking data in non-durable state, so
if/when the kernel was restarted, the state would be lost. A duplicate
message/ack arriving in the restarted process would trigger kernel activity
that would not have run in the original process. These extra cranks caused
diverge between validators when one of them was restarted, and the client
sent a duplicate message (such as the pre-emptive `ack` all clients send at
startup). The extra crank does not get very far, because vattp does its own
deduplication, so the divergence was only visible in the slog. But when #3442
is implemented, even a single extra crank will flag the validator as out of
consensus.

The fix is to remove the mailbox device's dedup code, and rely upon vattp for
this function. The test was also updated to match, and a new test (comparing
two parallel kernels, one restarted, one not) was added.

closes #3471
@warner warner merged commit c18908b into master Jul 22, 2021
@warner warner deleted the 3471-mailbox-no-dedup branch July 22, 2021 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
SwingSet package: SwingSet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mailbox device nondeterminism after kernel restart
2 participants