This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[refactor / possible bug] worker synchronizer to reply back for available batches #837

Merged
merged 2 commits into main from refactor/worker-reply-on-sync-messages on Aug 24, 2022

Conversation

akichidis
Contributor

@akichidis akichidis commented Aug 22, 2022

Section 1

Currently, when we ask the worker synchronizer to synchronize a batch (PrimaryWorkerMessage::Synchronize), it first checks in storage whether the batch is already there.

  1. If not, it asks the other worker peers to fetch it. The primary will be notified about the fetched batch id once a peer worker has sent it.
  2. If yes, it basically does nothing about it. It presumably assumes that, since the batch has been found in storage, it was received from a peer worker and the primary will be notified anyway once the processor handles it. Now, if for some reason that worker --> primary notification message is not successfully processed, there is no mechanism to re-trigger it and no way for the primary to get notified about the batch availability - although the batch is in the worker. (A condensed sketch of this flow follows the list.)
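A condensed sketch of this current behaviour; the types below are simplified stand-ins, not the real worker APIs:

```rust
use std::collections::HashMap;

type Digest = [u8; 32];

/// What the worker synchronizer does today for each requested batch id.
enum SyncAction {
    /// (1) Not in store: fetch from peer workers; the primary only hears back
    /// once a peer's batch has been received and processed.
    RequestFromPeers,
    /// (2) Already in store: nothing is sent back to the primary, assuming some
    /// other process already notified (or will notify) it.
    DoNothing,
}

struct Synchronizer {
    store: HashMap<Digest, Vec<u8>>, // batch id -> batch payload
}

impl Synchronizer {
    fn handle_synchronize(&self, digest: &Digest) -> SyncAction {
        match self.store.get(digest) {
            None => SyncAction::RequestFromPeers,
            Some(_) => SyncAction::DoNothing, // the gap this PR closes
        }
    }
}
```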

Section 2

The above case (2) can create deadlock situations. In DevNet we recently saw an issue with payload synchronization during the recent incident, which I believe is an artifact of the above behaviour. The scenario is the following:

We assume that we have two validators, (A) & (B).

  • Validator (A) worker creates a batch with the hypothetical hash kxmsamyzXZxRPc+C and forwards it to its primary.
  • Validator (A) primary includes this in a header proposal. Let's assume that a certificate is produced, sequenced and even executed. For the execution we fetch the payload via the block_waiter, which first checks whether the payload is available via the block_synchronizer. The block_synchronizer sees that the header was created by us (Validator A) and doesn't need to check the near-store cache payload_store, but rather assumes the payload is already in the worker - so the block_waiter can just continue and fetch the batch - all OK.
  • Now Validator (B) worker creates a batch with the exact same hash kxmsamyzXZxRPc+C - yes, this currently happens in DevNet (since the same transactions are posted). The worker broadcasts the batch to the other workers and just waits for a quorum of them to reply back - once the quorum replies it even cancels the network handlers. Let's assume that because of that - or even some network glitch - Validator (A) hasn't received the batch.
  • Because Validator (A) hasn't received Validator (B)'s batch, it never initiates any process to forward the batch id to its primary node. So the batch id kxmsamyzXZxRPc+C is still missing from Validator (A)'s payload_store.
  • Now Validator (B) creates a header -> certificate -> it gets sequenced and needs to get executed. Validator (A) tries to execute the certificate, which references the batch id kxmsamyzXZxRPc+C. The block_synchronizer checks the payload_store for the batch availability (since it's not our header; this check is sketched after the list). The batch is missing, so it dictates its worker to synchronize it from the other worker peers.
  • Because of the current behaviour described in Section 1, the primary will never manage to successfully get the notification about the payload availability, as there is no way to re-trigger it.
  • Result: Validator (A) stops its execution and blocks forever.
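To make the decision point explicit, here is roughly the availability check described above; the types and field names are simplified stand-ins, not the real block_synchronizer API:

```rust
use std::collections::HashSet;

type PublicKey = String;
type BatchDigest = String;

enum PayloadDecision {
    /// Header authored by us: assume the payload is already in our worker.
    AssumedInWorker,
    /// Someone else's header and the digest is in payload_store: all good.
    Available,
    /// Someone else's header and the digest is missing: ask the worker to
    /// synchronize it from peer workers (the request that loops forever in
    /// the scenario above).
    AskWorkerToSync,
}

struct BlockSynchronizer {
    name: PublicKey,                     // our own identity
    payload_store: HashSet<BatchDigest>, // near cache of known batch ids
}

impl BlockSynchronizer {
    fn check_payload(&self, author: &PublicKey, digest: &BatchDigest) -> PayloadDecision {
        if author == &self.name {
            PayloadDecision::AssumedInWorker
        } else if self.payload_store.contains(digest) {
            PayloadDecision::Available
        } else {
            PayloadDecision::AskWorkerToSync
        }
    }
}
```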

From DevNet, the logs below tell that story - admittedly in a quite abstract way, but that's what I'm basing this on:

Validator A
[Screenshot: Validator A logs, 2022-08-22 19:06:53]

We see that Validator A will indefinitely send, via the block_synchronizer, a request to its worker to sync the batch kxmsamyzXZxRPc+C.....

Validator B
[Screenshot: Validator B logs, 2022-08-22 19:07:02]

We see that Validator B manages to successfully make progress and request the batch from its worker (see the log line primary::block_waiter: Sending batch kxmsamyzXZxRPc+C request to worker id 0).

Resolution

This PR attempts to resolve this issue by making the worker synchronizer reply back to the primary immediately when a batch is found available in the store, rather than making any assumptions about other processes that might be running to notify the primary.

Also, an additional fix/improvement is made to ensure that the synchronizer will not send synchronization messages to other workers when the vector of missing batch ids is empty - we have also seen this happening in DevNet. A condensed sketch of the resulting flow follows.
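A minimal, self-contained sketch of the intended flow after this PR; the types, channel and store below are simplified stand-ins for the real worker synchronizer:

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

type Digest = u64;
type WorkerId = u32;

#[derive(Debug)]
enum WorkerPrimaryMessage {
    OthersBatch(Digest, WorkerId),
}

struct Synchronizer {
    id: WorkerId,
    store: HashMap<Digest, Vec<u8>>,
    tx_primary: mpsc::Sender<WorkerPrimaryMessage>,
}

impl Synchronizer {
    /// Returns the digests that still need to be fetched from peer workers.
    async fn handle_synchronize(&self, digests: Vec<Digest>) -> Vec<Digest> {
        let mut missing = Vec::new();
        for digest in digests {
            if self.store.contains_key(&digest) {
                // Reply back immediately: don't assume some other process
                // will eventually notify the primary about this batch.
                let msg = WorkerPrimaryMessage::OthersBatch(digest, self.id);
                let _ = self.tx_primary.send(msg).await;
            } else {
                missing.push(digest);
            }
        }
        // The caller only contacts peer workers when `missing` is non-empty,
        // so no empty synchronization requests go out.
        missing
    }
}
```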

PS: A similar issue to the above can arise even when the worker restarts (so any pending requests from worker to primary are lost), or when for some reason the primary doesn't store the batch id in the payload_store.

TODO

  • Follow-up PR to store all the batch ids in the near cache payload_store in the primary, irrespective of whether "we" or another primary created them. We'll need to confirm we won't break some other assumptions there - which I believe we won't.

Contributor

@huitseeker huitseeker left a comment


Thanks @akichidis, this looks great, and it does cover an important edge case! I would simply ask for a test on the PayloadReceiver checking that sending the same (digest, worker_id) pair twice results in no errors (and the pair being written to the store).
It sounds trite, but it would make it explicit that we do expect idempotent behavior from this component.
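For what it's worth, the rough shape I have in mind - note that test_store, spawn_payload_receiver and random_digest here are placeholders for whatever test fixtures the crate actually provides, not real APIs:

```rust
#[tokio::test]
async fn writing_same_digest_twice_is_idempotent() {
    // Placeholder fixtures: a store handle and a running PayloadReceiver
    // wired to an mpsc channel.
    let store = test_store();
    let (tx, rx) = tokio::sync::mpsc::channel(10);
    spawn_payload_receiver(store.clone(), rx);

    let digest = random_digest();
    let worker_id = 0u32;

    // Sending the same (digest, worker_id) pair twice should not error...
    tx.send((digest.clone(), worker_id)).await.unwrap();
    tx.send((digest.clone(), worker_id)).await.unwrap();

    // ...and the pair should end up written to the store.
    assert!(store.read((digest, worker_id)).await.unwrap().is_some());
}
```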

Nit:
This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

debug!("Requesting sync for batch {digest}");
},
Ok(Some(_)) => {
// The batch arrived in the meantime: no need to request it.
available.insert(*digest);
debug!("Digest {digest} already in store, nothing to sync");
Contributor


This may end up being quite verbose, do we need this log at any level above trace?

Contributor Author


Yeah, I know it will be verbose, but given the current behaviour in DevNet I err on the side of having extra logs so we can trace any related issues - it was a bit hard to diagnose with the current level of logging. That being said, I'll change this to trace; we can enable the trace level in the worker and switch back once we don't need it any more.

// worker and message has been missed by primary).
for digest in available {
    let message = WorkerPrimaryMessage::OthersBatch(digest, self.id);
    _ = self.tx_primary.send(message).await.tap(|err| {
Contributor


The send returns a Result<(), SendError<T>> so I think you want to use tap_err here, as this closure would be applied on the whole Result otherwise.
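A tiny illustration of the difference, assuming the tap crate's Tap and TapFallible traits (the demo function and its String error type are made up just for this example):

```rust
use tap::{Tap, TapFallible};

fn demo(result: Result<(), String>) {
    // `.tap` hands the closure the whole `&Result<(), String>`:
    let _ = result.clone().tap(|r| println!("whole result: {r:?}"));

    // `.tap_err` only runs on the error branch, with `&String` (the error):
    let _ = result.tap_err(|e| println!("send failed: {e}"));
}
```

Applied to the snippet above, .tap_err(|err| { ... }) would make the closure fire only when the send actually fails.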

Contributor Author


Good point!

@akichidis
Contributor Author

> Thanks @akichidis, this looks great, and it does cover an important edge case! I would simply ask for a test on the PayloadReceiver checking that sending the same (digest, worker_id) pair twice results in no errors (and the pair being written to the store). It sounds trite, but it would make it explicit that we do expect idempotent behavior from this component.
>
> Nit: This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

Very good point @huitseeker! I've amended the PR with tests on the payload_receiver. I also enhanced the tests on the worker processor to ensure the idempotent behaviour there as well.

@akichidis akichidis force-pushed the refactor/worker-reply-on-sync-messages branch from cc6578d to 4b141ff on August 23, 2022 11:33
@akichidis
Contributor Author

> Nit: This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

@huitseeker Ah yeah, this could create some skew, but to be honest I don't expect it to affect the grand scheme of things here that much (and the current logging can confirm that expectation). Also, as you said, the refactoring to make the authorities commit to batch rounds will introduce several changes, so we can have a look as part of that - or after it.

@akichidis akichidis merged commit 437b3b2 into main Aug 24, 2022
@akichidis akichidis deleted the refactor/worker-reply-on-sync-messages branch August 24, 2022 09:09
huitseeker pushed a commit to huitseeker/narwhal that referenced this pull request Aug 25, 2022
…able batches (MystenLabs#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.
huitseeker pushed a commit that referenced this pull request Aug 25, 2022
…able batches (#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
…able batches (MystenLabs/narwhal#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.