This repository has been archived by the owner on Oct 17, 2022. It is now read-only.

[refactor / possible bug] worker synchronizer to reply back for available batches #837

Merged
merged 2 commits into main from refactor/worker-reply-on-sync-messages on Aug 24, 2022

Conversation

akichidis
Contributor

@akichidis akichidis commented Aug 22, 2022

Section 1

Currently, when we ask the worker synchronizer to synchronize a batch (PrimaryWorkerMessage::Synchronize), it first checks in storage whether the batch is already there.

  1. If not, it asks the other worker peers to fetch it. The primary will be notified about the fetched batch id once a peer worker has sent it.
  2. If yes, it basically does nothing about it. It presumably assumes that, since the batch has been found in storage, it was received from a peer worker and the primary will be notified anyway once the processor handles it. Now, if for some reason that worker --> primary notification message is not successfully processed, there is no mechanism to re-trigger it and no way for the primary to get notified about the batch availability - although the batch is in the worker. (A condensed sketch of this flow follows the list.)
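A condensed sketch of this current behaviour; the types below are simplified stand-ins, not the real worker APIs:

```rust
use std::collections::HashMap;

type Digest = [u8; 32];

/// What the worker synchronizer does today for each requested batch id.
enum SyncAction {
    /// (1) Not in store: fetch from peer workers; the primary only hears back
    /// once a peer's batch has been received and processed.
    RequestFromPeers,
    /// (2) Already in store: nothing is sent back to the primary, assuming some
    /// other process already notified (or will notify) it.
    DoNothing,
}

struct Synchronizer {
    store: HashMap<Digest, Vec<u8>>, // batch id -> batch payload
}

impl Synchronizer {
    fn handle_synchronize(&self, digest: &Digest) -> SyncAction {
        match self.store.get(digest) {
            None => SyncAction::RequestFromPeers,
            Some(_) => SyncAction::DoNothing, // the gap this PR closes
        }
    }
}
```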

Section 2

The above case (2) can create deadlock situations. In DevNet we recently saw an issue with payload synchronization during the recent incident, which I believe is an artifact of the above behaviour. The scenario is the following:

We assume that we have two validators, (A) & (B).

  • Validator (A) worker creates a batch with the hypothetical hash kxmsamyzXZxRPc+C and forwards it to its primary.
  • Validator (A) primary includes this in a header proposal. Let's assume that a certificate is produced, sequenced and even executed. For the execution we fetch the payload via the block_waiter, which first checks whether the payload is available via the block_synchronizer. The block_synchronizer sees that the header was created by us (Validator A) and doesn't need to check the near-store cache payload_store, but rather assumes the payload is already in the worker - so the block_waiter can just continue and fetch the batch - all OK.
  • Now Validator (B) worker creates a batch with the exact same hash kxmsamyzXZxRPc+C - yes, this currently happens in DevNet (since the same transactions are posted). The worker broadcasts the batch to the other workers and just waits for a quorum of them to reply back - once the quorum replies it even cancels the network handlers. Let's assume that because of that - or even some network glitch - Validator (A) hasn't received the batch.
  • Because Validator (A) hasn't received Validator (B)'s batch, it never initiates any process to forward the batch id to its primary node. So the batch id kxmsamyzXZxRPc+C is still missing from Validator (A)'s payload_store.
  • Now Validator (B) creates a header -> certificate -> it gets sequenced and needs to get executed. Validator (A) tries to execute the certificate, which references the batch id kxmsamyzXZxRPc+C. The block_synchronizer checks the payload_store for the batch availability (since it's not our header; this check is sketched after the list). The batch is missing, so it dictates its worker to synchronize it from the other worker peers.
  • Because of the current behaviour described in Section 1, the primary will never manage to successfully get the notification about the payload availability, as there is no way to re-trigger it.
  • Result: Validator (A) stops its execution and blocks forever.
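To make the decision point explicit, here is roughly the availability check described above; the types and field names are simplified stand-ins, not the real block_synchronizer API:

```rust
use std::collections::HashSet;

type PublicKey = String;
type BatchDigest = String;

enum PayloadDecision {
    /// Header authored by us: assume the payload is already in our worker.
    AssumedInWorker,
    /// Someone else's header and the digest is in payload_store: all good.
    Available,
    /// Someone else's header and the digest is missing: ask the worker to
    /// synchronize it from peer workers (the request that loops forever in
    /// the scenario above).
    AskWorkerToSync,
}

struct BlockSynchronizer {
    name: PublicKey,                     // our own identity
    payload_store: HashSet<BatchDigest>, // near cache of known batch ids
}

impl BlockSynchronizer {
    fn check_payload(&self, author: &PublicKey, digest: &BatchDigest) -> PayloadDecision {
        if author == &self.name {
            PayloadDecision::AssumedInWorker
        } else if self.payload_store.contains(digest) {
            PayloadDecision::Available
        } else {
            PayloadDecision::AskWorkerToSync
        }
    }
}
```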

From DevNet, the logs below tell that story - admittedly in a quite abstract way, but that's what I'm basing this on:

Validator A
[Screenshot: Validator A logs, 2022-08-22 19:06:53]

We see that Validator A will indefinitely send, via the block_synchronizer, a request to its worker to sync the batch kxmsamyzXZxRPc+C.....

Validator B
[Screenshot: Validator B logs, 2022-08-22 19:07:02]

We see that Validator B manages to successfully make progress and request the batch from its worker (see the log line primary::block_waiter: Sending batch kxmsamyzXZxRPc+C request to worker id 0).

Resolution

This PR attempts to resolve this issue by making the worker synchronizer reply back to the primary immediately when a batch is found available in the store, rather than making any assumptions about other processes that might be running to notify the primary.

Also, an additional fix/improvement is made to ensure that the synchronizer will not send synchronization messages to other workers when the vector of missing batch ids is empty - we have also seen this happening in DevNet. A condensed sketch of the resulting flow follows.
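A minimal, self-contained sketch of the intended flow after this PR; the types, channel and store below are simplified stand-ins for the real worker synchronizer:

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

type Digest = u64;
type WorkerId = u32;

#[derive(Debug)]
enum WorkerPrimaryMessage {
    OthersBatch(Digest, WorkerId),
}

struct Synchronizer {
    id: WorkerId,
    store: HashMap<Digest, Vec<u8>>,
    tx_primary: mpsc::Sender<WorkerPrimaryMessage>,
}

impl Synchronizer {
    /// Returns the digests that still need to be fetched from peer workers.
    async fn handle_synchronize(&self, digests: Vec<Digest>) -> Vec<Digest> {
        let mut missing = Vec::new();
        for digest in digests {
            if self.store.contains_key(&digest) {
                // Reply back immediately: don't assume some other process
                // will eventually notify the primary about this batch.
                let msg = WorkerPrimaryMessage::OthersBatch(digest, self.id);
                let _ = self.tx_primary.send(msg).await;
            } else {
                missing.push(digest);
            }
        }
        // The caller only contacts peer workers when `missing` is non-empty,
        // so no empty synchronization requests go out.
        missing
    }
}
```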

PS: A similar issue to the above can arise even when the worker restarts (so any pending requests from worker to primary are lost), or when for some reason the primary doesn't store the batch id in the payload_store.

TODO

  • Follow-up PR to store all the batch ids in the near cache payload_store in the primary, irrespective of whether "we" or another primary created them. We'll need to confirm we won't break some other assumptions there - which I believe we won't.

Contributor

@huitseeker huitseeker left a comment


Thanks @akichidis, this looks great, and it does cover an important edge case! I would simply ask for a test on the PayloadReceiver checking that sending the same (digest, worker_id) pair twice results in no errors (and the pair being written to the store).
It sounds trite, but it would make it explicit that we do expect idempotent behavior from this component.
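For what it's worth, the rough shape I have in mind - note that test_store, spawn_payload_receiver and random_digest here are placeholders for whatever test fixtures the crate actually provides, not real APIs:

```rust
#[tokio::test]
async fn writing_same_digest_twice_is_idempotent() {
    // Placeholder fixtures: a store handle and a running PayloadReceiver
    // wired to an mpsc channel.
    let store = test_store();
    let (tx, rx) = tokio::sync::mpsc::channel(10);
    spawn_payload_receiver(store.clone(), rx);

    let digest = random_digest();
    let worker_id = 0u32;

    // Sending the same (digest, worker_id) pair twice should not error...
    tx.send((digest.clone(), worker_id)).await.unwrap();
    tx.send((digest.clone(), worker_id)).await.unwrap();

    // ...and the pair should end up written to the store.
    assert!(store.read((digest, worker_id)).await.unwrap().is_some());
}
```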

Nit:
This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

debug!("Requesting sync for batch {digest}");
},
Ok(Some(_)) => {
// The batch arrived in the meantime: no need to request it.
available.insert(*digest);
debug!("Digest {digest} already in store, nothing to sync");
Contributor


This may end up being quite verbose, do we need this log at any level above trace?

Contributor Author


Yeah, I know it will be verbose, but given the current behaviour in DevNet I err on the side of having extra logs so we can trace any related issues - it was a bit hard to diagnose with the current level of logging. That being said, I'll change this to trace; we can enable the trace level in the worker and switch back once we don't need it any more.

// worker and message has been missed by primary).
for digest in available {
    let message = WorkerPrimaryMessage::OthersBatch(digest, self.id);
    _ = self.tx_primary.send(message).await.tap(|err| {
Contributor


The send returns a Result<(), SendError<T>> so I think you want to use tap_err here, as this closure would be applied on the whole Result otherwise.
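A tiny illustration of the difference, assuming the tap crate's Tap and TapFallible traits (the demo function and its String error type are made up just for this example):

```rust
use tap::{Tap, TapFallible};

fn demo(result: Result<(), String>) {
    // `.tap` hands the closure the whole `&Result<(), String>`:
    let _ = result.clone().tap(|r| println!("whole result: {r:?}"));

    // `.tap_err` only runs on the error branch, with `&String` (the error):
    let _ = result.tap_err(|e| println!("send failed: {e}"));
}
```

Applied to the snippet above, .tap_err(|err| { ... }) would make the closure fire only when the send actually fails.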

Contributor Author


Good point!

@akichidis
Contributor Author

> Thanks @akichidis, this looks great, and it does cover an important edge case! I would simply ask for a test on the PayloadReceiver checking that sending the same (digest, worker_id) pair twice results in no errors (and the pair being written to the store). It sounds trite, but it would make it explicit that we do expect idempotent behavior from this component.
>
> Nit: This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

Very good point @huitseeker! I've amended the PR with tests on the payload_receiver. I also enhanced the tests on the worker processor to ensure the idempotent behaviour there as well.

@akichidis akichidis force-pushed the refactor/worker-reply-on-sync-messages branch from cc6578d to 4b141ff on August 23, 2022 11:33
@akichidis
Contributor Author

> Nit: This risks invalidating the batches_received metric, and if we want to be really fancy we can increase the PayloadToken in the PayloadStore to reflect that we are adding the same digest several times, but that's probably overkill (and will be refactored away if we make authorities commit to batch rounds).

@huitseeker Ah yeah, this could create some skew, but to be honest I don't expect it to affect the grand scheme of things here that much (and the current logging can confirm that expectation). Also, as you said, the refactoring to make the authorities commit to batch rounds will introduce several changes, so we can have a look as part of that - or after it.

@akichidis akichidis merged commit 437b3b2 into main Aug 24, 2022
@akichidis akichidis deleted the refactor/worker-reply-on-sync-messages branch August 24, 2022 09:09
huitseeker pushed a commit to huitseeker/narwhal that referenced this pull request Aug 25, 2022
…able batches (MystenLabs#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.
huitseeker pushed a commit that referenced this pull request Aug 25, 2022
…able batches (#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 30, 2022
…able batches (MystenLabs/narwhal#837)

Refactored worker synchronizer to reply back immediately when batch available to ensure we'll not end up in deadlock situations in primary. Also refactored the code to not send sync requests to peer workers with empty payload.