khepri_cluster: Fix race condition in the reset code #294
Why
The `khepri_cluster:reset()` function is mainly used to uncluster a node. To make sure that both parties (the leaving node and the rest of the cluster) see the same cluster membership in the end, we perform two resets: one through a remote member of the cluster, then one on the leaving node itself.
This way, if the leaving node was out of sync about the cluster membership, for instance because it lost its state, we are sure that everyone agrees at the end.
However, when the leaving node is removed using a remote member, that member stops the Ra server on the leaving node. Therefore, when we try the second removal on the leaving node, we might get an `{error, noproc}` error because the Ra process has already exited.
How
We adopt the same solution as the error handling done with `wait_for_leader()`: if `ra:remove_member()` returns the `noproc` error, we consider that fine and proceed with the reset.
This should fix a rare transient failure that I saw in CI but was never able to reproduce locally.
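As a rough illustration of the idea, not the actual patch, the handling could look like the sketch below. The module name `reset_sketch`, the function `remove_then_reset/2`, and both fun arguments are hypothetical: `RemoveFun` stands in for the call to `ra:remove_member()` on the leaving node, and `ResetFun` stands in for the local cleanup that follows. The real code in `khepri_cluster` may be structured differently.

```erlang
-module(reset_sketch).
-export([remove_then_reset/2]).

%% Hypothetical helper, for illustration only.
%% RemoveFun: zero-arity fun standing in for the member removal call.
%% ResetFun:  zero-arity fun standing in for the local reset steps.
remove_then_reset(RemoveFun, ResetFun)
  when is_function(RemoveFun, 0), is_function(ResetFun, 0) ->
    case RemoveFun() of
        {ok, _Reply, _Leader} ->
            %% The removal was committed: continue with the reset.
            ResetFun();
        {error, noproc} ->
            %% The local Ra server was already stopped, presumably by the
            %% remote member that removed this node first. Nothing is left
            %% to remove, so continue with the reset.
            ResetFun();
        Other ->
            %% Any other result is returned to the caller unchanged.
            Other
    end.
```

With this shape, a removal that fails with `{error, noproc}` leads to the same outcome as a successful one, while unrelated errors still surface to the caller.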