
khepri_cluster: Fix race condition in the reset code #294

Merged: 1 commit merged into main from fix-reset-race-condition on Sep 11, 2024

Conversation

dumbbell (Member)

Why

The khepri_cluster:reset() function is mainly used to uncluster a node. To make sure that both parties (the leaving node and the rest of the cluster) see the same cluster membership in the end, we perform two resets:

  1. We remove the leaving member from a remote member (if the node is clustered).
  2. We remove the leaving member from its own view of the cluster.

This way, if the leaving node was out of sync about the cluster membership (because it lost its state, for instance), we are sure that, in the end, everyone agrees.
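
To make the flow concrete, here is a minimal, hypothetical Erlang sketch of the two-step reset described above. The helpers pick_remote_member/2 and remove_member/2 are illustrative placeholders, not Khepri's actual internals:

```erlang
%% A minimal, hypothetical sketch of the two-step reset described above.
%% pick_remote_member/2 and remove_member/2 are illustrative placeholders,
%% not Khepri's actual internals.
reset(StoreId, LeavingNode) ->
    LeavingMember = {StoreId, LeavingNode},
    %% Step 1: if the node is clustered, ask a remote member to remove the
    %% leaving member from the cluster.
    case pick_remote_member(StoreId, LeavingNode) of
        {ok, RemoteMember} -> ok = remove_member(RemoteMember, LeavingMember);
        none               -> ok
    end,
    %% Step 2: remove the leaving member from its own view of the cluster,
    %% so both sides agree on the final membership.
    remove_member(LeavingMember, LeavingMember).
```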

However, when the leaving node is removed using a remote member, that member will stop the leaving Ra server. Therefore, when we attempt the second removal on the leaving node, we might get an {error, noproc} error because the Ra process has already exited.

How

We adopt the same solution as the error handling done with wait_for_leader(): if ra:remove_member() returns the noproc error, we consider that to be ok and proceed with the reset.
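
As a rough illustration (not the exact patch), the noproc case might be tolerated like this; the wrapper function name is hypothetical, and only ra:remove_member/2 and the {error, noproc} return shape come from the description above:

```erlang
%% Sketch only: tolerate noproc when the leaving Ra server was already
%% stopped by the remote member during the first removal step.
remove_member_or_ignore_noproc(RaServer, MemberToRemove) ->
    case ra:remove_member(RaServer, MemberToRemove) of
        {ok, _, _} ->
            ok;
        {error, noproc} ->
            %% The Ra process already exited; treat this as success and
            %% proceed with the reset.
            ok;
        {error, _} = Error ->
            Error;
        {timeout, _} = Timeout ->
            Timeout
    end.
```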

This should fix a rare transient failure that I saw in CI but was never able to reproduce locally.

@dumbbell dumbbell added the bug Something isn't working label Sep 11, 2024
@dumbbell dumbbell added this to the v0.16.0 milestone Sep 11, 2024
@dumbbell dumbbell self-assigned this Sep 11, 2024

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.53%. Comparing base (c5722bf) to head (8491582).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/khepri_cluster.erl | 0.00% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #294      +/-   ##
==========================================
- Coverage   89.68%   89.53%   -0.15%     
==========================================
  Files          21       21              
  Lines        3188     3192       +4     
==========================================
- Hits         2859     2858       -1     
- Misses        329      334       +5     
| Flag | Coverage Δ |
| --- | --- |
| erlang-25 | 88.78% <0.00%> (+0.04%) ⬆️ |
| erlang-26 | 89.44% <0.00%> (+0.04%) ⬆️ |
| erlang-27 | 89.47% <0.00%> (-0.12%) ⬇️ |
| os-ubuntu-latest | 89.53% <0.00%> (-0.05%) ⬇️ |
| os-windows-latest | 89.53% <0.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@dumbbell dumbbell marked this pull request as ready for review September 11, 2024 16:45
@dumbbell (Member, Author)

The patch coverage is 0% because a follow-up change to khepri_cluster will greatly increase the chance of hitting this error (100% for me locally). This is how the problem was discovered. However, I wanted to commit this one first because CI will likely fail for that other patch.

@dumbbell dumbbell merged commit 80ef2a3 into main Sep 11, 2024
12 checks passed
@dumbbell dumbbell deleted the fix-reset-race-condition branch September 11, 2024 16:48