Feature: `Raft::trigger()::allow_next_revert()` allow to reset replication for next detected follower log revert #1259

drmingdrmer · 2024-10-15T04:54:18Z

Changelog

Feature: `Raft::trigger()::allow_next_revert()` allow to reset replication for next detected follower log revert

This method asks the RaftCore to allow replication to be reset for a
specific node when a log revert is detected.

- `allow=true`: instructs the RaftCore to allow the target node's log to
  revert to a previous state, one time only.
- `allow=false`: instructs the RaftCore to panic if the target node's
  log reverts.
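
The one-shot semantics of the flag can be sketched as follows. This is a minimal model, not the actual openraft code: `ProgressEntry` here is a simplified stand-in, the field `allow_log_reversion` and the `RevertOutcome` type are hypothetical, and the real implementation panics where this sketch returns `Rejected`.

```rust
/// Hypothetical, simplified stand-in for a leader's per-follower progress entry.
#[derive(Debug, Default)]
struct ProgressEntry {
    /// Index of the highest log entry known to match on the follower.
    matching: Option<u64>,
    /// Whether the next detected log revert is tolerated (assumed field name).
    allow_log_reversion: bool,
}

/// Hypothetical result type used only to make the sketch testable.
#[derive(Debug, PartialEq)]
enum RevertOutcome {
    /// Replication state was reset for this follower.
    Reset,
    /// The revert was not permitted; the real code would panic here.
    Rejected,
}

impl ProgressEntry {
    /// Called when the follower reports a log that is behind `matching`.
    fn on_log_revert(&mut self) -> RevertOutcome {
        if self.allow_log_reversion {
            // The permission covers exactly one revert, so consume it.
            self.allow_log_reversion = false;
            self.matching = None;
            RevertOutcome::Reset
        } else {
            RevertOutcome::Rejected
        }
    }
}
```

A second revert after the first one is rejected again unless the permission is granted anew.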

Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.
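
The three cases above can be modelled roughly like this. The `Node` type, its fields, and the `bool` return value are all hypothetical and exist only for illustration; they are not openraft's types.

```rust
use std::collections::BTreeMap;

/// Hypothetical toy node: not openraft's API.
struct Node {
    is_leader: bool,
    /// Target node id -> whether the next log revert is allowed.
    allow_revert: BTreeMap<u64, bool>,
}

impl Node {
    /// Returns `true` if the flag was recorded, `false` if the request was ignored.
    fn allow_next_revert(&mut self, target: u64, allow: bool) -> bool {
        if !self.is_leader {
            // Only a Leader tracks replication progress; ignore the request.
            return false;
        }
        match self.allow_revert.get_mut(&target) {
            Some(flag) => {
                *flag = allow;
                true
            }
            // Unknown target: nothing to update, so the request is ignored.
            None => false,
        }
    }
}
```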

Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically resets replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

Production Considerations

In production environments, state reversion is a critical issue that
should not be handled automatically. However, there may be scenarios
where a Follower's data is intentionally removed and the node needs to
rejoin the cluster (without membership changes). In such cases, the
Leader should reinitialize replication for that node with the following
steps:

- Shut down the target node.
- Call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.
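
The ordering of these steps can be illustrated with a toy two-node model. Everything in it (`Leader`, `Follower`, `replicate`, `rejoin_after_wipe`) is hypothetical and greatly simplified; in particular, the real Leader panics on a disallowed revert, whereas this sketch returns an `Err`.

```rust
/// Hypothetical toy follower: `log` stands in for its data directory.
struct Follower {
    running: bool,
    log: Vec<u64>,
}

/// Hypothetical toy leader tracking a single follower.
struct Leader {
    log: Vec<u64>,
    /// Number of entries the leader believes the follower has stored.
    matching: usize,
    allow_next_revert: bool,
}

impl Leader {
    /// One replication round against the follower.
    fn replicate(&mut self, follower: &mut Follower) -> Result<(), String> {
        if !follower.running {
            return Ok(()); // follower is down: nothing to send
        }
        if follower.log.len() < self.matching {
            // The follower has fewer entries than it acknowledged: a log revert.
            if !self.allow_next_revert {
                return Err("follower log reverted".to_string());
            }
            self.allow_next_revert = false; // one-shot permission
            self.matching = 0; // restart replication from the beginning
        }
        follower.log.truncate(self.matching);
        follower.log.extend_from_slice(&self.log[self.matching..]);
        self.matching = self.log.len();
        Ok(())
    }
}

/// The four steps, in order.
fn rejoin_after_wipe(leader: &mut Leader, follower: &mut Follower) -> Result<(), String> {
    follower.running = false; // 1. shut down the target node
    leader.allow_next_revert = true; // 2. allow the next revert on the Leader
    follower.log.clear(); // 3. clear the target node's data
    follower.running = true; // 4. restart the target node
    leader.replicate(follower)
}
```

In this model, granting the permission before the restart means it is already in place when the Leader first sees the emptied log.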

Fix: Make `loosen-follower-log-revert` a runtime functionality instead of a feature #1251


schreter

Sorry for the delay with the review. We had the first demo of our storage system (built on top of openraft) for our big boss yesterday, so it was a bit hectic.

In general, it looks good, but see the comments.

> However, there may be scenarios
> where a Follower's data is intentionally removed and needs to rejoin the
> cluster(without membership changes).

Not only completely removed: the data could also be reset to an older (valid) state, e.g., when the last log segment is lost, or after a point-in-time recovery (losing all updates since some point in time). But that shouldn't matter: the follower will get logs starting from a common point with the leader, right? I.e., not from the beginning (unless it was completely cleared).
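
To make the "common point" concrete, here is a small sketch of how the last shared entry between two logs could be located. It is only an illustration based on Raft's Log Matching property, not how openraft actually searches; the function name and the representation of a log as a slice of terms are assumptions.

```rust
/// Each log is modelled as a slice of entry terms; position `i` is log index `i`.
/// Returns the highest index at which both logs hold an entry with the same term.
fn last_common_index(leader: &[u64], follower: &[u64]) -> Option<usize> {
    let shared_len = leader.len().min(follower.len());
    // Scan backward: by the Log Matching property, if two logs agree on the
    // term at some index, they also agree on every earlier entry.
    (0..shared_len).rev().find(|&i| leader[i] == follower[i])
}

/// Index from which the leader would resume sending entries in this model.
fn resume_from(leader: &[u64], follower: &[u64]) -> usize {
    last_common_index(leader, follower).map_or(0, |i| i + 1)
}
```

A follower restored to an older valid state resumes after its last entry, while a completely cleared follower resumes from index 0.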

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @drmingdrmer)

openraft/src/engine/handler/replication_handler/mod.rs line 235 at r1 (raw file):

        let Some(prog_entry) = self.leader.progress.get_mut(&target) else {
            tracing::warn!(
                "target node {} not found in progress tracker, when {}",

Hm, should this be a warning trace or should the API return an error detailing the reason why it was not done (i.e., not a leader or follower not found)?
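
Roughly what I have in mind, as a sketch only: the error type, its variants, and the free function below are hypothetical names, not an existing openraft API.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical error telling the caller why the request was not applied.
#[derive(Debug, PartialEq)]
enum RevertRequestError {
    NotLeader,
    NodeNotFound(u64),
}

impl fmt::Display for RevertRequestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::NotLeader => write!(f, "this node is not the leader"),
            Self::NodeNotFound(id) => write!(f, "target node {} not found in progress tracker", id),
        }
    }
}

/// `progress` maps a target node id to its allow-next-revert flag.
fn allow_next_revert(
    is_leader: bool,
    progress: &mut BTreeMap<u64, bool>,
    target: u64,
    allow: bool,
) -> Result<(), RevertRequestError> {
    if !is_leader {
        return Err(RevertRequestError::NotLeader);
    }
    let flag = progress
        .get_mut(&target)
        .ok_or(RevertRequestError::NodeNotFound(target))?;
    *flag = allow;
    Ok(())
}
```

The caller can then distinguish the two failure cases instead of having to look for a warning in the trace output.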

openraft/src/progress/entry/mod.rs line 146 at r1 (raw file):

                );

                self.matching = None;

Here, shouldn't it set it to conflict instead? That's the last common log entry, right? In that case, the replication will continue from this point, right? But not sure, I'm not that deep in the internals of the state machine.

drmingdrmer requested a review from schreter October 15, 2024 04:54

drmingdrmer force-pushed the 161-reset-progress branch from 9fb2f5c to 83690f5 Compare October 15, 2024 04:57

drmingdrmer force-pushed the 161-reset-progress branch from 83690f5 to ab3aa16 Compare October 15, 2024 05:00

schreter approved these changes Oct 18, 2024
