Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Raft::trigger()::allow_next_revert() allow to reset replication for next detected follower log revert #1259

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

drmingdrmer
Copy link
Member

@drmingdrmer drmingdrmer commented Oct 15, 2024

Changelog

Feature: Raft::trigger()::allow_next_revert() allow to reset replication for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

  • allow=true: This method instructs the RaftCore to allow the target
    node's log to revert to a previous state for one time.

  • allow=false: This method instructs the RaftCore to panic if the
    target node's log revert

Behavior

  • If this node is the Leader, it will attempt to replicate logs to the
    target node from the beginning.
  • If this node is not the Leader, the request is ignored.
  • If the target node is not found, the request is ignored.

Automatic Replication Reset

When the loosen-follower-log-revert feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:


This change is Reviewable

…ation for next detected follower log revert

This method requests the RaftCore to allow to reset replication for a
specific node when log revert is detected.

- `allow=true`: This method instructs the RaftCore to allow the target
  node's log to revert to a previous state for one time.

- `allow=false`: This method instructs the RaftCore to panic if the
  target node's log revert

### Behavior

- If this node is the Leader, it will attempt to replicate logs to the
  target node from the beginning.
- If this node is not the Leader, the request is ignored.
- If the target node is not found, the request is ignored.

### Automatic Replication Reset

When the `loosen-follower-log-revert` feature flag is enabled, the
Leader automatically reset replication if it detects that the target
node's log has reverted. This feature is primarily useful in testing
environments.

### Production Considerations

In production environments, state reversion is a critical issue that
should not be automatically handled. However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes). In such cases, the Leader should
reinitialize replication for that node with the following steps:

- Shut down the target node.
- call [`Self::allow_next_revert`] on the Leader.
- Clear the target node's data directory.
- Restart the target node.

- Fix: databendlabs#1251
Copy link
Collaborator

@schreter schreter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay for the review. We had our first demo of our storage system (on top of openraft) to our big boss yesterday, so it was a bit hectic.

In general, it looks good, but see the comments.

However, there may be scenarios
where a Follower's data is intentionally removed and needs to rejoin the
cluster(without membership changes).

Not only completely removed - they could be also reset to an older (valid) state, e.g., when the last log segment is lost, in case of point-in-time recovery (loss of updates since some time), etc. But, that shouldn't matter - the follower will get logs from a common point-in-time with the leader, right? I.e., not from the beginning (unless it was completely cleared).

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @drmingdrmer)


openraft/src/engine/handler/replication_handler/mod.rs line 235 at r1 (raw file):

        let Some(prog_entry) = self.leader.progress.get_mut(&target) else {
            tracing::warn!(
                "target node {} not found in progress tracker, when {}",

Hm, should this be a warning trace or should the API return an error detailing the reason why it was not done (i.e., not a leader or follower not found)?


openraft/src/progress/entry/mod.rs line 146 at r1 (raw file):

                );

                self.matching = None;

Here, shouldn't it set it to conflict instead? That's the last common log entry, right? In that case, the replication will continue from this point, right? But not sure, I'm not that deep in the internals of the state machine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make loosen-follower-log-revert a runtime functionality instead of a feature
2 participants