
invariants gap: committing a stale snapshot #157

Open
pav-kv opened this issue Feb 6, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@pav-kv
Contributor

pav-kv commented Feb 6, 2024

The gap is yet to be confirmed; it's unclear whether it can happen. Discovered while cleaning up the raftLog/unstable structures.

The unstable structure tracks log entries and a snapshot in flight to storage.

The undocumented invariant seems to be: if there is a snapshot, then offset == snapshot.Metadata.Index + 1. The intuition behind this invariant is that the snapshot is a "wide entry" at index offset - 1 that immediately precedes all the entries. Correspondingly, the snapshot must be written to storage before the entries.
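To make the invariant concrete, here is a minimal sketch (not the actual etcd-io/raft code; the struct and field names are simplified, with the snapshot reduced to just its index):

```go
package main

import "fmt"

// Simplified stand-in for the unstable structure: snapshotIndex is -1
// when no snapshot is in flight, and offset is the index of the first
// unstable entry.
type unstable struct {
	snapshotIndex int64
	offset        int64
}

// invariantHolds reports whether the undocumented invariant holds:
// if a snapshot is present, the first unstable entry must immediately
// follow it, i.e. offset == snapshot index + 1.
func invariantHolds(u unstable) bool {
	if u.snapshotIndex < 0 {
		return true // no snapshot: nothing to check
	}
	return u.offset == u.snapshotIndex+1
}

func main() {
	fmt.Println(invariantHolds(unstable{snapshotIndex: 5, offset: 6})) // entries start right after the snapshot
	fmt.Println(invariantHolds(unstable{snapshotIndex: 5, offset: 3})) // entries rewound below the snapshot: violated
}
```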

We maintain this invariant during the lifecycle of unstable, and rely on it when computing the log size and the terms of entries at specific indices.
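For example, term lookup can treat the snapshot as the "wide entry" at index offset - 1. The following sketch is illustrative only (simplified types, hypothetical field names), but it shows how a lookup of that shape silently depends on the invariant:

```go
package main

import "fmt"

// Simplified stand-in for the unstable structure; snapshotIndex is -1
// when no snapshot is in flight, and terms holds the terms of the
// unstable entries starting at index offset.
type unstable struct {
	snapshotIndex int64
	snapshotTerm  int64
	offset        int64
	terms         []int64
}

// maybeTerm returns the term of the entry at index i, if unstable still
// knows it. The snapshot answers for index snapshotIndex, which the
// invariant guarantees to be exactly offset-1.
func (u *unstable) maybeTerm(i int64) (int64, bool) {
	if u.snapshotIndex >= 0 && i == u.snapshotIndex {
		return u.snapshotTerm, true
	}
	if i < u.offset || i >= u.offset+int64(len(u.terms)) {
		return 0, false
	}
	return u.terms[i-u.offset], true
}

func main() {
	u := &unstable{snapshotIndex: 5, snapshotTerm: 2, offset: 6, terms: []int64{2, 3}}
	fmt.Println(u.maybeTerm(5)) // answered from the snapshot metadata
	fmt.Println(u.maybeTerm(7)) // answered from the entries
}
```

If a stale snapshot were left in place with offset < snapshotIndex + 1, a lookup like this could answer from the snapshot for an index that the entries have since overwritten.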

However, there is one place where the invariant can be violated:

raft/log_unstable.go, lines 202 to 208 at 073f90d:

```go
case fromIndex <= u.offset:
	u.logger.Infof("replace the unstable entries from index %d", fromIndex)
	// The log is being truncated to before our current offset
	// portion, so set the offset and replace the entries.
	u.entries = ents
	u.offset = fromIndex
	u.offsetInProgress = u.offset
```

When fromIndex < u.offset (strictly less), we don't erase u.snapshot, and as a result it's possible that u.offset < u.snapshot.Metadata.Index + 1. It's also possible that the snapshot index exceeds lastIndex().

It's unclear whether the above situation can happen, and what the consequences are. Best case: it can't happen, because the layers above never submit an append at an index below unstable.snapshot. Worst case: it can happen, and entries or snapshots can be written or committed out of order.

In any case, we need to tighten the invariants here, and fix this code to be safer.
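One possible shape of a safer version, sketched here as an assumption (the actual fix in etcd-io/raft may differ): when the append rewinds the log to or before the in-flight snapshot's index, that snapshot is stale and must not reach storage, so the branch drops it.

```go
package main

import "fmt"

// Simplified stand-in for the unstable structure; snapshotIndex is -1
// when no snapshot is in flight, and entries holds entry indices only,
// for illustration.
type unstable struct {
	snapshotIndex int64
	offset        int64
	entries       []int64
}

func (u *unstable) truncateAndAppend(fromIndex int64, ents []int64) {
	switch {
	case fromIndex <= u.offset:
		// Replace the unstable entries and rewind the offset,
		// as in the branch quoted above.
		u.entries = ents
		u.offset = fromIndex
		// Added guard: a snapshot at index >= fromIndex no longer
		// precedes the entries, so drop it as stale.
		if u.snapshotIndex >= fromIndex {
			u.snapshotIndex = -1
		}
	default:
		// Other cases elided.
	}
}

func main() {
	u := &unstable{snapshotIndex: 2, offset: 3}
	u.truncateAndAppend(1, []int64{1, 2})
	fmt.Println(u.snapshotIndex) // the stale snapshot was dropped
}
```

An alternative design would be to reject such an append outright and surface the violation, rather than silently discarding the snapshot; which is right depends on what the layers above guarantee.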

To be investigated.

pav-kv added the enhancement label on Feb 6, 2024
@MrDXY
Contributor

MrDXY commented Mar 18, 2024

Hi @pav-kv, I investigated the question you posted and found some interesting information.

Can this situation happen? Yes.

To clarify, I came up with two questions:

  1. When will I receive unstable entries?
    I checked the code and found two places that call the truncateAndAppend function: append and maybeAppend (which calls append internally).
  • append is used directly by the Leader for Propose requests, as well as when the Leader appends an empty entry upon first being elected.
    • The Leader determines the entry index for those requests and places them after r.raftLog.lastIndex(). As a result, violations are unlikely to occur here.
  • maybeAppend is used by a Follower or a Candidate for AppendEntries requests.
    • The function checks the term of each appended entry to find the conflict index, and appends the entries after the conflict index (overwriting the original conflicting entries).
    • This is where a violation can happen.
  2. When will I receive an unstable snapshot?
  • The only function that sets unstable.snapshot is unstable.restore().
    • This function is triggered by MsgSnap.
      MsgSnap is sent when the Leader fails to send append entries to a Follower because it cannot find the term for the previous index. Matching the previous index's term is a design requirement of the Raft consensus algorithm: the previous entry's index and term must agree on both sides. If the term lookup fails, it is usually because a new snapshot has compacted some entries in storage; if the previous index falls within the compacted portion of the log, its term can no longer be found.
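The leader-side fallback in point 2 can be sketched as follows (hypothetical helper name; the real decision in etcd-io/raft involves a term lookup error, but the condition reduces to a comparison against the first retained index):

```go
package main

import "fmt"

// needSnapshot reports whether the Leader must fall back to MsgSnap:
// once prevIndex has been compacted away (prevIndex < firstIndex, the
// first index still present after compaction), its term can no longer
// be looked up, so the AppendEntries consistency check cannot be built.
func needSnapshot(prevIndex, firstIndex int64) bool {
	return prevIndex < firstIndex
}

func main() {
	// After a snapshot at index 2, firstIndex is 3. A follower whose
	// log ends at index 0 has prevIndex 0, which is compacted:
	fmt.Println(needSnapshot(0, 3)) // Leader must send a snapshot
	fmt.Println(needSnapshot(3, 3)) // term lookup still possible
}
```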

Example:
Let's say there are three nodes; A is the Leader in Term 1. A's last log index is 2, B's is 2, and C's is 0. Now A sends log[1, 2] to C; call this message AE1.

   0 1 2 3 4 5
A  1 1 1
B  1 1 1
C  1

While AE1 is still in flight over the network, A is triggered to create a new snapshot at index 2. Let's call it s.

   0 1 2 3 4 5
A      s
B  1 1 1
C  1

Then Leader A tries to send the next round of logs to C; since the entries up to index 2 have been compacted into s, it sends the snapshot instead (MsgSnap), and the situation becomes:

   0 1 2 3 4 5
A      s
B  1 1 1
C      s

Now AE1 arrives at C. C is still in Term 1 and its unstable entries are empty, so it accepts log[1, 2]:

   0 1 2 3 4 5
A      s
B  1 1 1
C      s
     1 1

A stale log, with indices below the snapshot index, has been appended to the unstable entries.
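The state transitions of node C in the scenario above can be replayed with a simplified model (illustrative types only, entries reduced to their indices):

```go
package main

import "fmt"

// Simplified stand-in for C's unstable structure.
type unstable struct {
	snapshotIndex int64 // -1 when no snapshot is in flight
	offset        int64
	entries       []int64
}

func main() {
	// C starts with an empty unstable log.
	u := unstable{snapshotIndex: -1, offset: 1}

	// MsgSnap arrives first: restore() records snapshot s at index 2
	// and sets offset to 3 (the index right after the snapshot).
	u.snapshotIndex, u.offset, u.entries = 2, 3, nil

	// The delayed AE1 carries entries at indices 1..2. The
	// `fromIndex <= u.offset` branch replaces the entries and rewinds
	// the offset without touching the snapshot.
	fromIndex := int64(1)
	u.entries, u.offset = []int64{1, 2}, fromIndex

	// The invariant offset == snapshotIndex+1 is now violated:
	fmt.Println(u.offset < u.snapshotIndex+1) // stale snapshot left in place
}
```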

Please kindly let me know if I have missed something or made a mistake.

Will it cause chaos? TODO.
To answer this, we need to consider who consumes the unstable data and who receives the Ready structure:

  • What is the impact on the WAL?
  • Does it affect the apply process?
