Potential Linearizability Violation in linearizableReadNotify During Concurrent Read and Write Operations #18521

MrDXY · 2024-09-01T15:10:02Z

Problem Description:

I've encountered a scenario while working with the ETCD Read Index that I believe might expose a potential flaw in the linearizableReadNotify mechanism. However, I am not entirely certain if my understanding is correct, and I would greatly appreciate any insights or clarifications from the community.

Consider the following scenario, in which three requests are sent almost concurrently to a three-node ETCD cluster. All three requests target the same key, which is initially set to v1. The leader and the followers all have an apply index of 100 and a commit index of 100.

goroutine-1 initiates a linearizable read by calling linearizableReadNotify, targeting the follower node.
goroutine-2 proposes a write to change the key's value to v2, targeting the leader node.
goroutine-3 initiates another linearizable read by calling linearizableReadNotify, also targeting the same follower node as goroutine-1.
All three requests arrive in sequence, but they arrive at the cluster very close in time.

The sequence of operations is as follows:

goroutine-1 sends a signal to readwaitc and begins waiting for the readNotifier to be notified. This triggers the linearizableReadLoop to request the current commit index from the leader.
goroutine-2 proposes a write request, and the leader appends the change to its raft log (unstable). The leader then tries to broadcast the entry to the followers.
goroutine-3 finds that readwaitc is blocked and falls back to waiting on the same readNotifier instance as goroutine-1, because their requests arrived almost simultaneously. (A read lock does not prevent two goroutines from obtaining the same notifier.)
The leader receives the read index request and, after broadcasting heartbeats to a majority of the cluster members to confirm its leadership, replies with its commit index of 100 (since the proposed write has not yet been acknowledged by a majority of the cluster).
Meanwhile, goroutine-2's write is replicated to a majority of the nodes, advancing the leader's commit index to 101. However, the entry has only been replicated; it has not yet been applied on the followers.
goroutine-1 and goroutine-3, after being notified that the follower apply index (100) is at least as high as the read index (100), proceed to read the key's value. Both retrieve the old value v1 from the follower.
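
The steps above can be replayed with a small toy model. This is only a sketch under stated assumptions: the member type and the runScenario function are hypothetical names invented for illustration, not etcd code, and the model keeps nothing but the commit index, the apply index, and the key's value.

```go
package main

import "fmt"

// member is a hypothetical, heavily simplified stand-in for one cluster node.
type member struct {
	commitIndex  uint64
	appliedIndex uint64
	value        string // value of the key as of appliedIndex
}

// runScenario replays the sequence described above and returns the read
// index handed out by the leader and the value the follower serves.
func runScenario() (readIdx uint64, got string) {
	leader := &member{commitIndex: 100, appliedIndex: 100, value: "v1"}
	follower := &member{commitIndex: 100, appliedIndex: 100, value: "v1"}

	// The leader answers the read index request before the write commits.
	readIdx = leader.commitIndex

	// The write then commits on the leader; the follower has not applied it.
	leader.commitIndex = 101

	// The follower serves the read as soon as appliedIndex >= readIdx.
	if follower.appliedIndex >= readIdx {
		got = follower.value
	}
	return readIdx, got
}

func main() {
	idx, v := runScenario()
	fmt.Printf("read index %d, value %s\n", idx, v)
}
```

In this model both reads are served at read index 100 and return v1, which matches the outcome described in the last step.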

Issues Identified:

goroutine-3 Reads Stale Data: Although goroutine-3 is initiated after goroutine-2, it still reads the old value v1. This contradicts the linearizability guarantee, as goroutine-3 should have observed the updated value v2 given that the write by goroutine-2 logically precedes it.

Commit Index Update Timing: Conversely, if the leader's commit index had advanced to 101 before it answered the read index request for goroutine-1 and goroutine-3, both would read the new value v2. However, this would violate the linearizability guarantee for goroutine-1, which should observe the value v1.

Related code (simplified)

func (s *EtcdServer) linearizableReadLoop() {
  for {
    requestID := s.reqIDGen.Next()
    select {
    case <-s.readwaitc:
    case <-s.stopping:
      return
    }


    nextnr := newNotifier()
    s.readMu.Lock()
    nr := s.readNotifier
    s.readNotifier = nextnr
    s.readMu.Unlock()

    confirmedIndex, err := s.requestCurrentIndex(requestID)
    if err != nil {
      nr.notify(err)
      continue
    }

    appliedIndex := s.getAppliedIndex()

    if appliedIndex < confirmedIndex {
      select {
      case <-s.applyWait.Wait(confirmedIndex):
      case <-s.stopping:
        return
      }
    }
    // unblock all l-reads requested at indices before confirmedIndex
    nr.notify(nil)
  }
}

func (s *EtcdServer) linearizableReadNotify(ctx context.Context) error {
  s.readMu.RLock()
  nc := s.readNotifier
  s.readMu.RUnlock()

  // signal linearizable loop for current notify if it hasn't been already
  select {
  case s.readwaitc <- struct{}{}:
  default:
  }

  // wait for read state notification
  select {
  case <-nc.c:
    return nc.err
  case <-ctx.Done():
    return ctx.Err()
  case <-s.done:
    return errors.ErrStopped
  }
}

type notifier struct {
  c   chan struct{}
  err error
}

func newNotifier() *notifier {
  return &notifier{
    c: make(chan struct{}),
  }
}

func (nc *notifier) notify(err error) {
  nc.err = err
  close(nc.c)
}
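
To see why goroutine-1 and goroutine-3 can share one notifier, here is a minimal runnable sketch that reuses the notifier type from the simplified code above. The waitAll helper is a hypothetical name added for the demonstration; it is not part of etcd.

```go
package main

import (
	"fmt"
	"sync"
)

type notifier struct {
	c   chan struct{}
	err error
}

func newNotifier() *notifier {
	return &notifier{c: make(chan struct{})}
}

// notify stores the result and closes the channel, which wakes every waiter.
func (nc *notifier) notify(err error) {
	nc.err = err
	close(nc.c)
}

// waitAll (hypothetical helper) starts n goroutines that wait on the same
// notifier, notifies once, and returns how many waiters observed a nil error.
func waitAll(n int) int {
	nr := newNotifier()
	var wg sync.WaitGroup
	var mu sync.Mutex
	woken := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-nr.c // returns for every waiter once the channel is closed
			if nr.err == nil {
				mu.Lock()
				woken++
				mu.Unlock()
			}
		}()
	}
	nr.notify(nil)
	wg.Wait()
	return woken
}

func main() {
	fmt.Println(waitAll(2))
}
```

Closing the channel acts as a broadcast, so a single read index round trip unblocks every read that picked up that notifier.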

ahrtr · 2024-09-02T09:49:55Z

Maintainer

Thanks for raising this question, but I do not see any issue.

It's good to dig into the source code, but it helps to first understand linearizability from its definition, independently of the source code (refer to https://en.wikipedia.org/wiki/Linearizability).

A history is linearizable if there is a linear order σ of the completed operations such that:

For every completed operation in σ, the operation returns the same result in the execution as the operation would return if every operation was completed one by one in order σ.
If an operation op1 completes (gets a response) before op2 begins (invokes), then op1 precedes op2 in σ.

The first point above specifically addresses Serializability. For a more detailed understanding of this concept, please refer to this example, which explains it thoroughly.

The second point addresses read-after-write consistency. It guarantees that the result of a write operation will be visible to subsequent read operations, but only if the reads are issued after the write operation has been completed and acknowledged. I think that this might be what you missed.

Serializability is a concept traditionally associated with database systems, ensuring that concurrent transactions yield results equivalent to some serial order. Linearizability, on the other hand, originates from distributed systems, ensuring that operations appear to occur instantaneously in some global order, and is often considered a stricter model in terms of real-time ordering. However, with many modern database systems also being distributed, the distinction between these concepts is less clear, and they often overlap in practice.
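
The two conditions in the definition quoted above can be checked mechanically for a small history on a single key. The following is a brute-force sketch, not a production checker. The op type, the function names, and the logical timestamps in the two histories are all assumptions made for illustration; the original post does not give exact invocation and response times.

```go
package main

import "fmt"

// op is one completed operation on a single register. The invoke and done
// fields are logical timestamps chosen for illustration only.
type op struct {
	name         string
	invoke, done int
	write        bool
	value        string // value written, or value returned by a read
}

// linearizable searches for an order σ that satisfies both conditions and
// returns it together with true, or an empty order and false.
func linearizable(ops []op, initial string) ([]string, bool) {
	used := make([]bool, len(ops))
	order := make([]string, 0, len(ops))
	var search func(cur string) bool
	search = func(cur string) bool {
		if len(order) == len(ops) {
			return true
		}
		for i, o := range ops {
			if used[i] || mustFollowPending(ops, used, i) {
				continue
			}
			next := cur
			if o.write {
				next = o.value
			} else if o.value != cur {
				continue // condition 1: a read must return the current value
			}
			used[i] = true
			order = append(order, o.name)
			if search(next) {
				return true
			}
			used[i] = false
			order = order[:len(order)-1]
		}
		return false
	}
	ok := search(initial)
	return order, ok
}

// mustFollowPending enforces condition 2: op i cannot be placed next while an
// unplaced operation completed before op i was invoked.
func mustFollowPending(ops []op, used []bool, i int) bool {
	for j, p := range ops {
		if j != i && !used[j] && p.done < ops[i].invoke {
			return true
		}
	}
	return false
}

// overlapping models the scenario from the question: all three operations
// overlap in time, and both reads return v1.
var overlapping = []op{
	{name: "read-1", invoke: 1, done: 6, value: "v1"},
	{name: "write-2", invoke: 2, done: 7, write: true, value: "v2"},
	{name: "read-3", invoke: 3, done: 6, value: "v1"},
}

// afterAck is a contrasting history: read-3 starts only after write-2 has
// completed, yet still returns v1.
var afterAck = []op{
	{name: "read-1", invoke: 1, done: 4, value: "v1"},
	{name: "write-2", invoke: 2, done: 5, write: true, value: "v2"},
	{name: "read-3", invoke: 6, done: 8, value: "v1"},
}

func main() {
	fmt.Println(linearizable(overlapping, "v1"))
	fmt.Println(linearizable(afterAck, "v1"))
}
```

Under these assumed timestamps, the overlapping history is linearizable with the order read-1, read-3, write-2, while the second history is not, because read-3 begins after write-2 was acknowledged and must therefore observe v2.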

serathius Sep 2, 2024
Maintainer

I agree with @ahrtr and would like to add some points to clarify the situation. I would recommend reading https://jepsen.io/consistency/models/linearizable to get a better understanding of the topic.

Linearizability guarantees a consistent ordering of operations based on their real-time completion; however, the constraints on the order of operations are less strict than you might think. If operation B starts after operation A completes, then B must be ordered after A. However, if operations A and B overlap in time, their order is not determined. Linearizability means that operations appear to take place atomically at some point within the time of their execution. If A and B were executed concurrently, the database can order them as it wants.

For the cases you mentioned:

goroutine-3 Reads Stale Data: It's okay to return a snapshot of read index 100 as long as that index was the latest during the time we executed the read. The fact that we didn't manage to return the response before the index increased to 101 is okay; we just ordered the read before the write.
Commit Index Update Timing: Linearizability isn't about goroutines and what they observe. It's about what the client observes.

MrDXY · 2024-09-02T15:44:38Z

Author

Thank you all for your responses!
I took some time to thoroughly read the article on linearizability that @ahrtr and @serathius suggested.

"It guarantees that the result of a write operation will be visible to subsequent read operations, but only if the reads are issued after the write operation has been completed and acknowledged."

This cleared up the question that had been bothering me for days.

Linearizability emphasizes the point at which an operation is considered complete. Its real-time bounds guarantee that a change is visible to other participants once the operation has finished. This means that each read reflects a state that was current at some point between the invocation and the completion of the read.

If a write operation completes before the read is invoked, the read must see the change.
If a write operation is invoked after the read completes, the read cannot see it.
If a write operation overlaps the read, for example by completing between the read's invocation and completion, the result depends on which one took effect first. (My question)
I’m truly grateful to everyone—this new understanding has been incredibly valuable!
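
The three cases can be written as a small classification over invocation and completion times. This is a sketch under assumptions: the interval type and the visibility function are hypothetical names, and the timestamps are arbitrary logical times. Note that the "cannot see" case is keyed on the write's invocation, since a write that merely completes after the read may still overlap it.

```go
package main

import "fmt"

// interval is the logical time span of one operation, from invocation to
// completion. Both fields are illustrative timestamps.
type interval struct{ invoke, done int }

// visibility classifies what a linearizable read may observe of a write.
func visibility(w, r interval) string {
	switch {
	case w.done < r.invoke:
		return "must see" // the write completed before the read began
	case r.done < w.invoke:
		return "cannot see" // the write began after the read completed
	default:
		return "may or may not see" // the operations overlap in time
	}
}

func main() {
	read := interval{invoke: 10, done: 20}
	fmt.Println(visibility(interval{invoke: 1, done: 5}, read))
	fmt.Println(visibility(interval{invoke: 25, done: 30}, read))
	fmt.Println(visibility(interval{invoke: 12, done: 18}, read))
}
```

The scenario from the original question falls into the third case, which is why reading v1 is a valid outcome.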
