mvcc: allow large concurrent reads under light write workload #9296

xiang90 · 2018-02-07T21:30:09Z

I found that etcd is unable to process large (take more than 100ms) read requests concurrently even under light or no write workload. We already put some effort to solve the current read problem by adding a writeback buffer in front of the database transaction. So I want to fix this problem.

The root cause of the problem is that although we use Read Lock to allow current read go-routines, but we have another long running go-routine that tries to flush data back to disk every 100ms which requires Write Lock.

Obviously, the write lock has a higher priority that read lock, and will always kick in to block any later read locks. So with this behavior, we basically put a barrier for every 100ms to block all reads until the write lock is acquired by the long running go-routine and then released. The write lock can only be acquired after all previous read locks are released, which can be seconds if the reads are large. Thus we can block reads for seconds even when there is no write.

However, if there is no pending write, we do not need to get the write lock at all.

This PR delays the routine to acquire the write lock only when there are pending writes. So under no or light write conditions, large read can be executed concurrently.

/cc @jpbetz @mitake @heyitsanthony

xiang90 · 2018-02-07T21:37:32Z

@mitake

The commit routine is also probably the cause of what you observed #9199.

Basically when you make the range size larger, it blocks the flush longer. Thus the writes batch is larger, and write can be more effective. But it is probably not what we want?

xiang90 · 2018-02-08T00:02:01Z

I formed some new ideas to solve this issue + the problem #9199 tries to solve... so probably wait for my follow up commit :)

mitake · 2018-02-08T05:14:28Z

@xiang90

The commit routine is also probably the cause of what you observed #9199.

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%14s
Summary:
  Total:        14.5203 secs.
  Slowest:      2.6956 secs.
  Fastest:      0.0006 secs.
  Average:      0.0140 secs.
  Stddev:       0.0862 secs.
  Requests/sec: 6886.9280

Response time histogram:
  0.0006 [1]    |
  0.2701 [99799]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.5396 [100]  |
  0.8091 [0]    |
  1.0786 [0]    |
  1.3481 [0]    |
  1.6176 [0]    |
  1.8871 [0]    |
  2.1566 [0]    |
  2.4261 [0]    |
  2.6956 [100]  |

Latency distribution:
  10% in 0.0060 secs.
  25% in 0.0067 secs.
  50% in 0.0089 secs.
  75% in 0.0116 secs.
  90% in 0.0176 secs.
  95% in 0.0292 secs.
  99% in 0.0411 secs.
  99.9% in 2.6756 secs.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

For now, making read txns fine grained in a voluntary manner would be an ad-hoc but reasonable solution, I think.

xiang90 · 2018-02-08T05:29:40Z

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

This PR will only help on large concurrent read request, but not on blocking puts.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

Yea. We should dedicate large read request to another boltdb read tx after a commit happens after we receive the request instead of using the singleton read tx in current backend which is able to read the recent write in the writeback buffer.

heyitsanthony · 2018-02-08T06:10:01Z

internal/mvcc/backend/batch_tx.go

+	// no need to reset read tx if there is no write.
+	// call commit to update stats.
+	if t.batchTx.pending == 0 && !stop {
+		t.batchTx.commit(stop)


commit(false)

heyitsanthony · 2018-02-08T06:12:32Z

internal/mvcc/backend/batch_tx.go

@@ -231,6 +232,13 @@ func (t *batchTxBuffered) CommitAndStop() {
 }

 func (t *batchTxBuffered) commit(stop bool) {
+	// no need to reset read tx if there is no write.
+	// call commit to update stats.
+	if t.batchTx.pending == 0 && !stop {


if this approach isn't working (deadlocking in CI?), it might be easier to move the pending == 0 paths to backend.run() and avoid calling commit() there...

CI panics since the tx initialization code goes through here. I am still playing with the code to see what is the easier. Probably I should try what you just suggested.

codecov-io · 2018-02-08T13:23:01Z

Codecov Report

❗ No coverage uploaded for pull request base (master@f5d02f0). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #9296   +/-   ##
=========================================
  Coverage          ?   75.58%           
=========================================
  Files             ?      365           
  Lines             ?    30695           
  Branches          ?        0           
=========================================
  Hits              ?    23202           
  Misses            ?     5870           
  Partials          ?     1623

Impacted Files	Coverage Δ
internal/mvcc/backend/backend.go	`80% <100%> (ø)`
internal/mvcc/backend/batch_tx.go	`78.22% <100%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5d02f0...3ebee21. Read the comment docs.

xiang90 · 2018-02-08T18:15:42Z

/cc @heyitsanthony can you take another look? change to what you suggested.

gyuho · 2018-02-08T22:58:36Z

internal/mvcc/backend/backend.go

-		b.batchTx.Commit()
+		b.batchTx.Lock()
+		pending := b.batchTx.pending
+		b.batchTx.Unlock()


Q. Doesn't b.batchTx.Unlock already reset pending count?

pending := b.(*batchTxBuffered).pending, where pending is 20k and batchLimit is 10k

Goes to b.(*batchTxBuffered).Unlock()

t.pending >= t.backend.batchLimit

(*batchTxBuffered).commit(false)

(*batchTx).commit(false)

(*batchTx).pending = 0

Shouldn't this be b.batchTx.batchTx.Unlock()?

oh. yea. it should be the lock of the mutex.

gyuho · 2018-02-09T00:07:20Z

Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on backend.readTx. If @mitake can cross-check on this, it would be great.

hexfusion · 2018-02-09T00:11:22Z

Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on backend.readTx.

@gyuho would you mind outlining this process in a little detail? Thanks!

heyitsanthony

lgtm

gyuho · 2018-02-09T16:37:11Z

// if no pending writes, should not block concurrent read transactions
//
// without https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).Commit()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock
//  3. (*backend).(*readTx).RWMutex.Lock
//    - blocks long running read transaction
//
// with https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).safePending()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock

/*
go test -v -run TestRead
*/

func TestReadBlockingWriteCommit(t *testing.T)    { testRead(t, true) }
func TestReadNonBlockingWriteCommit(t *testing.T) { testRead(t, false) }
func testRead(t *testing.T, block bool) {
	keyN := 1000000

	// trigger commit manually, by giving large batch
	// interval and limit
	b, tmppath := NewTmpBackend(time.Hour, 2*keyN)
	defer b.Close()
	defer os.Remove(tmppath)

	b.batchTx.UnsafeCreateBucket([]byte("key"))

	for i := 0; i < keyN; i++ {
		v := make([]byte, 3*1024)
		rand.Read(v)
		b.batchTx.Lock()
		b.batchTx.UnsafePut([]byte("key"), []byte(fmt.Sprintf("foo%10d", i)), v)
		b.batchTx.Unlock()
	}
	// commits all pending writes
	b.batchTx.Commit()
	fmt.Println("pending:", b.batchTx.pending)

	// for long read transactions
	rangeStart, rangeEnd := []byte(fmt.Sprintf("foo%10d", 0)), []byte(fmt.Sprintf("foo%10d", keyN))

	readFunc := func(name string) {
		log.Printf("read %q waiting for Lock", name)

		now := time.Now()
		b.ReadTx().Lock()
		log.Printf("read %q acquired Lock", name)
		rn := time.Now()
		aa, bb := b.ReadTx().UnsafeRange([]byte("key"), rangeStart, rangeEnd, 0)
		b.ReadTx().Unlock()

		log.Printf("read %q finished (took %v, range took %v) %d %d", name, time.Since(now), time.Since(rn), len(aa), len(bb))
	}

	total := time.Now()
	donec := make(chan struct{})

	go func() {
		readFunc("A")
		donec <- struct{}{}
	}()

	go func() {
		time.Sleep(time.Second)
		log.Println("commit starts")

		now := time.Now()
		if block {
			b.batchTx.Commit()
		} else {
			if b.batchTx.safePending() != 0 {
				b.batchTx.Commit()
			}
		}

		log.Printf("commit finished (took %v)", time.Since(now))
		donec <- struct{}{}
	}()

	time.Sleep(2 * time.Second)
	readFunc("B")

	<-donec
	<-donec

	fmt.Println("total:", time.Since(total))
}

func (t *batchTx) safePending() int {
	t.Mutex.Lock()
	defer t.Mutex.Unlock()
	return t.pending
}

LGTM

gyuho · 2018-02-09T23:14:16Z

Instead, compared two compiled etcd process before and after this patch, and compare how periodic commit operation affects incoming read transactions.

I wrote 1M keys, and wait until there's no pending writes. Previously, commit for-loop with batch interval acquires writer locks even when there's no pending write. Thus, new read transactions to acquire reader locks were blocked by commit operation holding writer locks. Had to add some sleeps to have larger window between commit operation writer lock acquire and release, so that read transaction can kick in and be blocked.

Then, spawned two separate read transactions concurrently. Before, each read transaction takes about 2.7s. When launched concurrently, the second reader blocks with commit operation holding writer locks, thus taking >4.47s. After this patch, all read transactions take <3.3s.

xiang90 force-pushed the c_r branch 4 times, most recently from 31af6df to 8e8538b Compare February 7, 2018 23:06

xiang90 force-pushed the c_r branch from 8e8538b to 84909fb Compare February 8, 2018 05:50

heyitsanthony reviewed Feb 8, 2018

View reviewed changes

xiang90 force-pushed the c_r branch 5 times, most recently from 8ef182b to ff50041 Compare February 8, 2018 07:11

xiang90 force-pushed the c_r branch from ff50041 to e50aacb Compare February 8, 2018 18:15

gyuho added the kind/need-release-note-v3.4 label Feb 8, 2018

gyuho reviewed Feb 8, 2018

View reviewed changes

mvcc: allow large concurrent reads under light write workload

3ebee21

xiang90 force-pushed the c_r branch from e50aacb to 3ebee21 Compare February 8, 2018 23:57

heyitsanthony approved these changes Feb 9, 2018

View reviewed changes

xiang90 merged commit 0a20767 into etcd-io:master Feb 9, 2018

xiang90 deleted the c_r branch February 9, 2018 19:34

jingyih mentioned this pull request Aug 16, 2019

etcd clientv3 WithCountOnly performence #11036

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mvcc: allow large concurrent reads under light write workload #9296

mvcc: allow large concurrent reads under light write workload #9296

xiang90 commented Feb 7, 2018 •

edited

Loading

xiang90 commented Feb 7, 2018

xiang90 commented Feb 8, 2018

mitake commented Feb 8, 2018

xiang90 commented Feb 8, 2018

heyitsanthony Feb 8, 2018

heyitsanthony Feb 8, 2018

xiang90 Feb 8, 2018

codecov-io commented Feb 8, 2018 •

edited

Loading

xiang90 commented Feb 8, 2018

gyuho Feb 8, 2018 •

edited

Loading

xiang90 Feb 8, 2018

gyuho commented Feb 9, 2018

hexfusion commented Feb 9, 2018 via email •

edited

Loading

heyitsanthony left a comment

gyuho commented Feb 9, 2018 •

edited

Loading

gyuho commented Feb 9, 2018

mvcc: allow large concurrent reads under light write workload #9296

mvcc: allow large concurrent reads under light write workload #9296

Conversation

xiang90 commented Feb 7, 2018 • edited Loading

xiang90 commented Feb 7, 2018

xiang90 commented Feb 8, 2018

mitake commented Feb 8, 2018

xiang90 commented Feb 8, 2018

heyitsanthony Feb 8, 2018

Choose a reason for hiding this comment

heyitsanthony Feb 8, 2018

Choose a reason for hiding this comment

xiang90 Feb 8, 2018

Choose a reason for hiding this comment

codecov-io commented Feb 8, 2018 • edited Loading

Codecov Report

xiang90 commented Feb 8, 2018

gyuho Feb 8, 2018 • edited Loading

Choose a reason for hiding this comment

xiang90 Feb 8, 2018

Choose a reason for hiding this comment

gyuho commented Feb 9, 2018

hexfusion commented Feb 9, 2018 via email • edited Loading

heyitsanthony left a comment

Choose a reason for hiding this comment

gyuho commented Feb 9, 2018 • edited Loading

gyuho commented Feb 9, 2018

xiang90 commented Feb 7, 2018 •

edited

Loading

codecov-io commented Feb 8, 2018 •

edited

Loading

gyuho Feb 8, 2018 •

edited

Loading

hexfusion commented Feb 9, 2018 via email •

edited

Loading

gyuho commented Feb 9, 2018 •

edited

Loading