Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mvcc: allow large concurrent reads under light write workload #9296

Merged
merged 1 commit into from
Feb 9, 2018

Conversation

xiang90
Copy link
Contributor

@xiang90 xiang90 commented Feb 7, 2018

I found that etcd is unable to process large (take more than 100ms) read requests concurrently even under light or no write workload. We already put some effort to solve the current read problem by adding a writeback buffer in front of the database transaction. So I want to fix this problem.

The root cause of the problem is that although we use Read Lock to allow current read go-routines, but we have another long running go-routine that tries to flush data back to disk every 100ms which requires Write Lock.

Obviously, the write lock has a higher priority that read lock, and will always kick in to block any later read locks. So with this behavior, we basically put a barrier for every 100ms to block all reads until the write lock is acquired by the long running go-routine and then released. The write lock can only be acquired after all previous read locks are released, which can be seconds if the reads are large. Thus we can block reads for seconds even when there is no write.

However, if there is no pending write, we do not need to get the write lock at all.

This PR delays the routine to acquire the write lock only when there are pending writes. So under no or light write conditions, large read can be executed concurrently.

/cc @jpbetz @mitake @heyitsanthony

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 7, 2018

@mitake

The commit routine is also probably the cause of what you observed #9199.

Basically when you make the range size larger, it blocks the flush longer. Thus the writes batch is larger, and write can be more effective. But it is probably not what we want?

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 8, 2018

I formed some new ideas to solve this issue + the problem #9199 tries to solve... so probably wait for my follow up commit :)

@mitake
Copy link
Contributor

mitake commented Feb 8, 2018

@xiang90

The commit routine is also probably the cause of what you observed #9199.

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

 100000 / 100000 Boooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%14s
Summary:
  Total:        14.5203 secs.
  Slowest:      2.6956 secs.
  Fastest:      0.0006 secs.
  Average:      0.0140 secs.
  Stddev:       0.0862 secs.
  Requests/sec: 6886.9280

Response time histogram:
  0.0006 [1]    |
  0.2701 [99799]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.5396 [100]  |
  0.8091 [0]    |
  1.0786 [0]    |
  1.3481 [0]    |
  1.6176 [0]    |
  1.8871 [0]    |
  2.1566 [0]    |
  2.4261 [0]    |
  2.6956 [100]  |

Latency distribution:
  10% in 0.0060 secs.
  25% in 0.0067 secs.
  50% in 0.0089 secs.
  75% in 0.0116 secs.
  90% in 0.0176 secs.
  95% in 0.0292 secs.
  99% in 0.0411 secs.
  99.9% in 2.6756 secs.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

For now, making read txns fine grained in a voluntary manner would be an ad-hoc but reasonable solution, I think.

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 8, 2018

I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.

This PR will only help on large concurrent read request, but not on blocking puts.

Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no tx.Lock() in the reader side). But it makes the compaction process complicated because the read txns cannot indicate their existence. It means we need to have a deferred reclamation mechanism. It will introduce big design update of mvcc pkg.

Yea. We should dedicate large read request to another boltdb read tx after a commit happens after we receive the request instead of using the singleton read tx in current backend which is able to read the recent write in the writeback buffer.

// no need to reset read tx if there is no write.
// call commit to update stats.
if t.batchTx.pending == 0 && !stop {
t.batchTx.commit(stop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit(false)

@@ -231,6 +232,13 @@ func (t *batchTxBuffered) CommitAndStop() {
}

func (t *batchTxBuffered) commit(stop bool) {
// no need to reset read tx if there is no write.
// call commit to update stats.
if t.batchTx.pending == 0 && !stop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this approach isn't working (deadlocking in CI?), it might be easier to move the pending == 0 paths to backend.run() and avoid calling commit() there...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI panics since the tx initialization code goes through here. I am still playing with the code to see what is the easier. Probably I should try what you just suggested.

@codecov-io
Copy link

codecov-io commented Feb 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@f5d02f0). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #9296   +/-   ##
=========================================
  Coverage          ?   75.58%           
=========================================
  Files             ?      365           
  Lines             ?    30695           
  Branches          ?        0           
=========================================
  Hits              ?    23202           
  Misses            ?     5870           
  Partials          ?     1623
Impacted Files Coverage Δ
internal/mvcc/backend/backend.go 80% <100%> (ø)
internal/mvcc/backend/batch_tx.go 78.22% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f5d02f0...3ebee21. Read the comment docs.

@xiang90
Copy link
Contributor Author

xiang90 commented Feb 8, 2018

/cc @heyitsanthony can you take another look? change to what you suggested.

b.batchTx.Commit()
b.batchTx.Lock()
pending := b.batchTx.pending
b.batchTx.Unlock()
Copy link
Contributor

@gyuho gyuho Feb 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q. Doesn't b.batchTx.Unlock already reset pending count?

  1. pending := b.(*batchTxBuffered).pending, where pending is 20k and batchLimit is 10k
  2. Goes to b.(*batchTxBuffered).Unlock()
  3. t.pending >= t.backend.batchLimit
  4. (*batchTxBuffered).commit(false)
  5. (*batchTx).commit(false)
  6. (*batchTx).pending = 0

Shouldn't this be b.batchTx.batchTx.Unlock()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. yea. it should be the lock of the mutex.

@gyuho
Copy link
Contributor

gyuho commented Feb 9, 2018

Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on backend.readTx. If @mitake can cross-check on this, it would be great.

@hexfusion
Copy link
Contributor

hexfusion commented Feb 9, 2018 via email

Copy link
Contributor

@heyitsanthony heyitsanthony left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@gyuho
Copy link
Contributor

gyuho commented Feb 9, 2018

// if no pending writes, should not block concurrent read transactions
//
// without https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).Commit()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock
//  3. (*backend).(*readTx).RWMutex.Lock
//    - blocks long running read transaction
//
// with https://github.com/coreos/etcd/pull/9296
//  1. (*backend).(*batchTxBuffered).safePending()
//  2. (*backend).(*batchTxBuffered).Mutex.Lock

/*
go test -v -run TestRead
*/

func TestReadBlockingWriteCommit(t *testing.T)    { testRead(t, true) }
func TestReadNonBlockingWriteCommit(t *testing.T) { testRead(t, false) }
func testRead(t *testing.T, block bool) {
	keyN := 1000000

	// trigger commit manually, by giving large batch
	// interval and limit
	b, tmppath := NewTmpBackend(time.Hour, 2*keyN)
	defer b.Close()
	defer os.Remove(tmppath)

	b.batchTx.UnsafeCreateBucket([]byte("key"))

	for i := 0; i < keyN; i++ {
		v := make([]byte, 3*1024)
		rand.Read(v)
		b.batchTx.Lock()
		b.batchTx.UnsafePut([]byte("key"), []byte(fmt.Sprintf("foo%10d", i)), v)
		b.batchTx.Unlock()
	}
	// commits all pending writes
	b.batchTx.Commit()
	fmt.Println("pending:", b.batchTx.pending)

	// for long read transactions
	rangeStart, rangeEnd := []byte(fmt.Sprintf("foo%10d", 0)), []byte(fmt.Sprintf("foo%10d", keyN))

	readFunc := func(name string) {
		log.Printf("read %q waiting for Lock", name)

		now := time.Now()
		b.ReadTx().Lock()
		log.Printf("read %q acquired Lock", name)
		rn := time.Now()
		aa, bb := b.ReadTx().UnsafeRange([]byte("key"), rangeStart, rangeEnd, 0)
		b.ReadTx().Unlock()

		log.Printf("read %q finished (took %v, range took %v) %d %d", name, time.Since(now), time.Since(rn), len(aa), len(bb))
	}

	total := time.Now()
	donec := make(chan struct{})

	go func() {
		readFunc("A")
		donec <- struct{}{}
	}()

	go func() {
		time.Sleep(time.Second)
		log.Println("commit starts")

		now := time.Now()
		if block {
			b.batchTx.Commit()
		} else {
			if b.batchTx.safePending() != 0 {
				b.batchTx.Commit()
			}
		}

		log.Printf("commit finished (took %v)", time.Since(now))
		donec <- struct{}{}
	}()

	time.Sleep(2 * time.Second)
	readFunc("B")

	<-donec
	<-donec

	fmt.Println("total:", time.Since(total))
}

func (t *batchTx) safePending() int {
	t.Mutex.Lock()
	defer t.Mutex.Unlock()
	return t.pending
}

LGTM

@xiang90 xiang90 merged commit 0a20767 into etcd-io:master Feb 9, 2018
@xiang90 xiang90 deleted the c_r branch February 9, 2018 19:34
@gyuho
Copy link
Contributor

gyuho commented Feb 9, 2018

Instead, compared two compiled etcd process before and after this patch, and compare how periodic commit operation affects incoming read transactions.

I wrote 1M keys, and wait until there's no pending writes. Previously, commit for-loop with batch interval acquires writer locks even when there's no pending write. Thus, new read transactions to acquire reader locks were blocked by commit operation holding writer locks. Had to add some sleeps to have larger window between commit operation writer lock acquire and release, so that read transaction can kick in and be blocked.

Then, spawned two separate read transactions concurrently. Before, each read transaction takes about 2.7s. When launched concurrently, the second reader blocks with commit operation holding writer locks, thus taking >4.47s. After this patch, all read transactions take <3.3s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants