-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mvcc: allow large concurrent reads under light write workload #9296
Conversation
31af6df
to
8e8538b
Compare
I formed some new ideas to solve this issue + the problem #9199 tries to solve... so probably wait for my follow up commit :) |
I evaluated this branch with the benchmark and the result was similar to the master branch for now. Of course I'll wait your follow up commits.
Probably the ideal solution for reducing the blocking of read txns would be letting the readers deal with only snapshots and making them lock free (completely no For now, making read txns fine grained in a voluntary manner would be an ad-hoc but reasonable solution, I think. |
This PR will only help on large concurrent read request, but not on blocking puts.
Yea. We should dedicate large read request to another boltdb read tx after a commit happens after we receive the request instead of using the singleton read tx in current backend which is able to read the recent write in the writeback buffer. |
internal/mvcc/backend/batch_tx.go
Outdated
// no need to reset read tx if there is no write. | ||
// call commit to update stats. | ||
if t.batchTx.pending == 0 && !stop { | ||
t.batchTx.commit(stop) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit(false)
internal/mvcc/backend/batch_tx.go
Outdated
@@ -231,6 +232,13 @@ func (t *batchTxBuffered) CommitAndStop() { | |||
} | |||
|
|||
func (t *batchTxBuffered) commit(stop bool) { | |||
// no need to reset read tx if there is no write. | |||
// call commit to update stats. | |||
if t.batchTx.pending == 0 && !stop { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this approach isn't working (deadlocking in CI?), it might be easier to move the pending == 0
paths to backend.run() and avoid calling commit() there...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CI panics since the tx initialization code goes through here. I am still playing with the code to see what is the easier. Probably I should try what you just suggested.
8ef182b
to
ff50041
Compare
Codecov Report
@@ Coverage Diff @@
## master #9296 +/- ##
=========================================
Coverage ? 75.58%
=========================================
Files ? 365
Lines ? 30695
Branches ? 0
=========================================
Hits ? 23202
Misses ? 5870
Partials ? 1623
Continue to review full report at Codecov.
|
/cc @heyitsanthony can you take another look? change to what you suggested. |
internal/mvcc/backend/backend.go
Outdated
b.batchTx.Commit() | ||
b.batchTx.Lock() | ||
pending := b.batchTx.pending | ||
b.batchTx.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q. Doesn't b.batchTx.Unlock
already reset pending
count?
pending := b.(*batchTxBuffered).pending
, wherepending
is 20k andbatchLimit
is 10k- Goes to
b.(*batchTxBuffered).Unlock()
t.pending >= t.backend.batchLimit
(*batchTxBuffered).commit(false)
(*batchTx).commit(false)
(*batchTx).pending = 0
Shouldn't this be b.batchTx.batchTx.Unlock()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh. yea. it should be the lock of the mutex.
Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on |
Approach looks good. I will try benchmarking this code path by issuing two overlapping readers contending on backend.readTx.
@gyuho would you mind outlining this process in a little detail? Thanks!
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
// if no pending writes, should not block concurrent read transactions
//
// without https://github.com/coreos/etcd/pull/9296
// 1. (*backend).(*batchTxBuffered).Commit()
// 2. (*backend).(*batchTxBuffered).Mutex.Lock
// 3. (*backend).(*readTx).RWMutex.Lock
// - blocks long running read transaction
//
// with https://github.com/coreos/etcd/pull/9296
// 1. (*backend).(*batchTxBuffered).safePending()
// 2. (*backend).(*batchTxBuffered).Mutex.Lock
/*
go test -v -run TestRead
*/
func TestReadBlockingWriteCommit(t *testing.T) { testRead(t, true) }
func TestReadNonBlockingWriteCommit(t *testing.T) { testRead(t, false) }
func testRead(t *testing.T, block bool) {
keyN := 1000000
// trigger commit manually, by giving large batch
// interval and limit
b, tmppath := NewTmpBackend(time.Hour, 2*keyN)
defer b.Close()
defer os.Remove(tmppath)
b.batchTx.UnsafeCreateBucket([]byte("key"))
for i := 0; i < keyN; i++ {
v := make([]byte, 3*1024)
rand.Read(v)
b.batchTx.Lock()
b.batchTx.UnsafePut([]byte("key"), []byte(fmt.Sprintf("foo%10d", i)), v)
b.batchTx.Unlock()
}
// commits all pending writes
b.batchTx.Commit()
fmt.Println("pending:", b.batchTx.pending)
// for long read transactions
rangeStart, rangeEnd := []byte(fmt.Sprintf("foo%10d", 0)), []byte(fmt.Sprintf("foo%10d", keyN))
readFunc := func(name string) {
log.Printf("read %q waiting for Lock", name)
now := time.Now()
b.ReadTx().Lock()
log.Printf("read %q acquired Lock", name)
rn := time.Now()
aa, bb := b.ReadTx().UnsafeRange([]byte("key"), rangeStart, rangeEnd, 0)
b.ReadTx().Unlock()
log.Printf("read %q finished (took %v, range took %v) %d %d", name, time.Since(now), time.Since(rn), len(aa), len(bb))
}
total := time.Now()
donec := make(chan struct{})
go func() {
readFunc("A")
donec <- struct{}{}
}()
go func() {
time.Sleep(time.Second)
log.Println("commit starts")
now := time.Now()
if block {
b.batchTx.Commit()
} else {
if b.batchTx.safePending() != 0 {
b.batchTx.Commit()
}
}
log.Printf("commit finished (took %v)", time.Since(now))
donec <- struct{}{}
}()
time.Sleep(2 * time.Second)
readFunc("B")
<-donec
<-donec
fmt.Println("total:", time.Since(total))
}
func (t *batchTx) safePending() int {
t.Mutex.Lock()
defer t.Mutex.Unlock()
return t.pending
} LGTM |
Instead, compared two compiled etcd process before and after this patch, and compare how periodic commit operation affects incoming read transactions. I wrote 1M keys, and wait until there's no pending writes. Previously, commit for-loop with batch interval acquires writer locks even when there's no pending write. Thus, new read transactions to acquire reader locks were blocked by commit operation holding writer locks. Had to add some sleeps to have larger window between commit operation writer lock acquire and release, so that read transaction can kick in and be blocked. Then, spawned two separate read transactions concurrently. Before, each read transaction takes about 2.7s. When launched concurrently, the second reader blocks with commit operation holding writer locks, thus taking >4.47s. After this patch, all read transactions take <3.3s. |
I found that etcd is unable to process large (take more than 100ms) read requests concurrently even under light or no write workload. We already put some effort to solve the current read problem by adding a writeback buffer in front of the database transaction. So I want to fix this problem.
The root cause of the problem is that although we use Read Lock to allow current read go-routines, but we have another long running go-routine that tries to flush data back to disk every 100ms which requires Write Lock.
Obviously, the write lock has a higher priority that read lock, and will always kick in to block any later read locks. So with this behavior, we basically put a barrier for every 100ms to block all reads until the write lock is acquired by the long running go-routine and then released. The write lock can only be acquired after all previous read locks are released, which can be seconds if the reads are large. Thus we can block reads for seconds even when there is no write.
However, if there is no pending write, we do not need to get the write lock at all.
This PR delays the routine to acquire the write lock only when there are pending writes. So under no or light write conditions, large read can be executed concurrently.
/cc @jpbetz @mitake @heyitsanthony