Dramatic drop in sequential read performance in main compared to tag v1.3.8 #720
Comments
Would you mind using the same benchmark tool for the test?
@fuweid It is. I am just running a bash script to repeatedly run and collect everything.
But the outputs are different. That's why I'm asking you to use the same version.
I'm not sure I understand. The output of the … Would you like me to attach the CSV?
Thanks for comparing the performance. My immediate feeling is that it might be caused by the bench tool itself; there are some changes to the tool in the main branch. Please use the same tool, as @fuweid suggested. You can build a binary from either 1.3.8 or main; afterwards, use the same binary to run the benchmark against both the 1.3.8 and main branches.
So you'd like me to write a separate program and benchmark twice, once with each library version? Just want to make sure I understand the ask correctly :)
Sorry for the confusion. This is a silly suggestion. The tool is bound to the library, so there is no easy way to separate them. The release-1.3 and main branches also have big differences in terms of implementation, so it is also hard to simply copy the bench tool's source code from one branch to another. Will take a closer look later.
@ambaxter Could you try https://github.com/ahrtr/bbolt_bench? Thanks
The performance difference comes from the AddCompletedOps function. Swapping out …
I reran all of my performance metrics with the atomic add removed. There is still about a 10-15% difference in sequential read and sequential write performance between the original v1.3.8 bench command and the one in main.
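For context, here is a minimal standalone sketch (hypothetical code, not the bench tool's source, and assuming AddCompletedOps is essentially an atomic add, as the profiles below suggest) of why a per-operation atomic add in the hot loop hurts: the first loop does one atomic read-modify-write per key, while the second accumulates in a local counter and publishes it once.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// benchResults mimics a results collector whose AddCompletedOps is assumed
// to be a thin wrapper around an atomic add.
type benchResults struct {
	completedOps int64
}

func (r *benchResults) AddCompletedOps(n int64) {
	atomic.AddInt64(&r.completedOps, n)
}

func main() {
	const iterations = 50_000_000
	var res benchResults

	// One atomic read-modify-write per iteration, inside the hot loop.
	start := time.Now()
	for i := 0; i < iterations; i++ {
		res.AddCompletedOps(1)
	}
	fmt.Println("per-op atomic add: ", time.Since(start))

	// Accumulate locally, publish once per batch (here: once per run).
	res.completedOps = 0
	var local int64
	start = time.Now()
	for i := 0; i < iterations; i++ {
		local++
	}
	res.AddCompletedOps(local)
	fmt.Println("batched atomic add:", time.Since(start))
}
```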
Thanks for the info. I won't have time to dig into it until sometime next week. Could anyone evaluate whether or not we should fix the bench tool? Or is there any potential reason for the performance regression? cc @tjungblu @fuweid @ivanvc @ambaxter. I just updated the readme of bbolt_bench; I had missed one command …
Is this something specific to your Mac? Is that one of those newer ARM machines? ;-) But we can certainly move the increment to the bottom of the loop:
@tjungblu, I don't think it's platform/OS specific. I'm running Linux amd64 and seeing the same behavior. I also applied @ambaxter's suggestion to restore the performance, but I'm seeing the same numbers. I tried moving the increment to the end of the loop (see the diff below), but I'm getting the same result.

```diff
@@ -1383,7 +1383,6 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
c := tx.Bucket(benchBucketName).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
numReads++
- results.AddCompletedOps(1)
if v == nil {
return ErrInvalidValue
}
@@ -1397,6 +1396,7 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
if time.Since(t) >= time.Second {
break
}
+ results.AddCompletedOps(numReads)
}
```
Nah, mine is an x86_64 machine :D I'm working on testing everything on my beefy Linux box once I get the latest version of Go installed. Edit: I see the same behavior as @ivanvc. There's still a difference on my x86_64 Linux machine, but modifying the AddCompletedOps function doesn't restore performance like it does on my laptop.
Updated. ENV: 16 vCPUs, 64 GiB memory, 5.15.0-1060-azure, x86_64. From main with GOGC=off:
From v1.3.8 with GOGC=off:
It's random. Not sure it's related to IO. Sometimes main is better than 1.3.8, but it's close. For main, the CPU profile looks like:
For v1.3.8:

```
$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode r -cpuprofile ./138.cpu
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 66798199 requests, 66764418/s
# Write 100000(ops) 118.611014ms (1.186µs/op) (843170 op/sec)
# Read 66800000(ops) 1.000721611s (14ns/op) (71428571 op/sec)
$ go tool pprof ./138.cpu
File: bbolt_bench
Type: cpu
Time: Apr 12, 2024 at 6:42am (UTC)
Duration: 1.10s, Total samples = 1s (90.66%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 890ms, 89.00% of 1000ms total
Showing top 10 nodes out of 28
flat flat% sum% cum cum%
170ms 17.00% 17.00% 840ms 84.00% go.etcd.io/bbolt.(*Cursor).Next
140ms 14.00% 31.00% 140ms 14.00% go.etcd.io/bbolt.unsafeByteSlice (inline)
130ms 13.00% 44.00% 670ms 67.00% go.etcd.io/bbolt.(*Cursor).next
100ms 10.00% 54.00% 100ms 10.00% main.(*BenchResults).AddCompletedOps
80ms 8.00% 62.00% 310ms 31.00% go.etcd.io/bbolt.(*Cursor).keyValue
60ms 6.00% 68.00% 60ms 6.00% go.etcd.io/bbolt.(*elemRef).count (inline)
60ms 6.00% 74.00% 60ms 6.00% go.etcd.io/bbolt.(*elemRef).isLeaf (inline)
60ms 6.00% 80.00% 1000ms 100% main.(*benchCommand).runReads.(*benchCommand).runReadsSequential.func2
50ms 5.00% 85.00% 190ms 19.00% go.etcd.io/bbolt.(*Cursor).goToFirstElementOnTheStack
40ms 4.00% 89.00% 80ms 8.00% go.etcd.io/bbolt.(*leafPageElement).key (inline)
(pprof)
```
I realized that my previous attempt to apply the suggestion was incorrect. This patch seems to restore the performance:

```diff
index a5a4e9f..6526767 100644
--- a/cmd/bbolt/main.go
+++ b/cmd/bbolt/main.go
@@ -1383,7 +1383,6 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
c := tx.Bucket(benchBucketName).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
numReads++
- results.AddCompletedOps(1)
if v == nil {
return ErrInvalidValue
}
@@ -1393,6 +1392,7 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
return fmt.Errorf("read seq: iter mismatch: expected %d, got %d", options.Iterations, numReads)
}
+ results.AddCompletedOps(numReads)
// Make sure we do this for at least a second.
if time.Since(t) >= time.Second {
break
```

Can someone else verify? @fuweid, maybe while you're checking it yourself.
@ivanvc I didn't change the bbolt_bench code. I reran it on a new box and I think it's random. Not sure it's related to the local env. REF: #720 (comment)
I mean, it makes sense that n atomic operations are slower than one :)
I opened PR #721, which addresses the read performance benchmark drop, following @tjungblu's suggestion. After changing it, I see consistent read numbers that match release 1.3's. However, I still see a drop in write performance compared to 1.3. I initially applied the same suggestion but didn't see an improvement.
Quickly tried to confirm. Indeed, the read performance seems to be slower because of the atomic increment; batching it restores the v1.3.8 numbers. The write performance, however, does not get fixed by reducing the number of atomic increments.
main + "atomic batching"
An interesting case here is that runtime.getMCache is the one contributing to the loss of performance. EDIT: it's actually not entirely clear; every run has slightly different top contributors, e.g.
Thanks @ambaxter for the finding, and thanks all for the testing & analysis. Confirmed the issue
Proposed solution
For the first issue, we can add a field …
We need to update all public methods in a similar way. At least we should do it for Put, Delete, etc., which are most likely to be called frequently.
yes, please
Hi @ahrtr, I've been reviewing the proposed solution for the write slowdown. One part that is unclear to me is how … (lines 47 to 54 in b005c0c).
However, if we use the same logger as the database, it will still have a slowdown. When initializing the database, if the received logger is … (lines 200 to 204 in b005c0c).
So, what I was thinking is that instead of adding a new logger to the …, we could do something like:

```go
func (b *Bucket) Put(key []byte, value []byte) (err error) {
lg := b.tx.db.Logger()
if lg != discardLogger {
lg.Debugf("Putting key %q", string(key))
defer func() {
if err != nil {
lg.Errorf("Putting key %q failed: %v", string(key), err)
} else {
lg.Debugf("Putting key %q successfully", string(key))
}
}()
}
```

This way, there's no change to the …

```go
db, err := bolt.Open(options.Path, 0600, &bolt.Options{Logger: nil})
```

After applying this change, I confirmed that the write performance matches 1.3's.
Technically it works, but it looks a little strange. The purpose of introducing … We could change lines 200 to 204 in b005c0c
to
… and still follow my proposal above. What do you think?
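Roughly, the combined idea might look like the following minimal sketch (hypothetical, simplified types rather than the actual bbolt code): Open stores a non-nil discard logger when none is configured, and hot methods skip all logging work with a single comparison.

```go
// Hypothetical sketch of the initialization change being discussed
// (simplified types, not the actual bbolt source).
package main

type Logger interface {
	Debugf(format string, args ...interface{})
}

type discardLoggerT struct{}

func (discardLoggerT) Debugf(string, ...interface{}) {}

// discardLogger is a non-nil sentinel stored when the caller passes no logger.
var discardLogger Logger = discardLoggerT{}

type Options struct{ Logger Logger }

type DB struct{ logger Logger }

func open(o *Options) *DB {
	db := &DB{logger: discardLogger} // never nil
	if o != nil && o.Logger != nil {
		db.logger = o.Logger
	}
	return db
}

// A hot-path method then skips all logging work with one comparison.
func (db *DB) put(key, value []byte) {
	if lg := db.logger; lg != discardLogger {
		lg.Debugf("Putting key %q", string(key))
	}
	// ... actual write path elided ...
}

func main() {
	db := open(nil) // no logger configured
	db.put([]byte("k"), []byte("v"))
}
```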
Makes sense. I'll open a PR with this implementation. IMHO, one thing we'll need to address later is that the logger should not significantly impact performance. We could consider switching to …
bbolt just defines a Logger API; etcd passes in a logger instance which implements the interface. Actually, what etcd passes in is already an Uber zap instance.
Is the performance issue coming from the formatting or the additional indirection of the function call?
Looks like the whole thing should just be entirely removed by the compiler, given debug is false (which it is). I've tried to help the compiler a little by just removing all log-related statements in Put:
So even when we remove all logs, we're still missing a good chunk of performance towards … Okay, edit: after bisecting this with git, it's indeed pointing to #646.
I've been briefly going through the objdump of those methods … that's with these statements entirely removed … Just naively diffing, those are the additional instructions, which kind of explains the profile outputs above as well:
(Sorry, the line numbers are shifted by 4.) Wonder if we could get those compiled away entirely if we set up the logger slightly differently; will continue tomorrow unless Ivan catches it earlier :)
@tjungblu, I'm just reading this. I tried removing the formatting too, without luck. I also tried creating a Logger interface with empty functions, but the gain, as you may have found, was marginal. The only solution that worked is what @ahrtr suggested. As I mentioned in the PR, it's not pretty, but it restores the original performance. However, I didn't go all the way to debug the compiled code, so your analysis is more complete than mine, but we have similar results.
Yep, which is odd. The Go compiler does know how to devirtualize and inline things; I just can't get it to tell me why it won't do it in this scenario. The nil-pointer trick indeed works, even though it's quite annoying to have all those nil checks. Another option could be to have an "enabled" flag, similar to how klog works:
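Roughly, the idea is a plain bool decided once at open time and checked at every call site, so that when logging is disabled neither string(key) nor the ...interface{} argument slice is ever built. A hypothetical sketch (not the actual bbolt or klog code):

```go
// Hypothetical sketch of an "enabled" guard flag (not the actual bbolt code).
type Logger interface {
	Debugf(format string, args ...interface{})
}

type guardedDB struct {
	logger     Logger // whatever implementation the caller provided
	logEnabled bool   // decided once when the DB is opened
}

func (db *guardedDB) put(key, value []byte) {
	// Checked at the call site, before string(key) and the ...interface{}
	// slice for Debugf are built, so disabled logging costs one branch.
	if db.logEnabled {
		db.logger.Debugf("Putting key %q", string(key))
	}
	// ... write path ...
}
```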
I'll benchmark a bit more...
TL;DR
The performance regression (around 9%) should be caused by type conversion, e.g. from … (see the sketch below).
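As a rough, hypothetical illustration of the kind of cost meant here (the exact conversions are elided above), the micro-benchmark below compares a guarded call with an unguarded Debugf call through a Logger interface: even when the logger discards everything, the unguarded call site still converts the []byte key to a string and boxes it into the variadic ...interface{} slice.

```go
package logcost

import "testing"

type Logger interface {
	Debugf(format string, args ...interface{})
}

type discard struct{}

func (discard) Debugf(string, ...interface{}) {}

var (
	lg      Logger = discard{}
	enabled        = false // package-level so the compiler cannot fold it away
	key            = []byte("0000000000000123")
)

// Guarded: when logging is disabled, neither string(key) nor the
// ...interface{} slice is ever constructed.
func BenchmarkGuardedDebugf(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if enabled {
			lg.Debugf("Putting key %q", string(key))
		}
	}
}

// Unguarded: the call goes through the interface, so the arguments are
// built (and typically heap-allocated) on every iteration.
func BenchmarkUnguardedDebugf(b *testing.B) {
	for i := 0; i < b.N; i++ {
		lg.Debugf("Putting key %q", string(key))
	}
}
```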
objdump
For simplicity, I removed the defer function, so we only print one debug log. We can still reproduce the issue (write performance regression).
Run the commands below:
If we change the line to the one below, we can't reproduce the issue anymore. It has the same write performance as release-1.3.
Its objdump is as below. Obviously it's much simpler; it doesn't execute the operations for type conversion, e.g. …
Solution
Solution 1
One solution is to check …
Good side:
Bad side:
We can enhance this solution based on @tjungblu's comment above, something like …
Solution 2
The second solution is to only optimize the most frequently called methods, e.g. …
Good side:
Bad side:
Solution 3
Do not make any change, but document the performance impact.
Good side:
Bad side:
bbolt is a library used by other applications; the performance regression might be negligible in an application's E2E tests or production environment. Can we compare the performance in etcd, e.g. …? @ambaxter @fuweid @ivanvc @tjungblu WDYT? Probably we can go for solution 1.
I believe what happens with
with
on the other hand, is just completely removed from your code, I reckon because the debug flag is off? Here the compiler seems to get it :)
Given the Go compiler can't properly optimize it away, there must be some performance reduction. It might not be noticeable in etcd because of the raft latency and the async writing. I think Ivan's approach with the nil pointer is fine, but we need to be sure we don't NPE anywhere.
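On the NPE point, a tiny hypothetical example (simplified types, not bbolt's) of what has to be guarded if nil is used to mean "logging disabled": calling a method through a nil interface panics, so every call site needs the check.

```go
package main

type Logger interface {
	Debugf(format string, args ...interface{})
}

type db struct {
	logger Logger // nil means "logging disabled" in this sketch
}

func main() {
	d := &db{} // no logger configured, field stays nil

	if d.logger != nil { // guarded call: fine, simply skipped
		d.logger.Debugf("guarded")
	}

	d.logger.Debugf("unguarded") // panics: nil pointer dereference
}
```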
As @ahrtr found out in my PR, finding every instance of a call to the logger is exhausting. It may be better to go with option two and skip logging only in the most impactful functions. As for the etcd benchmarks, I've been running them since yesterday; I'll publish the results tomorrow as soon as they finish (but it seems like @tjungblu is correct, and the impact is negligible).
Thanks @ivanvc. If I understood it correctly, when the R/W ratio increases, the performance difference decreases. It aligns with our test with … Based on all the discussion & tests so far, we may want to go for solution 2.
Opened PR #741, which implements solution 2.
Please feel free to reopen this issue if anyone can still reproduce it.
Admittedly this is with the synthetic bench command and I only ran
go build
so perhaps I'm missing some additional configuration somewhere. There is also a minor drop in write performance.
In tag v1.3.8
In main branch
I've done a bunch of performance testing in the last few weeks.
Please excuse the unformatted output from Jupyter.