
Dramatic drop in sequential read performance in main compared to tag v1.3.8 #720

Closed
ambaxter opened this issue Apr 9, 2024 · 41 comments

@ambaxter
Contributor

ambaxter commented Apr 9, 2024

Admittedly this is with the synthetic bench command, and I only ran go build, so perhaps I'm missing some additional configuration somewhere.

There is also a minor drop in write performance.

In tag v1.3.8

./cmd/bbolt/bbolt bench -profile-mode n -count 100000 -batch-size 25000
# Write 409.879168ms    (4.098µs/op)    (244021 op/sec)
# Read  1.000862234s    (21ns/op)       (47619047 op/sec)

In main branch

./bin/bbolt bench -profile-mode n -count 100000 -batch-size 25000
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 35766291 requests, 35758553/s 
# Write 100000(ops)     463.589694ms    (4.635µs/op)    (215749 op/sec)
# Read  35800000(ops)   1.001337296s    (27ns/op)       (37037037 op/sec)

I've done a bunch of performance testing in the last few weeks.

[chart: go_main_read]

Please excuse the unformatted output from Jupyter

write mode: seq
runtime: go
batch
5000      49047618.8
10000     50000000.0
25000     50000000.0
50000     50000000.0
100000    50000000.0
Name: read_ops/s, dtype: float64
runtime: go_main
batch
5000      32870577.0
10000     31704545.2
25000     33118279.2
50000     33333333.0
100000    32903225.4
Name: read_ops/s, dtype: float64
write mode: rnd
runtime: go
batch
5000     52631578.0
10000    52631578.0
25000    52631578.0
50000    52631578.0
Name: read_ops/s, dtype: float64
runtime: go_main
batch
5000     34269293.4
10000    34252873.0
25000    33793103.0
50000    34022988.0
Name: read_ops/s, dtype: float64
write mode: seq-nest
runtime: go
batch
5000      49523809.4
10000     47619047.0
25000     48095237.6
50000     48095237.6
100000    47619047.0
Name: read_ops/s, dtype: float64
runtime: go_main
batch
5000      36772486.6
10000     37037037.0
25000     37037037.0
50000     37037037.0
100000    37606837.4
Name: read_ops/s, dtype: float64
write mode: rnd-nest
runtime: go
batch
5000     48095237.6
10000    47619047.0
25000    47186146.6
50000    48571428.2
Name: read_ops/s, dtype: float64
runtime: go_main
batch
5000     37037037.0
10000    37037037.0
25000    37037037.0
50000    37037037.0
Name: read_ops/s, dtype: float64

[chart: go_main_write]

write mode: seq
runtime: go
batch
5000       88516.2
10000     148311.1
25000     231671.3
50000     292446.1
100000    304933.7
Name: write_ops/s, dtype: float64
runtime: go_main
batch
5000       85349.7
10000     133437.3
25000     214194.3
50000     249503.8
100000    249149.5
Name: write_ops/s, dtype: float64
write mode: rnd
runtime: go
batch
5000      62147.4
10000    105808.8
25000     94796.9
50000     25114.0
Name: write_ops/s, dtype: float64
runtime: go_main
batch
5000     58260.9
10000    99027.5
25000    87753.0
50000    23536.2
Name: write_ops/s, dtype: float64
write mode: seq-nest
runtime: go
batch
5000       90155.7
10000     156581.2
25000     262004.9
50000     314087.4
100000    296105.9
Name: write_ops/s, dtype: float64
runtime: go_main
batch
5000       88903.5
10000     147617.1
25000     237686.2
50000     286197.9
100000    271294.7
Name: write_ops/s, dtype: float64
write mode: rnd-nest
runtime: go
batch
5000     75666.0
10000    87629.4
25000    38098.6
50000    13132.4
Name: write_ops/s, dtype: float64
runtime: go_main
batch
5000     74473.3
10000    85192.2
25000    38540.1
50000    13496.1
Name: write_ops/s, dtype: float64
@fuweid
Member

fuweid commented Apr 9, 2024

Would you mind using the same benchmark tool for the test?

@ambaxter
Contributor Author

ambaxter commented Apr 9, 2024

Would you mind using the same benchmark tool for the test?

@fuweid It is. I am just running a bash script to run it repeatedly and collect everything.

@fuweid
Member

fuweid commented Apr 9, 2024

Would you mind using the same benchmark tool for the test?

@fuweid It is. I am just running a bash script to run it repeatedly and collect everything.

But the outputs are different. That's why I'm asking you to use the same version.

@ambaxter
Contributor Author

ambaxter commented Apr 9, 2024

Would you mind using the same benchmark tool for the test?

@fuweid It is. I am just running a bash script to run it repeatedly and collect everything.

But the outputs are different. That's why I'm asking you to use the same version.

I'm not sure I understand. The output of the bbolt bench command changed between v1.3.8 and main. My script just gathers the op/s field and turns it into a CSV.

Would you like me to attach the CSV?

@ahrtr
Member

ahrtr commented Apr 9, 2024

Thanks for comparing the performance. My immediate feeling is that it might be caused by the bench tool itself; there are some changes to the tool in the main branch. Please use the same tool, as @fuweid suggested. You can build a binary using either 1.3.8 or main; afterwards, use the same binary to run the benchmark for both the 1.3.8 and main branches.

@ambaxter
Contributor Author

ambaxter commented Apr 9, 2024

So you'd like me to write a separate program and benchmark twice, once with each library version? Just want to make sure I understand the ask correctly :)

@ahrtr
Member

ahrtr commented Apr 9, 2024

You can build a binary using either 1.3.8 or main; afterwards, use the same binary to run the benchmark for both the 1.3.8 and main branches.

Sorry for the confusion. This is a silly suggestion. The tool is bound to the library, so there is no easy way to separate them. The release-1.3 and main branches also have big differences in terms of implementation, so it is hard to simply copy the bench tool's source code from one branch to another.

Will take a closer look later.

@ahrtr
Member

ahrtr commented Apr 9, 2024

@ambaxter Could you try https://github.com/ahrtr/bbolt_bench ? Thanks

@ambaxter
Contributor Author

ambaxter commented Apr 9, 2024

@ambaxter Could you try https://github.com/ahrtr/bbolt_bench ? Thanks

@ahrtr

abaxter@MacBook-Pro bbolt_bench % go get go.etcd.io/bbolt@67165811e57a79678b6fab9b029bc032b9dfef0e

go: downloading go.etcd.io/bbolt v1.4.0-alpha.0.0.20240408210737-67165811e57a
go: downloading golang.org/x/sys v0.19.0
go: downloading golang.org/x/sync v0.7.0
go: upgraded go.etcd.io/bbolt v1.3.9 => v1.4.0-alpha.0.0.20240408210737-67165811e57a
go: upgraded golang.org/x/sys v0.4.0 => v0.19.0
abaxter@MacBook-Pro bbolt_bench % go build
abaxter@MacBook-Pro bbolt_bench % ./bbolt_bench -count 100000 -batch-size 25000
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 35854350 requests, 35816956/s
# Write	100000(ops)	457.862921ms	(4.578µs/op)	(218435 op/sec)
# Read	35900000(ops)	1.002397812s	(27ns/op)	(37037037 op/sec)

abaxter@MacBook-Pro bbolt_bench % make clean
rm -f ./bbolt_bench
git checkout -- .
abaxter@MacBook-Pro bbolt_bench % go get go.etcd.io/[email protected]
go: downgraded go.etcd.io/bbolt v1.3.9 => v1.3.8
abaxter@MacBook-Pro bbolt_bench % go build
abaxter@MacBook-Pro bbolt_bench % ./bbolt_bench -count 100000 -batch-size 25000
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 36154347 requests, 36138134/s
# Write	100000(ops)	436.711169ms	(4.367µs/op)	(228990 op/sec)
# Read	36200000(ops)	1.001959606s	(27ns/op)	(37037037 op/sec)

@ambaxter
Contributor Author

The performance difference comes from the AddCompletedOps function. Swapping out atomic.AddInt64(&r.completedOps, amount) for r.completedOps += amount restores the expected performance metrics.
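
For context, a minimal sketch of the two variants being compared; BenchResults below is a stand-in struct, and only the two method bodies come from the finding above:

package bench

import "sync/atomic"

// Stand-in for the bench tool's results struct.
type BenchResults struct {
	completedOps int64
}

// Variant in main: one atomic read-modify-write per completed operation.
func (r *BenchResults) AddCompletedOps(amount int64) {
	atomic.AddInt64(&r.completedOps, amount)
}

// Plain add that restores the v1.3.8 numbers; only safe while a single
// goroutine owns the counter.
func (r *BenchResults) addCompletedOpsPlain(amount int64) {
	r.completedOps += amount
}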

@ambaxter
Contributor Author

./bin/bbolt bench -write-mode rnd -read-mode seq
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
starting read benchmark.
# Write 1000(ops)       65.604715ms     (65.604µs/op)   (15242 op/sec)
# Read  50406000(ops)   1.000010131s    (19ns/op)       (52631578 op/sec)

@ambaxter
Contributor Author

ambaxter commented Apr 10, 2024

I reran all of my performance tests with the atomic add removed. There's still about a 10-15% performance difference in sequential reads and sequential writes between the original v1.3.8 bench command and the main one.

abaxter@MacBook-Pro bbolt_bench % go get go.etcd.io/[email protected]                                  
go: downgraded go.etcd.io/bbolt v1.4.0-alpha.0.0.20240408210737-67165811e57a => v1.3.8
abaxter@MacBook-Pro bbolt_bench % go build                                                     
abaxter@MacBook-Pro bbolt_bench % ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 45458680 requests, 45452282/s 
# Write 100000(ops)     425.134263ms    (4.251µs/op)    (235238 op/sec)
# Read  45500000(ops)   1.001095171s    (22ns/op)       (45454545 op/sec)
abaxter@MacBook-Pro bbolt_bench % go get go.etcd.io/bbolt@67165811e57a79678b6fab9b029bc032b9dfef0e
go: upgraded go.etcd.io/bbolt v1.3.9 => v1.4.0-alpha.0.0.20240408210737-67165811e57a
go: upgraded golang.org/x/sys v0.4.0 => v0.19.0
abaxter@MacBook-Pro bbolt_bench % go build                                                        
abaxter@MacBook-Pro bbolt_bench % ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n   
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
# Write 100000(ops)     500.429686ms    (5.004µs/op)    (199840 op/sec)
# Read  45800000(ops)   1.00064931s     (21ns/op)       (47619047 op/sec)

[chart: go_main_write_new]

@ahrtr
Member

ahrtr commented Apr 11, 2024

I reran all of my performance tests with the atomic add removed. There's still about a 10-15% performance difference in sequential reads and sequential writes between the original v1.3.8 bench command and the main one.

Thanks for the info. I won't have time to dig into it until sometime next week. Could anyone evaluate whether or not we should fix the bench tool? Or is there any potential reason for the performance drop? cc @tjungblu @fuweid @ivanvc

@ambaxter I just updated the readme of bbolt_bench; I missed one command, "go mod tidy". Sorry for that. Please also ensure your golang version is 1.22.2.

@tjungblu
Contributor

Is this something specific to your Mac? Is that one of those newer ARM machines? ;-)

But we certainly can move the increment to the bottom of the loop:
https://github.com/etcd-io/bbolt/blob/main/cmd/bbolt/main.go#L1484

@ivanvc
Member

ivanvc commented Apr 11, 2024

@tjungblu, I don't think it's platform/OS specific. I'm running Linux AMD64 and seeing the same behavior.

I also tried restoring the performance with @ambaxter's suggestion, but I'm seeing the same numbers.

I tried moving the increment to the end of the loop (see the diff below), but I'm getting the same result.

@@ -1383,7 +1383,6 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
                        c := tx.Bucket(benchBucketName).Cursor()
                        for k, v := c.First(); k != nil; k, v = c.Next() {
                                numReads++
-                               results.AddCompletedOps(1)
                                if v == nil {
                                        return ErrInvalidValue
                                }
@@ -1397,6 +1396,7 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
                        if time.Since(t) >= time.Second {
                                break
                        }
+                       results.AddCompletedOps(numReads)
                }

@ambaxter
Contributor Author

ambaxter commented Apr 11, 2024

Is this something specific to your Mac? Is that one of those newer ARM machines? ;-)

But we certainly can move the increment to the bottom of the loop: https://github.com/etcd-io/bbolt/blob/main/cmd/bbolt/main.go#L1484

Nah, mine is an x86_64 machine :D

I'm working on testing everything on my beefy Linux box once I get the latest version of Go installed.

Edit: I see the same behavior as @ivanvc. There's still a difference on my x86_64 Linux machine, but modifying the AddCompletedOps function doesn't restore performance like it does on my laptop.

@fuweid
Member

fuweid commented Apr 12, 2024

Reproduced it on x86. Checking.

Updated.


ENV: (16 vcpus, 64 GiB memory) 5.15.0-1060-azure x86_64

From main with GOGC=off:

$ go get go.etcd.io/bbolt@67165811e57a79678b6fab9b029bc032b9dfef0e
go: upgraded go.etcd.io/bbolt v1.3.9 => v1.4.0-alpha.0.0.20240408210737-67165811e57a
go: upgraded golang.org/x/sys v0.4.0 => v0.19.0


$ go build .

$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 66084093 requests, 66080544/s
# Write 100000(ops)     131.99567ms     (1.319µs/op)    (758150 op/sec)
# Read  66100000(ops)   1.000631717s    (15ns/op)       (66666666 op/sec)

$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 66979289 requests, 66922679/s
# Write 100000(ops)     129.524857ms    (1.295µs/op)    (772200 op/sec)
# Read  67000000(ops)   1.00127892s     (14ns/op)       (71428571 op/sec)

From v1.3.8 with GOGC=off:

$ go get go.etcd.io/[email protected]
go: downgraded go.etcd.io/bbolt v1.3.9 => v1.3.8
$ go build
$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 62989967 requests, 62943313/s
# Write 100000(ops)     114.549737ms    (1.145µs/op)    (873362 op/sec)
# Read  63000000(ops)   1.001056755s    (15ns/op)       (66666666 op/sec)

$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
# Write 100000(ops)     118.794487ms    (1.187µs/op)    (842459 op/sec)
# Read  63200000(ops)   1.000193823s    (15ns/op)       (66666666 op/sec)

The results are random; I'm not sure whether it's related to IO. Sometimes main is better than 1.3.8, but they're close.

For main, the CPU profile looks like:

$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode r -cpuprofile ./main.cpu
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 66700000 requests, 66653578/s
# Write 100000(ops)     131.557594ms    (1.315µs/op)    (760456 op/sec)
# Read  66700000(ops)   1.000032426s    (14ns/op)       (71428571 op/sec)


$ go tool pprof ./main.cpu
File: bbolt_bench
Type: cpu
Time: Apr 12, 2024 at 6:44am (UTC)
Duration: 1.10s, Total samples = 1s (90.65%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 870ms, 87.00% of 1000ms total
Showing top 10 nodes out of 31
      flat  flat%   sum%        cum   cum%
     210ms 21.00% 21.00%      690ms 69.00%  go.etcd.io/bbolt.(*Cursor).next
     200ms 20.00% 41.00%      890ms 89.00%  go.etcd.io/bbolt.(*Cursor).Next
     140ms 14.00% 55.00%      140ms 14.00%  go.etcd.io/bbolt/internal/common.UnsafeByteSlice (inline)
      80ms  8.00% 63.00%       80ms  8.00%  main.(*BenchResults).AddCompletedOps
      60ms  6.00% 69.00%      250ms 25.00%  go.etcd.io/bbolt.(*Cursor).goToFirstElementOnTheStack
      50ms  5.00% 74.00%       80ms  8.00%  go.etcd.io/bbolt.(*elemRef).isLeaf (inline)
      40ms  4.00% 78.00%       90ms  9.00%  runtime.mallocgc
      30ms  3.00% 81.00%      200ms 20.00%  go.etcd.io/bbolt.(*Cursor).keyValue
      30ms  3.00% 84.00%       40ms  4.00%  go.etcd.io/bbolt.(*elemRef).count (inline)
      30ms  3.00% 87.00%       30ms  3.00%  go.etcd.io/bbolt/internal/common.(*Page).IsLeafPage (inline)
(pprof)

For v1.3.8:

$ GOGC=off ./bbolt_bench -count 100000 -batch-size 25000 -profile-mode r -cpuprofile ./138.cpu
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 66798199 requests, 66764418/s
# Write 100000(ops)     118.611014ms    (1.186µs/op)    (843170 op/sec)
# Read  66800000(ops)   1.000721611s    (14ns/op)       (71428571 op/sec)

$ go tool pprof ./138.cpu
File: bbolt_bench
Type: cpu
Time: Apr 12, 2024 at 6:42am (UTC)
Duration: 1.10s, Total samples = 1s (90.66%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 890ms, 89.00% of 1000ms total
Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
     170ms 17.00% 17.00%      840ms 84.00%  go.etcd.io/bbolt.(*Cursor).Next
     140ms 14.00% 31.00%      140ms 14.00%  go.etcd.io/bbolt.unsafeByteSlice (inline)
     130ms 13.00% 44.00%      670ms 67.00%  go.etcd.io/bbolt.(*Cursor).next
     100ms 10.00% 54.00%      100ms 10.00%  main.(*BenchResults).AddCompletedOps
      80ms  8.00% 62.00%      310ms 31.00%  go.etcd.io/bbolt.(*Cursor).keyValue
      60ms  6.00% 68.00%       60ms  6.00%  go.etcd.io/bbolt.(*elemRef).count (inline)
      60ms  6.00% 74.00%       60ms  6.00%  go.etcd.io/bbolt.(*elemRef).isLeaf (inline)
      60ms  6.00% 80.00%     1000ms   100%  main.(*benchCommand).runReads.(*benchCommand).runReadsSequential.func2
      50ms  5.00% 85.00%      190ms 19.00%  go.etcd.io/bbolt.(*Cursor).goToFirstElementOnTheStack
      40ms  4.00% 89.00%       80ms  8.00%  go.etcd.io/bbolt.(*leafPageElement).key (inline)
(pprof)

@ivanvc
Member

ivanvc commented Apr 12, 2024

I realized that my previous attempt to apply the suggestion was incorrect. This patch seems to restore the performance:

index a5a4e9f..6526767 100644
--- a/cmd/bbolt/main.go
+++ b/cmd/bbolt/main.go
@@ -1383,7 +1383,6 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
                        c := tx.Bucket(benchBucketName).Cursor()
                        for k, v := c.First(); k != nil; k, v = c.Next() {
                                numReads++
-                               results.AddCompletedOps(1)
                                if v == nil {
                                        return ErrInvalidValue
                                }
@@ -1393,6 +1392,7 @@ func (cmd *benchCommand) runReadsSequential(db *bolt.DB, options *BenchOptions,
                                return fmt.Errorf("read seq: iter mismatch: expected %d, got %d", options.Iterations, numReads)
                        }
 
+                       results.AddCompletedOps(numReads)
                        // Make sure we do this for at least a second.
                        if time.Since(t) >= time.Second {
                                break

Can someone else verify? @fuweid, maybe while you're checking it yourself.

@fuweid
Member

fuweid commented Apr 12, 2024

@ivanvc I didn't change the bbolt_bench code. I reran it on a new box, and I think the results are random. Not sure whether it's related to the local env.

REF: #720 (comment)

@tjungblu
Contributor

I mean, it makes sense that n atomic operations are slower than one :)
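
To put a number on that intuition, a hypothetical micro-benchmark (names illustrative, not from the repo) isolating per-op versus batched counting might look like:

package bench

import (
	"sync/atomic"
	"testing"
)

// Models results.AddCompletedOps(1) inside the cursor loop: one atomic
// add per key visited.
func BenchmarkCountPerOp(b *testing.B) {
	var ops int64
	for i := 0; i < b.N; i++ {
		atomic.AddInt64(&ops, 1)
	}
}

// Models the batched alternative: count locally, publish once per pass.
func BenchmarkCountBatched(b *testing.B) {
	var ops, local int64
	for i := 0; i < b.N; i++ {
		local++
	}
	atomic.AddInt64(&ops, local)
}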

@ivanvc
Member

ivanvc commented Apr 13, 2024

I opened PR #721, which addresses the read performance benchmark drop, following @tjungblu's suggestion. After changing it, I see consistent read numbers that match release 1.3's. However, I still see a drop in write performance compared to 1.3. I initially applied the same suggestion but didn't see an improvement.

@tjungblu
Contributor

tjungblu commented Apr 17, 2024

Quickly tried to confirm. Indeed, the read performance seems to be slower because of the atomic increment; batching it restores the v1.3.8 numbers.

The write performance, however, is not fixed by reducing the number of atomic increments.

[tjungblu ~/git/bbolt]$ git status
HEAD detached at v1.3.8
nothing to commit, working tree clean
[tjungblu ~/git/bbolt]$ go build -o bin/bbolt ./cmd/bbolt
[tjungblu ~/git/bbolt]$ GOGC=off bin/bbolt bench -count 100000 -batch-size 25000 -profile-mode n
# Write	50.480517ms	(504ns/op)	(1984126 op/sec)
# Read	1.000011109s	(8ns/op)	(125000000 op/sec)
[tjungblu ~/git/bbolt]$ git status
HEAD detached at 6716581
nothing to commit, working tree clean
[tjungblu ~/git/bbolt]$ make build
go build -o bin/bbolt ./cmd/bbolt

[tjungblu ~/git/bbolt]$ GOGC=off bin/bbolt bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 89190366 requests, 89187621/s 
# Write	100000(ops)	79.24252ms	(792ns/op)	(1262626 op/sec)
# Read	89200000(ops)	1.000304151s	(11ns/op)	(90909090 op/sec)

main + "atomic batching"
https://gist.github.com/tjungblu/2a6f1ce81e4120b4ab30cd9bbf1e7839

[tjungblu ~/git/bbolt]$ GOGC=off bin/bbolt bench -count 100000 -batch-size 25000 -profile-mode n
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
Completed 112400000 requests, 112390184/s 
# Write	100000(ops)	77.184585ms	(771ns/op)	(1297016 op/sec)
# Read	112500000(ops)	1.000507597s	(8ns/op)	(125000000 op/sec)
main + "atomic batching with profile"

[tjungblu ~/git/bbolt]$ GOGC=off bin/bbolt bench -count 100000 -batch-size 25000 -profile-mode w -cpuprofile ./main.cpu
starting write benchmark.
Starting write iteration 0
Finished write iteration 0
Starting write iteration 25000
Finished write iteration 25000
Starting write iteration 50000
Finished write iteration 50000
Starting write iteration 75000
Finished write iteration 75000
starting read benchmark.
# Write	100000(ops)	78.613074ms	(786ns/op)	(1272264 op/sec)
# Read	110100000(ops)	1.000191976s	(9ns/op)	(111111111 op/sec)

[tjungblu ~/git/bbolt]$ go tool pprof ./main.cpu 
File: bbolt
Type: cpu
Time: Apr 17, 2024 at 2:31pm (CEST)
Duration: 201.20ms, Total samples = 80ms (39.76%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 80ms, 100% of 80ms total
Showing top 10 nodes out of 40
      flat  flat%   sum%        cum   cum%
      20ms 25.00% 25.00%       20ms 25.00%  runtime.getMCache (inline)
      10ms 12.50% 37.50%       10ms 12.50%  cmpbody
      10ms 12.50% 50.00%       10ms 12.50%  go.etcd.io/bbolt/internal/common.(*Inode).Key (inline)
      10ms 12.50% 62.50%       10ms 12.50%  runtime.bool2int
      10ms 12.50% 75.00%       10ms 12.50%  runtime.nanotime
      10ms 12.50% 87.50%       10ms 12.50%  runtime.nextFreeFast (inline)
      10ms 12.50%   100%       30ms 37.50%  sort.Search
         0     0%   100%       10ms 12.50%  bytes.Compare (inline)
         0     0%   100%       50ms 62.50%  go.etcd.io/bbolt.(*Bucket).Put
         0     0%   100%       10ms 12.50%  go.etcd.io/bbolt.(*Bucket).Put.func1

The interesting case here is that runtime.getMCache is the one contributing to the loss of performance.

EDIT: it's actually not entirely clear; every run has slightly different top contributors, e.g.:

some run:
------------
Showing nodes accounting for 70ms, 100% of 70ms total
Showing top 10 nodes out of 41
      flat  flat%   sum%        cum   cum%
      20ms 28.57% 28.57%       20ms 28.57%  runtime.deductAssistCredit
      20ms 28.57% 57.14%       20ms 28.57%  runtime.memclrNoHeapPointers
      10ms 14.29% 71.43%       10ms 14.29%  cmpbody
      10ms 14.29% 85.71%       10ms 14.29%  go.etcd.io/bbolt.(*Cursor).keyValue
      10ms 14.29%   100%       50ms 71.43%  runtime.mallocgc
         0     0%   100%       10ms 14.29%  bytes.Compare (inline)
         0     0%   100%       60ms 85.71%  go.etcd.io/bbolt.(*Bucket).Put
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Bucket).Put.func1
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Bucket).dereference
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Bucket).spill


some other run:
------------
Showing nodes accounting for 70ms, 100% of 70ms total
Showing top 10 nodes out of 40
      flat  flat%   sum%        cum   cum%
      20ms 28.57% 28.57%       40ms 57.14%  runtime.mallocgc
      10ms 14.29% 42.86%       10ms 14.29%  go.etcd.io/bbolt.(*Cursor).nsearch.func1
      10ms 14.29% 57.14%       10ms 14.29%  go.etcd.io/bbolt.(*DefaultLogger).Debugf
      10ms 14.29% 71.43%       10ms 14.29%  go.etcd.io/bbolt.(*node).put
      10ms 14.29% 85.71%       10ms 14.29%  runtime.(*sysMemStat).add
      10ms 14.29%   100%       10ms 14.29%  runtime.nextFreeFast (inline)
         0     0%   100%       60ms 85.71%  go.etcd.io/bbolt.(*Bucket).Put
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Bucket).dereference
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Bucket).spill
         0     0%   100%       10ms 14.29%  go.etcd.io/bbolt.(*Cursor).nsearch

(v1.3.8)
HEAD is now at 42a914d Merge pull request #586 from ahrtr/1.3_64bit_align_20231025
[tjungblu ~/git/bbolt]$ rm -rf bin/
[tjungblu ~/git/bbolt]$ go build -o bin/bbolt ./cmd/bbolt
[tjungblu ~/git/bbolt]$ GOGC=off bin/bbolt bench -count 100000 -batch-size 25000 -profile-mode w -cpuprofile ./main.cpu
# Write	53.289428ms	(532ns/op)	(1879699 op/sec)
# Read	1.000647943s	(8ns/op)	(125000000 op/sec)

[tjungblu ~/git/bbolt]$ go tool pprof ./main.cpu 
File: bbolt
Type: cpu
Time: Apr 17, 2024 at 2:40pm (CEST)
Duration: 201ms, Total samples = 60ms (29.85%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 60ms, 100% of 60ms total
Showing top 10 nodes out of 44
      flat  flat%   sum%        cum   cum%
      10ms 16.67% 16.67%       10ms 16.67%  cmpbody
      10ms 16.67% 33.33%       20ms 33.33%  go.etcd.io/bbolt.(*node).dereference
      10ms 16.67% 50.00%       10ms 16.67%  runtime.memmove
      10ms 16.67% 66.67%       10ms 16.67%  runtime.nextFreeFast (inline)
      10ms 16.67% 83.33%       10ms 16.67%  runtime/internal/atomic.(*Uint32).Add (inline)
      10ms 16.67%   100%       10ms 16.67%  runtime/internal/syscall.Syscall6
         0     0%   100%       10ms 16.67%  bytes.Compare (inline)
         0     0%   100%       30ms 50.00%  go.etcd.io/bbolt.(*Bucket).Put
         0     0%   100%       20ms 33.33%  go.etcd.io/bbolt.(*Bucket).dereference
         0     0%   100%       20ms 33.33%  go.etcd.io/bbolt.(*Bucket).spill

@ahrtr
Copy link
Member

ahrtr commented Apr 19, 2024

Thanks @ambaxter for the finding, and thanks all for the testing & analysis.

Confirmed the issue:

  • The write performance drop is caused by the logger added in main.

    bbolt/bucket.go

    Lines 448 to 456 in f5447f0

    lg := b.tx.db.Logger()
    lg.Debugf("Putting key %q", string(key))
    defer func() {
    	if err != nil {
    		lg.Errorf("Putting key %q failed: %v", string(key), err)
    	} else {
    		lg.Debugf("Putting key %q successfully", string(key))
    	}
    }()
  • The read performance decrease can be resolved by aggregating the add operations, as bench: aggregate adding completed ops for reads #721 does.

Proposed solution

For the first issue, we can add a field lg Logger to Bucket and update Put to something like the following:

func (b *Bucket) Put(key []byte, value []byte) (err error) {
	if b.lg != nil {
		b.lg.Debugf("Putting key %q", string(key))
		defer func() {
			if err != nil {
				b.lg.Errorf("Putting key %q failed: %v", string(key), err)
			} else {
				b.lg.Debugf("Putting key %q successfully", string(key))
			}
		}()
	}
	...
}

We need to update all public methods in a similar way. At least we should do it for Put, Delete, etc., which are most likely to be called frequently.
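
For illustration, Delete guarded the same way might look roughly like the sketch below (following the proposal above, with the same assumed lg field; not committed code):

func (b *Bucket) Delete(key []byte) (err error) {
	if b.lg != nil {
		b.lg.Debugf("Deleting key %q", string(key))
		defer func() {
			if err != nil {
				b.lg.Errorf("Deleting key %q failed: %v", string(key), err)
			} else {
				b.lg.Debugf("Deleted key %q successfully", string(key))
			}
		}()
	}
	...
}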

@ivanvc
Member

ivanvc commented Apr 19, 2024

@ahrtr, I'll finish the changes to address the read issues in #720. I can also work on a proposal for the write solution if that's fine; I think I can work on it this weekend and have a draft early next week.

@ahrtr
Member

ahrtr commented Apr 20, 2024

I can also work on a proposal for the write solution if that's fine;

yes, please

@ivanvc
Member

ivanvc commented Apr 22, 2024

Hi @ahrtr, I've been reviewing the proposed solution for the write slowdown. One part that is unclear to me is how Bucket will receive a different logger. It could read it from db.logger in newBucket(tx *Tx) Bucket:

bbolt/bucket.go

Lines 47 to 54 in b005c0c

func newBucket(tx *Tx) Bucket {
	var b = Bucket{tx: tx, FillPercent: DefaultFillPercent}
	if tx.writable {
		b.buckets = make(map[string]*Bucket)
		b.nodes = make(map[common.Pgid]*node)
	}
	return b
}

However, if we use the same logger as the database, it will still have a slowdown. When initializing the database, if the received logger is nil, it will still use the discardLogger (and the performance with this logger is still affected):

bbolt/db.go

Lines 200 to 204 in b005c0c

if options.Logger == nil {
	db.logger = getDiscardLogger()
} else {
	db.logger = options.Logger
}

So, what I was thinking is that, instead of adding a new logger to the Bucket, we add the following condition before logging: check whether the logger is the discardLogger, and if so, skip the logging:

func (b *Bucket) Put(key []byte, value []byte) (err error) {
	lg := b.tx.db.Logger()
	if lg != discardLogger {
		lg.Debugf("Putting key %q", string(key))
		defer func() {
			if err != nil {
				lg.Errorf("Putting key %q failed: %v", string(key), err)
			} else {
				lg.Debugf("Putting key %q successfully", string(key))
			}
		}()
	}

This way, there's no change to the Bucket struct. We'll also need to change the initialization of the database in the bench command to:

db, err := bolt.Open(options.Path, 0600, &bolt.Options{Logger: nil})

After applying this change, I confirmed that the write performance matches 1.3's.

@ahrtr
Member

ahrtr commented Apr 23, 2024

So, what I was thinking is that, instead of adding a new logger to the Bucket, we add the following condition before logging: check whether the logger is the discardLogger, and if so, skip the logging:

Technically it works, but it looks a little strange. The purpose of introducing discardLogger was to get rid of such comparisons.
I suggest updating

bbolt/db.go

Lines 200 to 204 in b005c0c

if options.Logger == nil {
	db.logger = getDiscardLogger()
} else {
	db.logger = options.Logger
}

To

if options.Logger != nil {
	db.logger = options.Logger
}

And then still follow my proposal above. What do you think?

@ivanvc
Member

ivanvc commented Apr 23, 2024

And then still follow my proposal above. What do you think?

Makes sense. I'll open a PR with this implementation.

IMHO, one thing we'll need to address later is that the logger should not significantly impact performance. We could consider switching to Uber/zap in the next version (of course, after benchmarking).

@ahrtr
Member

ahrtr commented Apr 23, 2024

We could consider switching to Uber/zap in the next version (of course, after benchmarking).

bbolt just defines a Logger API; etcd passes in a logger instance which implements the interface. Actually, what etcd passes in is already an Uber/zap instance.

@tjungblu
Contributor

tjungblu commented Apr 25, 2024

Is the performance issue coming from the formatting or the additional indirection of the function call?

lg.Debugf("Putting key %q", string(key))

---

func (l *DefaultLogger) Debug(v ...interface{}) {
	if l.debug {
		_ = l.Output(calldepth, header("DEBUG", fmt.Sprint(v...)))
	}
}

Looks like the whole thing should just be entirely removed by the compiler, given debug is false, which it is.

I've tried to help the compiler a little by just removing all log-related statements in Put:

current HEAD:
# Write	1000000(ops)	828.889531ms	(828ns/op)	(1207729 op/sec)

with the first Debugf log removed:
# Write	1000000(ops)	776.707483ms	(776ns/op)	(1288659 op/sec)

with all log-related calls removed:
# Write	1000000(ops)	699.430942ms	(699ns/op)	(1430615 op/sec)

So even if we removed all logs, we're still missing a good chunk of performance compared to the (504ns/op) we had before.

Okay, edit: after bisecting this with git, it's indeed pointing to #646.
Here's the larger diff for finding the remaining 200ns :)
4c7075e...main

@tjungblu
Contributor

tjungblu commented Apr 25, 2024

I've been briefly going through the objdump of those methods:
main: bucket_put_log.txt

That's with these statements entirely removed:
https://github.com/etcd-io/bbolt/blob/main/bucket.go#L448-L456
bucket_put_nolog.txt

Just naively diffing, these are the additional instructions, which kind of explains the profile outputs above as well:

  bucket.go:444		0x49ca4e		e8ad0ff7ff		CALL runtime.newobject(SB)					
  bucket.go:444		0x49ca53		4889842480000000	MOVQ AX, 0x80(SP)						
  bucket.go:444		0x49ca5b		488b9c24f8000000	MOVQ 0xf8(SP), BX						
  bucket.go:444		0x49ca63		488b8c2400010000	MOVQ 0x100(SP), CX						
  bucket.go:444		0x49ca6b		31c0			XORL AX, AX							
  bucket.go:444		0x49ca6d		e8ce27fbff		CALL runtime.slicebytetostring(SB)				
  bucket.go:444		0x49ca72		e809e2f6ff		CALL runtime.convTstring(SB)					
  bucket.go:444		0x49ca77		488d0d02df0100		LEAQ 0x1df02(IP), CX						
  bucket.go:444		0x49ca7e		488bbc2480000000	MOVQ 0x80(SP), DI						
  bucket.go:444		0x49ca86		48890f			MOVQ CX, 0(DI)							
  bucket.go:444		0x49ca89		833d202f150000		CMPL runtime.writeBarrier(SB), $0x0				
  bucket.go:444		0x49ca90		7410			JE 0x49caa2							
  bucket.go:444		0x49ca92		e84970fcff		CALL runtime.gcWriteBarrier2(SB)				
  bucket.go:444		0x49ca97		498903			MOVQ AX, 0(R11)							
  bucket.go:444		0x49ca9a		488b5708		MOVQ 0x8(DI), DX						
  bucket.go:444		0x49ca9e		49895308		MOVQ DX, 0x8(R11)						
  bucket.go:444		0x49caa2		48894708		MOVQ AX, 0x8(DI)		

(sorry, the line numbers are shifted by 4)

I wonder if we could get those optimized away entirely by setting up the logger slightly differently; will continue tomorrow unless Ivan catches it earlier :)

@ivanvc
Member

ivanvc commented Apr 25, 2024

@tjungblu, I'm just reading this. I tried removing the formatting too, without luck. I also tried creating a Logger interface with empty functions, but the gain, as you may have found, was marginal. The only solution that worked is what @ahrtr suggested. As I mentioned in the PR, it's not pretty, but it restores the original performance.

However, I didn't go all the way to debug the compiled code. So, your analysis is more complete than mine, but we have similar results.

@tjungblu
Contributor

I also tried creating a Logger interface with empty functions, but the gain, as you may have found, was marginal.

Yep, which is odd. The Go compiler does know how to devirtualize and inline things; I just can't get it to tell me why it won't do so in this scenario. The nil-pointer trick indeed works, even though it's quite annoying to have all those nil checks.

Another option could be to have an "enabled" flag, similar to how klog works:

if klog.V(2).Enabled() { klog.Info("log this") }
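
Transposed to bbolt, that pattern could look roughly like the sketch below; LeveledLogger and putWithGuard are hypothetical names for illustration, not bbolt API:

package bench

// LeveledLogger is a hypothetical variant of bbolt's Logger interface
// with an Enabled check, mirroring klog.V(n).Enabled().
type LeveledLogger interface {
	Debugf(format string, v ...interface{})
	Enabled() bool
}

// putWithGuard shows the call-site shape: the string(key) conversion
// (and its heap allocation) only happens when logging is enabled.
func putWithGuard(lg LeveledLogger, key []byte) {
	if lg != nil && lg.Enabled() {
		lg.Debugf("Putting key %q", string(key))
	}
	// ... actual Put logic would follow
}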

I'll benchmark a bit more...

@ahrtr
Member

ahrtr commented Apr 26, 2024

TL;DR

The performance regression (around 9%) should be caused by the type conversion, e.g. from []byte to string in this case.

lg.Debugf("Putting key %q", string(key))

objdump

For simplicity, I removed the defer function, so we only print one debug log. We can still reproduce the issue (the write performance drop).

$ git diff
diff --git a/bucket.go b/bucket.go
index 2f1d710..222635e 100644
--- a/bucket.go
+++ b/bucket.go
@@ -447,13 +447,6 @@ func (b *Bucket) Get(key []byte) []byte {
 func (b *Bucket) Put(key []byte, value []byte) (err error) {
        lg := b.tx.db.Logger()
        lg.Debugf("Putting key %q", string(key))
-       defer func() {
-               if err != nil {
-                       lg.Errorf("Putting key %q failed: %v", string(key), err)
-               } else {
-                       lg.Debugf("Putting key %q successfully", string(key))
-               }
-       }()
        if b.tx.db == nil {
                return errors.ErrTxClosed
        } else if !b.Writable() {

Run the commands below:

$ go build ./cmd/bbolt
$ go tool objdump ./bbolt > dump.txt

The objdump of lg.Debugf("Putting key %q", string(key)) is as below:

400823   bucket.go:449         0x10017e26c             f0000680                ADRP 864256(PC), R0
400824   bucket.go:449         0x10017e270             910e8000                ADD $928, R0, R0
400825   bucket.go:449         0x10017e274             97fa4a3f                CALL runtime.newobject(SB)
400826   bucket.go:449         0x10017e278             f9004be0                MOVD R0, 144(RSP)
400827   bucket.go:449         0x10017e27c             f9406be1                MOVD 208(RSP), R1
400828   bucket.go:449         0x10017e280             f9406fe2                MOVD 216(RSP), R2
400829   bucket.go:449         0x10017e284             aa1f03e0                MOVD ZR, R0
400830   bucket.go:449         0x10017e288             97fb70da                CALL runtime.slicebytetostring(SB)
400831   bucket.go:449         0x10017e28c             97fa3ea1                CALL runtime.convTstring(SB)
400832   bucket.go:449         0x10017e290             f0000661                ADRP 847872(PC), R1
400833   bucket.go:449         0x10017e294             91130021                ADD $1216, R1, R1
400834   bucket.go:449         0x10017e298             f9404be3                MOVD 144(RSP), R3
400835   bucket.go:449         0x10017e29c             f9000061                MOVD R1, (R3)
400836   bucket.go:449         0x10017e2a0             d000179b                ADRP 3088384(PC), R27
400837   bucket.go:449         0x10017e2a4             b9404361                MOVWU 64(R27), R1
400838   bucket.go:449         0x10017e2a8             340000a1                CBZW R1, 5(PC)
400839   bucket.go:449         0x10017e2ac             97fbd2e1                CALL runtime.gcWriteBarrier2(SB)
400840   bucket.go:449         0x10017e2b0             f9000320                MOVD R0, (R25)
400841   bucket.go:449         0x10017e2b4             f9400466                MOVD 8(R3), R6
400842   bucket.go:449         0x10017e2b8             f9000726                MOVD R6, 8(R25)
400843   bucket.go:449         0x10017e2bc             f9000460                MOVD R0, 8(R3)
400844   bucket.go:449         0x10017e2c0             f9403fe6                MOVD 120(RSP), R6
400845   bucket.go:449         0x10017e2c4             f94010c6                MOVD 32(R6), R6
400846   bucket.go:449         0x10017e2c8             f94043e0                MOVD 128(RSP), R0
400847   bucket.go:449         0x10017e2cc             b00002a1                ADRP 348160(PC), R1
400848   bucket.go:449         0x10017e2d0             910a2821                ADD $650, R1, R1
400849   bucket.go:449         0x10017e2d4             b27f0be2                ORR $14, ZR, R2
400850   bucket.go:449         0x10017e2d8             b24003e4                ORR $1, ZR, R4
400851   bucket.go:449         0x10017e2dc             aa0403e5                MOVD R4, R5
400852   bucket.go:449         0x10017e2e0             d63f00c0                CALL (R6)

If we change the line to the following, the issue can't be reproduced anymore; it has the same write performance as release-1.3.

lg.Debugf("Putting key")

Its objdump is as below. Obviously it's much simpler: it doesn't execute the type-conversion operations, e.g. CALL runtime.newobject(SB), CALL runtime.slicebytetostring(SB), etc.

400821   bucket.go:450         0x10017e264             f9401106                MOVD 32(R8), R6
400822   bucket.go:450         0x10017e268             aa0703e0                MOVD R7, R0
400823   bucket.go:450         0x10017e26c             900002a1                ADRP 344064(PC), R1
400824   bucket.go:450         0x10017e270             91121c21                ADD $1159, R1, R1
400825   bucket.go:450         0x10017e274             d2800162                MOVD $11, R2
400826   bucket.go:450         0x10017e278             aa1f03e3                MOVD ZR, R3
400827   bucket.go:450         0x10017e27c             aa1f03e4                MOVD ZR, R4
400828   bucket.go:450         0x10017e280             aa0403e5                MOVD R4, R5
400829   bucket.go:450         0x10017e284             d63f00c0                CALL (R6)

Solution

Solution 1

One solution is to check lg != nil before writing each log, just as #738 does.

Good side:

  • There is no performance regression;
  • It's consistent in how logs are processed.

Bad side:

  • It reduces readability, but not too much;
  • It isn't convenient for contributors, because they need to do the ugly nil check.

We can enhance this solution based on @tjungblu's comment above with something like if lg.Enabled() { ... } to improve readability a bit.

Solution 2

The second solution is to optimize only the most frequently called methods, e.g. Put, Delete, etc. We can follow the approach @ivanvc proposed in #720 (comment):

if lg != discardLogger {
    ....
}

Good side:

  • There is no performance regression;
  • A little better readability compared to solution 1.

Bad side:

  • Inconsistent way of processing logs;
  • Slightly reduces readability.

Solution 3

Do not make any change, but document the performance impact.

Good side:

  • No reduction in readability.

Bad side:

  • Write performance regression (around 9%). Note that the 9% is only in bbolt's bench test; it should be much lower in an application's e2e or benchmark test.

bbolt is a library to be used by other applications. The performance regression might be negligible in an application's E2E test or production environment. Can we compare the performance in etcd, e.g. etcd depending on main vs. etcd depending on 1.3.9?

@ambaxter @fuweid @ivanvc @tjungblu WDYT? Probably we can go for solution 1.

@tjungblu
Contributor

tjungblu commented Apr 26, 2024

I believe what happens with

lg.Debugf("Putting key %q", string(key))

with runtime.newobject(SB) the string escapes to the heap, so you also incur the cost of the GC write barrier, plus the string conversion.

lg.Debugf("Putting key")

on the other hand is just completely removed from your code, I reckon because the debug flag is off? Here the compiler seems to get it :)
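
One way to confirm the escape (a suggestion, not something run in this thread) is the compiler's escape-analysis output, e.g.:

$ go build -gcflags=-m ./... 2>&1 | grep bucket.go

which should print lines along the lines of "string(key) escapes to heap" for the allocating call.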

... the performance regression might be negligible in an application's E2E test or production environment.

Given the Go compiler can't properly optimize it away, there must be some performance reduction. It might not be noticeable in etcd because of the raft latency and the async writes.

I think Ivan's approach with the nil pointer is fine, but we need to be sure we don't NPE anywhere.

@ivanvc
Member

ivanvc commented Apr 28, 2024

As @ahrtr found out in my PR, finding every instance of a call to the logger is exhausting. It may be better to go with option two and skip logging only in the most impactful functions.

As for the etcd benchmarks, I've been running them since yesterday. I'll publish the results tomorrow as soon as they finish (but it seems like @tjungblu is correct, and the impact is negligible).

@ivanvc
Member

ivanvc commented Apr 29, 2024

Here's the output of bbolt 1.3 vs. main using rw-heatmaps. There is actually some performance loss, but it doesn't look significant.

[charts: bbolt_read, bbolt_write]

@ahrtr
Member

ahrtr commented Apr 29, 2024

Thanks @ivanvc .

If I understood it correctly, as the R/W ratio increases, the performance difference decreases. That aligns with our bbolt bench tests, because the regression is caused by Put.

Based on all the discussion & testing so far, we may want to go for solution 2.

  • We only optimize the most frequently called (or public/exported) methods, e.g.
    • [must] Put, Delete, CreateBucket, CreateBucketIfNotExists, DeleteBucket, Begin, Commit
    • [Optional] Open, MoveBucket, Sync
  • It's less error prone

@ivanvc
Member

ivanvc commented Apr 29, 2024

Opened PR #741, which implements solution 2.

@ahrtr
Member

ahrtr commented Apr 30, 2024

Thanks @ivanvc

@ambaxter we have just resolved this issue in #741. Could you double-check whether you can still reproduce it (main vs 1.3.8)?

@ahrtr
Member

ahrtr commented May 3, 2024

Please feel free to reopen this issue if anyone can still reproduce it.

@ahrtr ahrtr closed this as completed May 3, 2024