Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachperf: regression in kv on 2023/03/10 - multiple benchmarks #98571

Closed
smg260 opened this issue Mar 14, 2023 · 6 comments · Fixed by #98630
Closed

roachperf: regression in kv on 2023/03/10 - multiple benchmarks #98571

smg260 opened this issue Mar 14, 2023 · 6 comments · Fixed by #98630
Assignees
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team

Comments

@smg260
Copy link
Contributor

smg260 commented Mar 14, 2023

Significant drop on or around March 10th.
image

Affects multiple kv0 and kv95

Jira issue: CRDB-25340

@smg260 smg260 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-roachperf-investigation Issue opened as a result of roachperf triage T-testeng TestEng Team labels Mar 14, 2023
@smg260 smg260 self-assigned this Mar 14, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 14, 2023

cc @cockroachdb/test-eng

@smg260
Copy link
Contributor Author

smg260 commented Mar 14, 2023

Bisection results using kv0/enc=false/nodes=1:

2023-03-14T14:06:32Z   Bisecting regression in [kv0/enc=false/nodes=1] using commit range [4dc9e98d8cc (known good),d4a584e49f0 (known bad)]
2023-03-14T14:06:32Z   Thresholds [good >= 10643, bad <= 6599]
2023-03-14T14:23:01Z   [6f444d4f841] median ops/s: [9397]. User marked as good. Threshold updated.
2023-03-14T14:34:05Z   [eb0ba1123bf] median ops/s: [7182]. User marked as bad. Threshold updated.
2023-03-14T14:46:44Z   [25ecce9c598] median ops/s: [9756]. Auto marked as good.
2023-03-14T15:00:14Z   [e4ceb499e9b] median ops/s: [10248]. Auto marked as good.
2023-03-14T15:15:13Z   [42be60f296a] median ops/s: [10066]. Auto marked as good.
2023-03-14T15:45:18Z   [5705e7c358f] median ops/s: [9302]. User marked as good. Threshold updated.
2023-03-14T15:58:48Z   [86a5852c253] median ops/s: [6200]. Auto marked as bad.
2023-03-14T15:58:48Z   Bisection complete. Suspect commit:
commit 86a5852c2531428a493231f26786c886327f30e3
Author: Nathan VanBenschoten <[email protected]>
Date:   Sun Mar 5 15:46:20 2023 -0500

    kv: async resolve unreplicated locks on other ranges after one-phase commit

    Fixes #94400.

    This commit fixes async lock resolution on external ranges after a transaction
    exercises the one-phase commit fast-path. Before the change, the 1PC fast-path
    was failing to handle the case where a transaction had acquired locks on ranges
    other than the range with the transaction record. As a result, these locks were
    not asynchronously but eagerly cleaned up. Conflicting transactions that
    encountered these locks would then wait 50ms before pushing the lock holder,
    finding it to be committed, and removing the locks themselves.

    It's fairly rare for a transaction to acquire locks on multiple ranges but then
    still be able to perform a 1PC commit. Regardless, we should handle such cases.

    Release note (bug fix): Fixed a bug where transactions that performed a SELECT
    FOR UPDATE across multiple ranges but never performed writes could fail to
    eagerly clean up their locks after commit. Future transactions that encountered
    these abandoned locks could be delayed for 50ms before unlocking them.

#98044

@smg260 smg260 added the T-kv KV Team label Mar 14, 2023
@erikgrinaker
Copy link
Contributor

@nvanbenschoten ^

@smg260 smg260 removed their assignment Mar 14, 2023
@nvanbenschoten nvanbenschoten self-assigned this Mar 14, 2023
@nvanbenschoten
Copy link
Member

Thanks @smg260 for determining the cause of the regression. I can see it in microbenchmarks as well:

$ benchdiff --new=86a5852c2531428a493231f26786c886327f30e3 --post-checkout='dev generate go' --run='BenchmarkKV/Insert/Native' ./pkg/sql/tests
test binaries already exist for '2d87998'; skipping build
test binaries already exist for '86a5852'; skipping build

  pkg=1/1 iter=10/10 cockroachdb/cockroach/pkg/sql/tests \

name                            old time/op    new time/op    delta
KV/Insert/Native/rows=10-30        201µs ± 2%     221µs ± 2%   +9.98%  (p=0.000 n=10+10)
KV/Insert/Native/rows=100-30       601µs ± 1%     677µs ± 1%  +12.69%  (p=0.000 n=10+10)
KV/Insert/Native/rows=1-30         150µs ± 3%     171µs ± 2%  +13.80%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-30    38.1ms ± 3%    44.4ms ± 3%  +16.67%  (p=0.000 n=10+10)
KV/Insert/Native/rows=1000-30     3.82ms ± 3%    4.49ms ± 2%  +17.62%  (p=0.000 n=10+10)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30        18.4kB ± 1%    23.7kB ± 1%  +28.88%  (p=0.000 n=9+10)
KV/Insert/Native/rows=10-30       43.8kB ± 1%    66.3kB ± 1%  +51.59%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-30    35.0MB ± 2%    55.3MB ± 1%  +58.11%  (p=0.000 n=10+10)
KV/Insert/Native/rows=100-30       292kB ± 1%     485kB ± 1%  +66.28%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-30     2.80MB ± 1%    4.81MB ± 1%  +71.85%  (p=0.000 n=10+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1-30           144 ± 1%       194 ± 1%  +35.28%  (p=0.000 n=8+10)
KV/Insert/Native/rows=10-30          302 ± 1%       418 ± 1%  +38.37%  (p=0.000 n=10+9)
KV/Insert/Native/rows=100-30       1.66k ± 1%     2.40k ± 0%  +44.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=10000-30      148k ± 0%      215k ± 0%  +45.15%  (p=0.000 n=9+10)
KV/Insert/Native/rows=1000-30      15.0k ± 0%     21.9k ± 0%  +46.14%  (p=0.000 n=10+10)

@nvanbenschoten
Copy link
Member

This looks to be due to a misunderstanding of how kvserverbase.IntersectSpan works. The utility does not attempt to handle point keys correctly. Instead, it considers all point keys to be outside of the descriptor's key range, even if they are contained by the descriptor's range:

func IntersectSpan(
span roachpb.Span, desc *roachpb.RangeDescriptor,
) (middle *roachpb.Span, outside []roachpb.Span) {
start, end := desc.StartKey.AsRawKey(), desc.EndKey.AsRawKey()
if len(span.EndKey) == 0 {
outside = append(outside, span)
return
}

As a result, 1PC txns were launching async intent resolution tasks for local point lock spans.

The following diff resolves the regression:

diff --git a/pkg/kv/kvserver/kvserverbase/base.go b/pkg/kv/kvserver/kvserverbase/base.go
index 09ed15547ae..338b12d1b6c 100644
--- a/pkg/kv/kvserver/kvserverbase/base.go
+++ b/pkg/kv/kvserver/kvserverbase/base.go
@@ -164,6 +164,9 @@ func IntersectSpan(
 ) (middle *roachpb.Span, outside []roachpb.Span) {
        start, end := desc.StartKey.AsRawKey(), desc.EndKey.AsRawKey()
        if len(span.EndKey) == 0 {
+               if ContainsKey(desc, span.Key) {
+                       return &span, nil
+               }
                outside = append(outside, span)
                return
        }

The non-1PC EndTxn path also uses this utility function, but only for ranged intent spans.

@nvanbenschoten nvanbenschoten added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 14, 2023
@blathers-crl
Copy link

blathers-crl bot commented Mar 14, 2023

Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Mar 14, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 14, 2023
Fixes cockroachdb#98571.

This commit fixes the regression detected in cockroachdb#98571. In that issue, we saw that
the bug fix in 86a5852 (cockroachdb#98044) caused a regression in kv0 (and other benchmarks).
This was due to a bug in `kvserverbase.IntersectSpan`, which was considering
local point spans to be external to the provided range span.

This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock
spans.

The commit also makes the utility panic instead of silently returning incorrect results.
There's an existing TODO on the utility to generalize it. For now, we just make it harder
to misuse.

Finally, we add a test that asserts against the presence of async intent resolution after
one-phase commits when external intent resolution is not needed.

```
name                            old time/op    new time/op    delta
KV/Insert/Native/rows=1-10        61.2µs ± 3%    48.9µs ± 3%  -20.10%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10-10       93.3µs ±15%    76.2µs ± 3%  -18.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-10     2.84ms ±12%    2.42ms ± 4%  -14.97%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       365µs ± 5%     320µs ± 8%  -12.40%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-10    27.6ms ± 6%    24.4ms ± 3%  -11.53%  (p=0.000 n=9+9)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1000-10     4.66MB ± 1%    2.76MB ± 1%  -40.89%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       478kB ± 1%     287kB ± 1%  -39.90%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-10    54.2MB ± 2%    34.3MB ± 3%  -36.73%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10       64.2kB ± 1%    42.1kB ± 1%  -34.39%  (p=0.000 n=10+9)
KV/Insert/Native/rows=1-10        22.1kB ± 1%    17.3kB ± 1%  -21.56%  (p=0.000 n=9+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1000-10      21.5k ± 0%     14.7k ± 0%  -31.70%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10000-10      212k ± 0%      146k ± 0%  -31.31%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-10       2.34k ± 1%     1.61k ± 0%  -31.31%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10          392 ± 1%       276 ± 0%  -29.59%  (p=0.000 n=8+8)
KV/Insert/Native/rows=1-10           173 ± 1%       123 ± 0%  -29.04%  (p=0.000 n=9+8)
```

Release note: None
craig bot pushed a commit that referenced this issue Mar 15, 2023
98544: colmem: allow Allocator max batch size to be customized r=cucaroach a=cucaroach

Previously this was hardcoded to coldata.BatchSize or 1024, now it
can be increased or decreased.

Epic: CRDB-18892
Informs: #91831
Release note: None


98630: kv: don't perform async intent resolution on 1PC with point lock spans r=arulajmani a=nvanbenschoten

Fixes #98571.

This commit fixes the regression detected in #98571. In that issue, we saw that the bug fix in 86a5852 (#98044) caused a regression in kv0 (and other benchmarks). This was due to a bug in `kvserverbase.IntersectSpan`, which was considering local point spans to be external to the provided range span.

This commit fixes the bug by not calling `kvserverbase.IntersectSpan` for point lock spans.

The commit also makes the utility panic instead of silently returning incorrect results. There's an existing TODO on the utility to generalize it. For now, we just make it harder to misuse.

Finally, we add a test that asserts against the presence of async intent resolution after one-phase commits when external intent resolution is not needed.

```
name                            old time/op    new time/op    delta
KV/Insert/Native/rows=1-10        61.2µs ± 3%    48.9µs ± 3%  -20.10%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10-10       93.3µs ±15%    76.2µs ± 3%  -18.34%  (p=0.000 n=9+9)
KV/Insert/Native/rows=1000-10     2.84ms ±12%    2.42ms ± 4%  -14.97%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       365µs ± 5%     320µs ± 8%  -12.40%  (p=0.000 n=10+9)
KV/Insert/Native/rows=10000-10    27.6ms ± 6%    24.4ms ± 3%  -11.53%  (p=0.000 n=9+9)

name                            old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1000-10     4.66MB ± 1%    2.76MB ± 1%  -40.89%  (p=0.000 n=9+9)
KV/Insert/Native/rows=100-10       478kB ± 1%     287kB ± 1%  -39.90%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10000-10    54.2MB ± 2%    34.3MB ± 3%  -36.73%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10       64.2kB ± 1%    42.1kB ± 1%  -34.39%  (p=0.000 n=10+9)
KV/Insert/Native/rows=1-10        22.1kB ± 1%    17.3kB ± 1%  -21.56%  (p=0.000 n=9+10)

name                            old allocs/op  new allocs/op  delta
KV/Insert/Native/rows=1000-10      21.5k ± 0%     14.7k ± 0%  -31.70%  (p=0.000 n=8+9)
KV/Insert/Native/rows=10000-10      212k ± 0%      146k ± 0%  -31.31%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-10       2.34k ± 1%     1.61k ± 0%  -31.31%  (p=0.000 n=10+10)
KV/Insert/Native/rows=10-10          392 ± 1%       276 ± 0%  -29.59%  (p=0.000 n=8+8)
KV/Insert/Native/rows=1-10           173 ± 1%       123 ± 0%  -29.04%  (p=0.000 n=9+8)
```

Release note: None

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 3300ba2 Mar 15, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-roachperf-investigation Issue opened as a result of roachperf triage branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants