
ccl/backupccl: TestGCDropIndexSpanExpansion failed #75202

Closed
cockroach-teamcity opened this issue Jan 20, 2022 · 13 comments · Fixed by #75491
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-disaster-recovery

Comments

@cockroach-teamcity
Member

ccl/backupccl.TestGCDropIndexSpanExpansion failed with artifacts on master @ 46bdd4e299b3324eb65b943dc4a1c10ded7d8a7a:

=== RUN   TestGCDropIndexSpanExpansion
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion3544265402
    test_log_scope.go:80: use -show-logs to present logs inline
    backup_test.go:9036: error executing 'BACKUP INTO LATEST IN 'nodelocal://0/foo' WITH revision_history': pq: failed to run backup: exporting 13 ranges: exporting /Table/56/{3-4}: batch timestamp 1642664115.146851995,0 must be after replica GC threshold 1642664116.550345007,0
    panic.go:661: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion3544265402
--- FAIL: TestGCDropIndexSpanExpansion (9.85s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-json

/cc @cockroachdb/bulk-io

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 20, 2022
@msbutler
Collaborator

The test fails while executing the full backup, specifically when the backup processor attempts to export /Table/56/{3-4} ("exporting 13 ranges: exporting /Table/56/{3-4}").

@cockroach-teamcity
Member Author

ccl/backupccl.TestGCDropIndexSpanExpansion failed with artifacts on master @ ebda0ecb4aa1fe47f1403635846e342a2cfbfa1b:

=== RUN   TestGCDropIndexSpanExpansion
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion3501611901
    test_log_scope.go:80: use -show-logs to present logs inline
    backup_test.go:9036: error executing 'BACKUP INTO LATEST IN 'nodelocal://0/foo' WITH revision_history': pq: failed to run backup: exporting 13 ranges: exporting /Table/56/{1-2}: batch timestamp 1642719965.243485028,0 must be after replica GC threshold 1642719966.582763509,0
    panic.go:661: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion3501611901
--- FAIL: TestGCDropIndexSpanExpansion (9.38s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-json

This test on roachdash | Improve this report!

@adityamaru
Contributor

I think this is my fault, I'll take this.

@adityamaru adityamaru self-assigned this Jan 21, 2022
@cockroach-teamcity
Member Author

ccl/backupccl.TestGCDropIndexSpanExpansion failed with artifacts on master @ 72655fd4730fadb0ddacd8a5513a5d2d9121cbc6:

=== RUN   TestGCDropIndexSpanExpansion
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion1354184770
    test_log_scope.go:80: use -show-logs to present logs inline
    backup_test.go:9036: error executing 'BACKUP INTO LATEST IN 'nodelocal://0/foo' WITH revision_history': pq: failed to run backup: exporting 13 ranges: exporting /Table/56/{1-2}: batch timestamp 1642777676.763932686,0 must be after replica GC threshold 1642777678.034359492,0
    panic.go:661: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion1354184770
--- FAIL: TestGCDropIndexSpanExpansion (9.51s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-json

This test on roachdash | Improve this report!

@adityamaru
Contributor

adityamaru commented Jan 23, 2022

So it turns out this is actually not related to the ongoing PTS work; bisecting points to #73876. cc: @irfansharif

I'm not exactly sure what behavior has changed, so I'm still looking into it.

@adityamaru
Contributor

adityamaru commented Jan 23, 2022

I've narrowed down what is going on a little more. The test was slightly incorrect in that it was setting a short GC TTL on the whole table instead of on just the index being dropped (foo@bar). Even after fixing this (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/backup_test.go#L9014), the last backup sometimes issues an export request whose batch timestamp falls below the range's GC threshold. With or without protected timestamps this should not happen, since the default GC TTL is 25 hours.
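
For context, the error above comes from the standard read-versus-GC-threshold check: a request reading at some timestamp is rejected if that timestamp is at or below the replica's GC threshold. A simplified sketch of that condition (not the actual kvserver code; hlc is pkg/util/hlc and errors is github.com/cockroachdb/errors):

    // checkBatchAboveGCThreshold sketches the condition behind "batch timestamp ...
    // must be after replica GC threshold ...": the read timestamp must be strictly
    // above the threshold below which the replica may already have GC'ed data.
    func checkBatchAboveGCThreshold(batchTS, threshold hlc.Timestamp) error {
        if batchTS.LessEq(threshold) {
            return errors.Errorf(
                "batch timestamp %s must be after replica GC threshold %s", batchTS, threshold)
        }
        return nil
    }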

After the GC job on the dropped index successfully runs GC (https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/backupccl/backup_test.go#L9034), and we run the third backup, the logs show a few interesting things:

We seem to see some range split activity:

I220123 04:23:52.958839 3092 kv/kvserver/replica_command.go:400  [n1,split,s1,r43/1:/{Table/47-Max}] 185  initiating a split of this range at key /Table/56 [r44] (span config)
I220123 04:23:53.069428 3190 kv/kvserver/replica_command.go:400  [n1,split,s1,r44/1:/{Table/56-Max}] 187  initiating a split of this range at key /Table/56/2 [r45] (span config)
I220123 04:23:53.193499 3193 kv/kvserver/replica_command.go:400  [n1,split,s1,r45/1:/{Table/56/2-Max}] 188  initiating a split of this range at key /Table/56/3 [r46] (span config)

In the third backup, when we are about to process the export request for /56/1-/56/2, we evaluate the GC threshold on range 44, which has a GC TTL of 25 hours, as expected.

I220123 04:23:53.240838 3285 kv/kvserver/replica.go:1462  [n1,s1,r44/1:/Table/56{-/2}] 223  this is the replica rangeID: 44, with the GC TTL: 90000.000000
I220123 04:23:53.241628 3285 kv/kvserver/replica.go:1463  [n1,s1,r44/1:/Table/56{-/2}] 226  this is the request Export [/Table/56/1,/Table/56/2), [wait-policy: Error]

However, when we go to evaluate the export request for /56/3-/56/4, we evaluate the GC threshold on range 45, which has a GC TTL of 1 second. Given the splits above, should this span not be evaluated on range 46, which has a GC TTL of 25 hours?

I220123 04:23:53.241530 3286 kv/kvserver/replica.go:1456  [n1,s1,r45/1:/{Table/56/2-Max}] 225  err: this is the replica rangeID: 45, with the GC TTL: 1.000000
I220123 04:23:53.242519 3286 kv/kvserver/replica.go:1457  [n1,s1,r45/1:/{Table/56/2-Max}] 227  err: this is the request Export [/Table/56/3,/Table/56/4), [wait-policy: Error]
I220123 04:23:53.253385 2886 jobs/registry.go:1093  [n1] 228  BACKUP job 730206806831529985: stepping through state reverting with error: failed to run backup: exporting 13 ranges: exporting /Table/56/{3-4}: batch timestamp 1642911830.736144771,0 must be after replica GC threshold 1642911832.240915245,0
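
(For reference, one way to double-check which range currently serves these keys is a query against crdb_internal.ranges; this is a hypothetical debugging aid using the test's sqlRunner, not something the test does.)

    // List the ranges covering table 56 and their bounds, to see whether
    // /Table/56/3 is still served by r45 or has already moved to r46.
    rows := sqlRunner.QueryStr(t, `
        SELECT range_id, start_pretty, end_pretty
          FROM crdb_internal.ranges
         WHERE start_pretty LIKE '/Table/56%'`)
    for _, row := range rows {
        t.Logf("r%s: [%s, %s)", row[0], row[1], row[2])
    }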

@irfansharif does anything stand out as unexpected in the above logs as a consequence of enabling span configs?

@adityamaru
Contributor

I get the sense that what is happening here is similar to #75436; let me try fixing it similarly.

@irfansharif
Contributor

irfansharif commented Jan 24, 2022

Do you have a quick repro/branch I can use? It could be a few things:

  • kvserver: avoid clobbering replica conf #75233, which fixes an actual bug.
  • There's lag between a range getting split and the RHS of the range actually applying the config (see the snippet below). Not sure if that's what's happening here, or where the GC TTL of 1s is coming from (did the RHS of the range previously have that TTL?)
    conf, err := confReader.GetSpanConfigForKey(ctx, desc.StartKey)
    if err != nil {
      return errors.Wrapf(err, "%s: failed to lookup span config", r)
    }
    r.SetSpanConfig(conf)
  • In execbuilder: deflake TestExecBuild #75304 we saw how a stale DistSender cache (not updated with the latest splits) ended up flaking the test. I'm not sure which asynchrony exactly contributes to the DistSender cache being a bit more stale after *: enable span configs by default #73876, and whether that's a real problem outside of tests, but could that explain why we evaluate the export request on a range other than the one we'd expect? Ignoring the GC threshold timestamp error, shouldn't the ExportRequest retry?

@adityamaru
Contributor

Yup, if you apply this diff to fix the test, it still fails under stressrace in ~200 runs on my gceworker.

diff --git a/pkg/ccl/backupccl/backup_test.go b/pkg/ccl/backupccl/backup_test.go
index 9564bbc6ed..11e9c70266 100644
--- a/pkg/ccl/backupccl/backup_test.go
+++ b/pkg/ccl/backupccl/backup_test.go
@@ -9011,7 +9011,7 @@ func TestGCDropIndexSpanExpansion(t *testing.T) {
        sqlRunner.Exec(t, `
 CREATE DATABASE test; USE test;
 CREATE TABLE foo (id INT PRIMARY KEY, id2 INT, id3 INT, INDEX bar (id2), INDEX baz(id3));
-ALTER TABLE foo CONFIGURE ZONE USING gc.ttlseconds = '1';
+ALTER INDEX foo@bar CONFIGURE ZONE USING gc.ttlseconds = '1';
 INSERT INTO foo VALUES (1, 2, 3);
 DROP INDEX foo@bar;
 `)
@@ -9033,6 +9033,8 @@ DROP INDEX foo@bar;
        // Wait for the GC to complete.
        jobutils.WaitForJob(t, sqlRunner, gcJobID)

+       // This backup should succeed since the spans being backed up have a default
+       // GC TTL of 25 hours.
        sqlRunner.Exec(t, `BACKUP INTO LATEST IN 'nodelocal://0/foo' WITH revision_history`)
 }

FWIW, I also tried running with the closed timestamp interval set to 100ms and adding jobs.TestingKnobsWithShortIntervals(), and I can't seem to get it to flake anymore. I'm not sure whether that is masking an underlying issue, so I'm holding off on sending a patch with those changes.
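
(For reference, a sketch of those two changes in the test setup; the setting and helper names below are as I'd expect them to be spelled in the repo, so treat the exact spelling as an assumption.)

    // Shorten job-system intervals via testing knobs when starting the test cluster...
    args := base.TestServerArgs{
        Knobs: base.TestingKnobs{
            JobsTestingKnobs: jobs.NewTestingKnobsWithShortIntervals(),
        },
    }
    // ...and lower the closed timestamp interval via a cluster setting.
    sqlRunner.Exec(t, `SET CLUSTER SETTING kv.closed_timestamp.target_duration = '100ms'`)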

Ignoring the GC threshold timestamp error, shouldn't the ExportRequest retry?

I don't think a retry would fix this because an ExportRequest sets its EarliestActiveTimestamp() to the StartTime of the backup it is running as part of. So if the request has failed under the GC threshold once, it'll fail with this error every time.

@irfansharif
Contributor

For my own understanding, is the "gc.ttlseconds = '1'" necessary for this test?

@adityamaru
Contributor

Yup, the GC job responsible for clearing the dropped index waits until the TTL has passed before actually running GC. So 1 second just ensures that the job completes in a reasonable timeframe.
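
Concretely, "wait for the GC to complete" amounts to finding the SCHEMA CHANGE GC job created by the DROP INDEX and waiting for it to succeed; the diff above does this via jobutils.WaitForJob on the job ID. A hedged, manual equivalent (the exact query in backup_test.go may differ):

    // Find the index-GC job spawned by DROP INDEX foo@bar and poll until it has
    // succeeded; only then has the 1s TTL actually led to the index data being GC'ed.
    testutils.SucceedsSoon(t, func() error {
        var status string
        sqlRunner.QueryRow(t, `
            SELECT status FROM [SHOW JOBS]
             WHERE job_type = 'SCHEMA CHANGE GC' AND description LIKE '%foo%'
             ORDER BY created DESC LIMIT 1`).Scan(&status)
        if status != "succeeded" {
            return errors.Errorf("GC job still %s", status)
        }
        return nil
    })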

@irfansharif
Contributor

Confirming it repros reliably:

$ dev test pkg/ccl/backupccl -f=TestGCDropIndexSpanExpansion -v --show-logs --stress --race
240 runs so far, 0 failures, over 5m0s
240 runs so far, 0 failures, over 5m5s
242 runs so far, 0 failures, over 5m10s
..
249 runs completed, 1 failures, over 5m17s

Running the test's (modified) SQL locally to remind myself of how the SQL objects map to encoded key spans and of the relevant configs:

CREATE DATABASE test; USE test;
CREATE TABLE foo (id INT PRIMARY KEY, id2 INT, id3 INT, INDEX bar (id2), INDEX baz(id3));
ALTER INDEX foo@bar CONFIGURE ZONE USING gc.ttlseconds = '1';
INSERT INTO foo VALUES (1, 2, 3);
DROP INDEX foo@bar;
SELECT 
  crdb_internal.pretty_key(start_key, -1), crdb_internal.pretty_key(end_key, -1),
  crdb_internal.pb_to_json('cockroach.roachpb.SpanConfig', config)->'gcPolicy'->>'ttlSeconds'
FROM system.span_configurations;
    crdb_internal.pretty_key    | crdb_internal.pretty_key | ?column?
--------------------------------+--------------------------+-----------
  /Min                          | /System/NodeLiveness     | 3600
  [...]
  /Table/56                     | /Table/56/2              | 90000
  /Table/56/2                   | /Table/56/3              | 1
  /Table/56/3                   | /Table/57                | 90000
(50 rows)
  • [/Table/56, /Table/56/2) covers the key range from start of the table to start of INDEX foo@bar (exclusive);
  • [/Table/56/2, /Table/56/3) covers the foo@bar index;
  • [/Table/56/3, /Table/57) covers the rest of the table;
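
(For reference, these boundaries are just table/index prefixes in the SQL key encoding; a sketch using the keys package, with table ID 56 and index IDs 1-3 taken from this particular run.)

    // /Table/56/1 is foo's primary index, /Table/56/2 is foo@bar, /Table/56/3 is foo@baz.
    primary := keys.SystemSQLCodec.IndexPrefix(56, 1)
    bar := keys.SystemSQLCodec.IndexPrefix(56, 2)
    baz := keys.SystemSQLCodec.IndexPrefix(56, 3)
    fmt.Println(primary, bar, baz) // prints /Table/56/1 /Table/56/2 /Table/56/3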

Adding some instrumentation, I'm seeing the following.

I220124 23:13:03.887953 2855 kv/kvserver/pkg/kv/kvserver/replica_command.go:400  [n1,split,s1,r43/1:/{Table/47-Max}] 989  initiating a split of this range at key /Table/56 [r44] (span config)
I220124 23:13:03.989979 190 kv/kvserver/pkg/kv/kvserver/replica.go:771  [n1,s1,r44/1:/{Table/56-Max}] 990  DEBUG: set span config with gc.ttl_seconds = 90000
I220124 23:13:03.997361 2960 kv/kvserver/pkg/kv/kvserver/replica_command.go:400  [n1,split,s1,r44/1:/{Table/56-Max}] 991  initiating a split of this range at key /Table/56/2 [r45] (span config)
I220124 23:13:04.049992 232 kv/kvserver/pkg/kv/kvserver/replica.go:771  [n1,s1,r45/1:/{Table/56/2-Max}] 992  DEBUG: set span config with gc.ttl_seconds = 1
I220124 23:13:04.061052 2862 kv/kvserver/pkg/kv/kvserver/replica_command.go:400  [n1,split,s1,r45/1:/{Table/56/2-Max}] 993  initiating a split of this range at key /Table/56/3 [r46] (span config)
...
I220124 23:13:04.135694 3330 ccl/backupccl/pkg/ccl/backupccl/backup_processor.go:405  [n1,job=730711928388419585] 1005  sending ExportRequest for span /Table/56/{1-2} (attempt 1, priority normal)
I220124 23:13:04.139194 2463 ccl/backupccl/pkg/ccl/backupccl/backup_processor.go:405  [n1,job=730711928388419585] 1006  sending ExportRequest for span /Table/56/{3-4} (attempt 1, priority normal)
E220124 23:13:04.144829 2463 kv/kvserver/pkg/kv/kvserver/replica.go:1456  [n1,s1,r45/1:/{Table/56/2-Max}] 1007  DEBUG: batch timestamp 1643065981.627201349,0 must be after replica GC threshold 1643065983.144720141,0
I220124 23:13:04.156378 3173 jobs/registry.go:1093  [n1] 1008  BACKUP job 730711928388419585: stepping through state reverting with error: failed to run backup: exporting 13 ranges: exporting /Table/56/{3-4}: batch timestamp 1643065981.627201349,0 must be after replica GC threshold 1643065983.144720141,0
...
I220124 23:13:04.183219 230 kv/kvserver/pkg/kv/kvserver/replica.go:771  [n1,s1,r46/1:/{Table/56/3-Max}] 1009  DEBUG: set span config with gc.ttl_seconds = 90000

Specifically we go through the following set of range splits:

  • r43/1:/{Table/47-Max}
  • splits off r44/1:/{Table/56-Max}
  • splits off r45/1:/{Table/56/2-Max}
  • splits off r46/1:/{Table/56/3-Max}

When evaluating the export request ("ExportRequest for span /Table/56/{3-4}"), we see we're actually hitting r45 before it has fully split off r46, which gets the "default" TTL. Specifically, this sequence of logs appears, in order:

I220124 23:13:04.061052 2862 kv/kvserver/pkg/kv/kvserver/replica_command.go:400  [n1,split,s1,r45/1:/{Table/56/2-Max}] 993  initiating a split of this range at key /Table/56/3 [r46] (span config)
...
I220124 23:13:04.139194 2463 ccl/backupccl/pkg/ccl/backupccl/backup_processor.go:405  [n1,job=730711928388419585] 1006  sending ExportRequest for span /Table/56/{3-4} (attempt 1, priority normal)
...
E220124 23:13:04.144829 2463 kv/kvserver/pkg/kv/kvserver/replica.go:1456  [n1,s1,r45/1:/{Table/56/2-Max}] 1007  DEBUG: batch timestamp 1643065981.627201349,0 must be after replica GC threshold 1643065983.144720141,0
...
I220124 23:13:04.183219 230 kv/kvserver/pkg/kv/kvserver/replica.go:771  [n1,s1,r46/1:/{Table/56/3-Max}] 1009  DEBUG: set span config with gc.ttl_seconds = 90000

This is basically just #71977, which has been true for CRDB since time immemorial. Because we process span configs one at a time, splitting off ranges one at a time, when initially carving out r45 to contain the index's data we first split on the left side of the key range. That creates a range encompassing keys after the index's data (r45/1:/{Table/56/2-Max}) while temporarily applying just the index's config (the 1s TTL). If an export request addressed to /Table/56/{3-4} is evaluated on r45 before the more specific r46 (which carries the default 25h TTL) splits off, we run into the batch errors above. Since #73876 introduces a bit more asynchrony than before, I believe these range splits are now concurrent with the BACKUP itself; that wasn't true before (the splits would've been processed earlier). What this means for the test is similar to #75436: either retry the export or have the test explicitly wait for these range boundaries/configs to form (see the sketch below). Do you mind if I leave this on your plate? I think it should be easy, especially given it's reproducible.
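
A sketch of the "explicitly wait" option (a hypothetical helper, not necessarily what the eventual fix does; table ID 56 and the split points are taken from the logs above, and a real test would look them up rather than hardcode them):

    // Before the final backup, poll until the table has split at its index
    // boundaries, so the ExportRequest for /Table/56/{3-4} is addressed to the
    // narrower range rather than to r45 while it still carries the index's 1s TTL.
    testutils.SucceedsSoon(t, func() error {
        rows := sqlRunner.QueryStr(t,
            `SELECT start_pretty FROM crdb_internal.ranges WHERE start_pretty LIKE '/Table/56%'`)
        starts := make(map[string]bool)
        for _, row := range rows {
            starts[row[0]] = true
        }
        for _, want := range []string{"/Table/56", "/Table/56/2", "/Table/56/3"} {
            if !starts[want] {
                return errors.Errorf("still waiting for a range starting at %s", want)
            }
        }
        return nil
    })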

@cockroach-teamcity
Member Author

ccl/backupccl.TestGCDropIndexSpanExpansion failed with artifacts on master @ 421a767fa41a67efa157a96be8b8e43cdbc55b63:

=== RUN   TestGCDropIndexSpanExpansion
    test_log_scope.go:79: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion1403063623
    test_log_scope.go:80: use -show-logs to present logs inline
    backup_test.go:9036: error executing 'BACKUP INTO LATEST IN 'nodelocal://0/foo' WITH revision_history': pq: failed to run backup: exporting 13 ranges: exporting /Table/56/{1-2}: batch timestamp 1643070671.926476161,0 must be after replica GC threshold 1643070673.308443057,0
    panic.go:661: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestGCDropIndexSpanExpansion1403063623
--- FAIL: TestGCDropIndexSpanExpansion (9.45s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-json

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Jan 26, 2022
75491: backupccl: deflake TestGCDropIndexSpanExpansion r=irfansharif a=adityamaru

With #73876 there is a bit more asynchrony than before and
thus the test must wait until all the ranges have completed splitting
before it attempts the last backup, so that the ExportRequest targets
the range with the correct SpanConfig applied to it.

Fixes: #75202

Release note: None

Co-authored-by: Aditya Maru <[email protected]>
@craig craig bot closed this as completed in 69cf0d4 Jan 26, 2022