
roachtest: cdc/mixed-versions failed #106878

Closed
cockroach-teamcity opened this issue Jul 15, 2023 · 7 comments
Labels: A-cdc (Change Data Capture), branch-release-22.2, C-test-failure (broken test), O-roachtest, O-robot (originated from a bot), release-blocker, T-cdc
Milestone: 22.2

@cockroach-teamcity (Member) commented Jul 15, 2023

roachtest.cdc/mixed-versions failed with artifacts on release-22.2 @ 100a4aa3f590ac3989b3cc5a172996afaa9de862:

test artifacts and logs in: /artifacts/cdc/mixed-versions/run_1
(test_runner.go:985).runTest: test timed out (30m0s)

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/test-eng

This test on roachdash

Jira issue: CRDB-29749

Epic: CRDB-11732

cockroach-teamcity added the branch-release-22.2, C-test-failure, O-roachtest, O-robot, and release-blocker labels on Jul 15, 2023
cockroach-teamcity added this to the 22.2 milestone on Jul 15, 2023
blathers-crl (bot) added the T-testeng (TestEng Team) label on Jul 15, 2023
renatolabs added the T-cdc label and removed the T-testeng label on Jul 17, 2023
@blathers-crl (bot) commented Jul 17, 2023

cc @cockroachdb/cdc

blathers-crl (bot) added the A-cdc (Change Data Capture) label on Jul 17, 2023
@renatolabs (Collaborator) commented:

@cockroachdb/cdc Could you take a look at this failure, please?

@jayshrivastava (Contributor) commented:

At ~8:40 we get stuck waiting for resolved timestamps here, after rolling back the nodes:

tester.waitForResolvedTimestamps(),

08:40:47 versionupgrade.go:200: test status: versionUpgradeTest: starting step 12
08:40:47 mixed_version_cdc.go:163: test status: waiting for 5 resolved timestamps
08:41:04 mixed_version_cdc.go:274: 11 resolved timestamps validated, latest is 1m52.957808509s behind realtime
08:41:04 mixed_version_cdc.go:177: 1 of 5 timestamps resolved
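
For context, this wait step essentially counts resolved-timestamp messages coming off the changefeed until a target is reached or the test times out. A minimal sketch of that shape, in Go (the channel, target, and logging below are illustrative stand-ins, not the actual roachtest helper):

package cdctest

import (
	"context"
	"fmt"
	"log"
	"time"
)

// waitForResolvedTimestamps is a hypothetical sketch of the test's wait step:
// it counts resolved timestamps until target is reached or ctx expires.
// resolvedCh is assumed to be fed by whatever consumes the Kafka topic.
func waitForResolvedTimestamps(ctx context.Context, resolvedCh <-chan time.Time, target int) error {
	seen := 0
	for seen < target {
		select {
		case ts := <-resolvedCh:
			seen++
			log.Printf("%d of %d timestamps resolved, latest is %s behind realtime",
				seen, target, time.Since(ts))
		case <-ctx.Done():
			return fmt.Errorf("resolved only %d of %d timestamps: %w", seen, target, ctx.Err())
		}
	}
	return nil
}

If the changefeed stops emitting resolved timestamps, a loop like this simply blocks until the test-level timeout fires, which matches the 30m timeout seen above.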

The teardown is initiated at ~9:02

teardown: 09:02:16 test_runner.go:1035: [w12] dumped stacks to __stacks

It looks like the changefeed job failed around ~8:37:

882572078533705731	CHANGEFEED	CREATE CHANGEFEED FOR TABLE bank.bank INTO 'kafka://10.150.0.142:9092' WITH resolved = '10s', updated		root	{171}	failed	NULL	2023-07-15 08:33:43.575977	2023-07-15 08:33:43.597976	2023-07-15 08:40:44.57881	2023-07-15 08:40:44.546961	NULL	1689410351823428620.0000000000	unable to dial n1: breaker open	NULL	2675257580918454187	2023-07-15 08:40:44.560528	2023-07-15 08:41:14.560528	1	"{""running execution from '2023-07-15 08:33:43.597976' to '2023-07-15 08:37:14.997527' on 3 failed: could not register flowID {[38 79 173 244 38 254 74 58 184 112 115 40 60 254 156 37]} because the registry is draining""}"	"[{""executionEndMicros"": ""1689410234997527"", ""executionStartMicros"": ""1689410023597976"", ""instanceId"": 3, ""status"": ""running"", ""truncatedError"": ""could not register flowID {[38 79 173 244 38 254 74 58 184 112 115 40 60 254 156 37]} because the registry is draining""}]"

We also log a permanent changefeed shutdown at ~8:40:

./logs/2.unredacted/cockroach.teamcity-10921274-1689399618-34-n5cpu4-0002.ubuntu.2023-07-15T08_36_35Z.013474.log:I230715 08:40:04.181666 3482 ccl/changefeedccl/changefeed_stmt.go:1030 ⋮ [n2,job=‹CHANGEFEED id=882572078533705731›] 308  CHANGEFEED 882572078533705731 shutting down (cause: cannot acquire lease when draining)

This code path may be relevant here; perhaps we tried to mark this error as retryable at the job level:

if lm.IsDraining() {
	// This node is being drained. It's safe to propagate this error (to the
	// job registry) since job registry should not be able to commit this error
	// to the jobs table; but to be safe, make sure this error is marked as jobs
	// retryable error to ensure that some other node retries this changefeed.
	return jobs.MarkAsRetryJobError(cause)
}

Draining errors should be marked as retryable in one way or another, but it seems that retries did not happen in this case.
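
As a point of reference, the marking pattern in that snippet is essentially a sentinel wrapper that the job registry can test for before committing a terminal status. A rough, simplified sketch of the idea (this is not the actual jobs package implementation; the names below are stand-ins):

package main

import (
	"errors"
	"fmt"
)

// retryJobError is a simplified stand-in for the wrapper that something like
// jobs.MarkAsRetryJobError would apply to an error.
type retryJobError struct{ cause error }

func (e *retryJobError) Error() string { return "retriable job error: " + e.cause.Error() }
func (e *retryJobError) Unwrap() error { return e.cause }

func markAsRetryJobError(cause error) error { return &retryJobError{cause: cause} }

// isRetryJobError is what the registry side would consult before writing a
// terminal "failed" status to the jobs table.
func isRetryJobError(err error) bool {
	var r *retryJobError
	return errors.As(err, &r)
}

func main() {
	err := markAsRetryJobError(errors.New("cannot acquire lease when draining"))
	fmt.Println(isRetryJobError(err)) // true: retry the job instead of failing it
}

If the wrapper is dropped somewhere along the way, or the check never runs, the draining error would be committed as a terminal failure, which is consistent with the "failed" status seen in the jobs row above.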

@miretskiy (Contributor) commented Jul 17, 2023

The shutting down error is correct, and the log messages seem to indicate that nothing can be persisted to the jobs table (i.e. mark the changefeed failed, etc.). So the changefeed ought to become unclaimed and, eventually, get claimed by another node.

To be clear: I don't think this error is permanent.

@miretskiy (Contributor) commented:

It is strange though...

@jayshrivastava (Contributor) commented:

"the log messages seem to indicate that nothing can be persisted to the jobs table (i.e. mark changefeed failed, etc)"

Which log messages are you referring to? I can look into it. It looks like the job status was set to failed in the middle of the test. I'm looking for the reason why. Nothing should put it in that state during the test, and nothing can bring it out of that state.

@jayshrivastava (Contributor) commented:

Adding some notes from the discussion yesterday:

The node draining errors were common in 22.1 and are considered fixed from 22.2 onwards by treating all errors as retryable (I believe by this PR: #90810).
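
Conceptually (this is a schematic illustration, not the changefeed code itself), that change flips the default: instead of retrying only errors explicitly recognized as transient, every error is retried unless it is explicitly marked terminal.

package main

import (
	"errors"
	"fmt"
)

// errTerminal is a stand-in marker for errors a changefeed should never retry.
var errTerminal = errors.New("terminal changefeed error")

// shouldRetry contrasts the two policies: the older allowlist-style behavior
// retried only errors it recognized, so an unexpected "draining" error failed
// the job; the newer behavior retries anything not explicitly terminal.
func shouldRetry(err error, retryAllByDefault bool) bool {
	if err == nil {
		return false
	}
	if retryAllByDefault {
		return !errors.Is(err, errTerminal) // 22.2+ style
	}
	return false // 22.1 style: an unrecognized error fails the job
}

func main() {
	drainErr := errors.New("cannot acquire lease when draining")
	fmt.Println(shouldRetry(drainErr, false)) // false: job fails
	fmt.Println(shouldRetry(drainErr, true))  // true: job retries
}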

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jul 21, 2023
This test may flake due to the upgrade from 22.1->22.2. The
test asserts a changefeed remains running by checking for
resolved timestamps being emitted on a regular basis. The
problem with this is that, during the rolling upgrade,
the changefeed may fail with a "draining" error.
This issue is fixed in 22.2 onwards by treating all errors
as retryable.

Rather than skipping this test because 22.1 is EOLed, it is
preferable to still run this test regularly because it tests
22.2 functionality. This change adds a fix where the test
will poll the changefeed every 1s and recreate it if it fails.

Closes: cockroachdb#106878
Release note: None
Epic: None
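
For illustration, a rough sketch of what that poll-and-recreate loop could look like on the test side (the CREATE CHANGEFEED statement and sink URI are taken from the failure above; the helper name, wiring, and SQL shape are illustrative, not the actual roachtest change):

package cdcupgrade

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver used to reach the CockroachDB cluster
)

// createStmt mirrors the changefeed from the failed run above.
const createStmt = `CREATE CHANGEFEED FOR TABLE bank.bank INTO 'kafka://10.150.0.142:9092' WITH resolved = '10s', updated`

// pollAndRecreateChangefeed checks the changefeed's job status every second
// and recreates the feed if the job has failed (for example, with a draining
// error during the 22.1 -> 22.2 rolling upgrade).
func pollAndRecreateChangefeed(ctx context.Context, db *sql.DB, jobID int64) error {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			var status string
			if err := db.QueryRowContext(ctx,
				`SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID).Scan(&status); err != nil {
				return fmt.Errorf("polling job %d: %w", jobID, err)
			}
			if status == "failed" {
				log.Printf("changefeed job %d failed; recreating", jobID)
				if err := db.QueryRowContext(ctx, createStmt).Scan(&jobID); err != nil {
					return fmt.Errorf("recreating changefeed: %w", err)
				}
			}
		}
	}
}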
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jul 24, 2023 (same commit message as above)
jayshrivastava added a commit to jayshrivastava/cockroach that referenced this issue Jul 25, 2023 (same commit message as above)