
changefeedccl: Rework error handling #90810

Merged 1 commit on Nov 6, 2022

Conversation

miretskiy
Contributor

@miretskiy miretskiy commented Oct 27, 2022

Prior to this PR, changefeeds relied on an allowlist
approach to determine which errors were retryable;
all other errors were deemed terminal, causing the
changefeed to fail.

That approach is brittle and causes unwanted
changefeed terminations.

This PR inverts the approach: all errors are treated
as retryable unless indicated otherwise. Errors that
the changefeed knows to be fatal are handled explicitly
by marking them as terminal. For example, a changefeed
exits if the targeted table is dropped. On the other hand,
an inability to read the table for any reason is not
treated as terminal.

Fixes #90320
Fixes #77549
Fixes #63317
Fixes #71341
Fixes #73016
Informs CRDB-6788
Informs CRDB-7581

Release Note (enterprise change): Changefeeds now treat
all errors as retryable unless explicitly indicated otherwise.
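
For illustration, here is a minimal, self-contained Go sketch of the classification model described above: everything is retried unless explicitly wrapped as terminal. The marker type and helpers are stand-ins for the real `changefeedbase` helpers (such as `WithTerminalError`) and are not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// terminalError marks an error that should fail the changefeed rather than
// be retried. Illustrative stand-in for the changefeedbase marker.
type terminalError struct{ cause error }

func (e *terminalError) Error() string { return e.cause.Error() }
func (e *terminalError) Unwrap() error { return e.cause }

// withTerminalError wraps err so that the retry loop treats it as fatal.
func withTerminalError(err error) error {
	if err == nil {
		return nil
	}
	return &terminalError{cause: err}
}

// isRetryable reports whether the changefeed should retry err: everything
// is retryable unless it was explicitly marked terminal.
func isRetryable(err error) bool {
	var t *terminalError
	return err != nil && !errors.As(err, &t)
}

func main() {
	transient := fmt.Errorf("sink unavailable: connection refused")
	fatal := withTerminalError(fmt.Errorf("targeted table %q was dropped", "orders"))

	fmt.Println(isRetryable(transient)) // true: retried by default
	fmt.Println(isRetryable(fatal))     // false: changefeed fails
}
```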

@miretskiy miretskiy requested a review from a team as a code owner October 27, 2022 22:04
@miretskiy miretskiy requested review from a team and HonoreDB and removed request for a team October 27, 2022 22:04
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@HonoreDB HonoreDB left a comment


Thanks for doing this; I agree it's a good idea. I do think that if we're defaulting to retryable, we need to make that conditional on the feed having already checkpointed at least once. Otherwise, too many problems that require manual intervention will manifest as a changefeed that is always running but never makes progress. A changefeed that's never made progress isn't worth saving -- the rare case where you get a transient error right away isn't worth it.

Reviewed 4 of 4 files at r1, 5 of 23 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

				date := *d.(*tree.DDate)
				if !date.IsFinite() {
					return nil, changefeedbase.WithTerminalError(errors.Errorf(

It might be better to have all errors raised out of an encoder be marked terminal unless marked retryable; the only retryable error I can think of that could come out of an encoder is a schema registry connectivity failure.


pkg/ccl/changefeedccl/changefeed_processors.go line 1042 at r2 (raw file):

				if cf.frontier.boundaryType == jobspb.ResolvedSpan_EXIT {
					err = changefeedbase.WithTerminalError(errors.Wrapf(err,
						"shut down due to schema change and %s=%q",

Is this always a schema change or can it be end time/initial scan only?


pkg/ccl/changefeedccl/retry.go line 51 at r2 (raw file):

const resetRetryAfter = 10 * time.Minute

// Retry is a think wrapper around retry.Retry which

thin?
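
Judging from the quoted constant, the wrapper appears to reset its backoff state once a run has stayed healthy for `resetRetryAfter` before failing. The following is a rough, self-contained sketch of that pattern under that assumption, not the actual changefeedccl `Retry` implementation:

```go
package main

import (
	"fmt"
	"time"
)

const resetRetryAfter = 10 * time.Minute

// runWithRetry retries work with exponential backoff. If a run stays healthy
// for longer than resetRetryAfter before failing, the backoff is reset, so a
// long-running changefeed that hits a transient error starts over from the
// short initial backoff instead of the fully grown one.
func runWithRetry(work func() error) {
	const (
		initialBackoff = time.Second
		maxBackoff     = time.Minute
	)
	backoff := initialBackoff
	for {
		started := time.Now()
		err := work()
		if err == nil {
			return
		}
		if time.Since(started) > resetRetryAfter {
			// The previous run made progress for a while before failing,
			// so restart the backoff schedule instead of growing it further.
			backoff = initialBackoff
		}
		fmt.Printf("retrying in %s: %v\n", backoff, err)
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	runWithRetry(func() error {
		if attempts++; attempts < 3 {
			return fmt.Errorf("transient failure %d", attempts)
		}
		return nil
	})
}
```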


pkg/sql/catalog/lease/lease.go line 1019 at r2 (raw file):

}

// IsDraining returns true if this nodes lease manager is draining.

node's

Contributor Author

@miretskiy miretskiy left a comment


Hmm... I'm not sure I agree with this. I'm all for detecting errors and doing sanity checks during
changefeed creation. The reality is that a changefeed may run for a while without a checkpoint -- e.g. during the initial scan.
Logic that effectively says "we will retry, but only if the changefeed has run for 10 or more minutes"
seems rather arbitrary and not that great either. Lots of things can happen in the cluster during that time.

None of those heuristics would replace the need for proper monitoring and observability,
so I'd rather improve those areas if needed instead.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

It might be better to have all errors raised out of an encoder be marked terminal unless marked retryable; the only retryable error I can think of that could come out of an encoder is a schema registry connectivity failure.

These are the only two cases that exist today where this happens. If we get into a situation where we are returning mostly those errors, then perhaps we can reconsider.
Do you feel strongly?


pkg/ccl/changefeedccl/changefeed_processors.go line 1042 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Is this always a schema change or can it be end time/initial scan only?

I believe so; above we check whether endTime is empty, or whether we haven't reached the end time yet.


pkg/ccl/changefeedccl/retry.go line 51 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

thin?

Done.


pkg/sql/catalog/lease/lease.go line 1019 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

node's

Done.

@miretskiy miretskiy force-pushed the err_retry branch 5 times, most recently from f743aea to 1a93051 on November 2, 2022 12:30
Contributor

@HonoreDB HonoreDB left a comment


Should we really go for 10 minutes without an initial scan checkpoint? It feels like there's some kind of missing communication, anyway, if it's ever possible for a healthy changefeed to make progress for 10 minutes without reporting it. Maybe we should, say, have special-case logic so that if you're a changefeed aggregator on the same node as the changefeed frontier/coordinator, you immediately flush and checkpoint (or initial-scan-checkpoint) after your first successful emit.

Reviewed 1 of 16 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

These are the only two cases that exist today where this happens. If we get into a situation where we are returning mostly those errors, then perhaps we can reconsider.
Do you feel strongly?

I guess if it's just these two and we have the possibility of retryable schema registry errors, this seems fine as is.

@miretskiy
Contributor Author

Should we really go for 10 minutes without an initial scan checkpoint?

Yeah -- absolutely; initial scan checkpoints are expensive and, by default, are set to happen every 10 minutes.

@miretskiy miretskiy force-pushed the err_retry branch 4 times, most recently from 5156fa4 to 1853e41 on November 3, 2022 21:32
@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 6, 2022

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@craig
Contributor

craig bot commented Nov 6, 2022

Build succeeded:

@samiskin
Contributor

@miretskiy do we plan on backporting this change to 22.1?

@miretskiy
Contributor Author

22.1? Probably not. 22.2.x -- maybe?

@amruss
Contributor

amruss commented Nov 29, 2022

Let me know if we change our minds on backporting this to 22.2.1; I'm going to include it in the docs as a breaking change.

@miretskiy
Contributor Author

blathers backport 22.2

@blathers-crl

blathers-crl bot commented Nov 30, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 86fffa9 to blathers/backport-release-22.2-90810: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Dec 7, 2022
Prior PR cockroachdb#90810 changed the error handling approach
used by CDC to "retry by default".

The above PR failed to account for some of the
situations that may arise in a mixed-version state.
In particular, when upgrading, if the aggregator node
gets restarted with the new version while the coordinator
still runs the old binary, *and* the new aggregator encounters
a retryable error, that error would not be explicitly marked
as retryable, causing the old coordinator binary to treat
the error as a terminal error.

The other combination (new coordinator, old aggregator) is not
susceptible to this situation.

This PR partially reverts the changes in cockroachdb#90810 so that
previously retryable errors continue to be explicitly marked
as retryable.

There is no need to introduce version gates since the error
handling should be backward compatible and, with this PR, operate
correctly in a mixed-version state.

Epic: None

Release note: none
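
A tiny illustrative sketch of why the explicit marking matters in a mixed-version cluster: an old-binary coordinator only retries errors that carry the retryable marker, so the new binary keeps attaching it even though it no longer needs it for its own decisions. The types and helpers below are stand-ins, not the changefeedbase originals.

```go
package main

import (
	"errors"
	"fmt"
)

// retryableError explicitly marks an error as retryable. With the
// "retry by default" model a new-binary node no longer needs this marker for
// its own decisions, but an old-binary coordinator only retries errors that
// carry it, so the fix keeps attaching it during the mixed-version window.
type retryableError struct{ cause error }

func (e *retryableError) Error() string { return "retryable: " + e.cause.Error() }
func (e *retryableError) Unwrap() error { return e.cause }

func markRetryable(err error) error {
	if err == nil {
		return nil
	}
	return &retryableError{cause: err}
}

// oldCoordinatorIsRetryable models the pre-#90810 allowlist behavior:
// only explicitly marked errors are retried; everything else is terminal.
func oldCoordinatorIsRetryable(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

func main() {
	marked := markRetryable(fmt.Errorf("node is draining"))
	unmarked := fmt.Errorf("node is draining")

	fmt.Println(oldCoordinatorIsRetryable(marked))   // true: old coordinator retries
	fmt.Println(oldCoordinatorIsRetryable(unmarked)) // false: old coordinator treats it as terminal
}
```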
craig bot pushed a commit that referenced this pull request Dec 7, 2022
93072: backupccl: speed up listing in some backup callsites r=msbutler a=adityamaru

There were a couple of calls to `List` that would list all the files in the backup destination bucket. In the presence of a large number of files, usually SST files in the `data/` folder, this listing could be very slow. This could manifest as a RESTORE command hanging during the planning phase when we go to resolve encryption information from the base backup bucket.

To fix this, we add a `delimiter` to the `List` call that will make skipping over all files in the `data/` directory much cheaper.

Informs: https://github.com/cockroachlabs/support/issues/1937

Release note (bug fix): Speed up slow listing calls that could manifest as restore queries hanging during execution, in the presence of many backup files.

93155: changefeedccl: Fix error handling in mixed version state r=miretskiy a=miretskiy

Prior PR #90810 changed the error handling approach used by CDC to "retry by default".

The above PR failed to account for some of the situations that may arise in a mixed-version state. In particular, when upgrading, if the aggregator node gets restarted with the new version while the coordinator still runs the old binary, *and* the new aggregator encounters a retryable error, that error would not be explicitly marked as retryable, causing the old coordinator binary to treat the error as a terminal error.

The other combination (new coordinator, old aggregator) is not susceptible to this situation.

This PR partially reverts the changes in #90810 so that previously retryable errors continue to be explicitly marked as retryable.

There is no need to introduce version gates since the error handling should be backward compatible and, with this PR, operate correctly in a mixed-version state.

Epic: None

Release note: none

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 17, 2023
Changefeeds utilize protected timestamps (PTS) to ensure
that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that
hold on to stale PTS records.  This PR expands this functionality
to apply to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will
retry almost every error -- and that means that if a running changefeed
job fails to make progress for a very long time, it is possible that a
PTS record will block garbage collection for many days, weeks, or even
months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause`
option. This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, the
`protect_data_from_gc_on_pause` option is deprecated; it is no longer
needed since changefeed jobs always protect data.
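
As a rough illustration of the fail-safe described in this commit message, the decision boils down to comparing the age of the changefeed's PTS record against the configured maximum. The function below is a hypothetical sketch under that reading, not the actual jobs/changefeed code.

```go
package main

import (
	"fmt"
	"time"
)

// shouldCancelForStalePTS sketches the fail-safe described above: if the
// changefeed's protected timestamp record is older than the configured
// maximum age (changefeed.protect_timestamp.max_age, default 4 days), the
// job is failed rather than allowed to block garbage collection forever.
// A zero max age disables the fail-safe entirely.
func shouldCancelForStalePTS(ptsRecordTS time.Time, maxAge time.Duration, now time.Time) bool {
	if maxAge == 0 {
		return false
	}
	return now.Sub(ptsRecordTS) > maxAge
}

func main() {
	const defaultMaxAge = 4 * 24 * time.Hour
	stalled := time.Now().Add(-5 * 24 * time.Hour) // no progress for 5 days

	fmt.Println(shouldCancelForStalePTS(stalled, defaultMaxAge, time.Now())) // true: cancel the job
	fmt.Println(shouldCancelForStalePTS(stalled, 0, time.Now()))             // false: fail-safe disabled
}
```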
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 19, 2023
Changefeeds utilize protected timestamps (PTS) to ensure
that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that
hold on to stale PTS records.  This PR expands this functionality
to apply to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will
retry almost every error -- and that means that if a running changefeed
job fails to make progress for a very long time, it is possible that a
PTS record will block garbage collection for many days, weeks, or even
months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.
This setting only applies to newly created changefeeds.
Use the `ALTER CHANGEFEED` statement to set the `gc_protect_expires_after`
option for existing changefeeds to enable PTS expiration.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause`
option. This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, the
`protect_data_from_gc_on_pause` option is deprecated; it is no longer
needed since changefeed jobs always protect data.
craig bot pushed a commit that referenced this pull request May 19, 2023
103528: server: Revert server.shutdown.jobs_wait to 0 r=miretskiy a=miretskiy

Revert the default setting for `server.shutdown.jobs_wait` to 0
to ensure that shutdown does not wait for active jobs.

Issues: none
Epic: None

Release note: None

103539: changefeedccl: Improve protected timestamp handling r=miretskiy a=miretskiy

Changefeeds utilize protected timestamps (PTS) to ensure that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows. PR #97148 introduced the capability to cancel jobs that hold on to stale PTS records.  This PR expands this functionality to apply to all jobs -- not just paused jobs.

This is necessary because, due to #90810, changefeeds will retry almost every error -- and that means that if a running changefeed job fails to make progress for a very long time, it is possible that a PTS record will block garbage collection for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to a generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs.

Fixes #103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for too long.  This setting is used if the explicit `gc_protect_expires_after` option was not set.  In addition, the `protect_data_from_gc_on_pause` option is deprecated; it is no longer needed since changefeed jobs always protect data.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>