
changefeedccl: Rework error handling #90810

Merged 1 commit on Nov 6, 2022

Conversation

miretskiy
Contributor

@miretskiy miretskiy commented Oct 27, 2022

Prior to this PR, changefeeds relied on an allowlist
approach to determine which errors were retryable;
all other errors were deemed terminal, causing the
changefeed to fail.

That approach is brittle and causes unwanted
changefeed terminations.

This PR inverts the approach: all errors are treated
as retryable unless indicated otherwise. Errors that
the changefeed knows to be fatal are handled explicitly
by marking them as terminal. For example, a changefeed
exits if the targeted table is dropped. On the other hand,
an inability to read the table for any reason is not
treated as terminal.

Fixes #90320
Fixes #77549
Fixes #63317
Fixes #71341
Fixes #73016
Informs CRDB-6788
Informs CRDB-7581

Release Note (enterprise change): Changefeeds now treat
all errors as retryable unless explicitly indicated otherwise.
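
For illustration, here is a minimal, self-contained Go sketch of the classification model described above: everything is retried unless explicitly wrapped as terminal. The marker type and helpers are stand-ins for the real `changefeedbase` helpers (such as `WithTerminalError`) and are not the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// terminalError marks an error that should fail the changefeed rather than
// be retried. Illustrative stand-in for the changefeedbase marker.
type terminalError struct{ cause error }

func (e *terminalError) Error() string { return e.cause.Error() }
func (e *terminalError) Unwrap() error { return e.cause }

// withTerminalError wraps err so that the retry loop treats it as fatal.
func withTerminalError(err error) error {
	if err == nil {
		return nil
	}
	return &terminalError{cause: err}
}

// isRetryable reports whether the changefeed should retry err: everything
// is retryable unless it was explicitly marked terminal.
func isRetryable(err error) bool {
	var t *terminalError
	return err != nil && !errors.As(err, &t)
}

func main() {
	transient := fmt.Errorf("sink unavailable: connection refused")
	fatal := withTerminalError(fmt.Errorf("targeted table %q was dropped", "orders"))

	fmt.Println(isRetryable(transient)) // true: retried by default
	fmt.Println(isRetryable(fatal))     // false: changefeed fails
}
```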

@miretskiy miretskiy requested a review from a team as a code owner October 27, 2022 22:04
@miretskiy miretskiy requested review from a team and HonoreDB and removed request for a team October 27, 2022 22:04
@cockroach-teamcity
Member

This change is Reviewable

Contributor

@HonoreDB HonoreDB left a comment


Thanks for doing this; I agree it's a good idea. I do think that if we're defaulting to retryable, we need to make that conditional on the feed having already checkpointed at least once. Otherwise, too many problems that require manual intervention will manifest as a changefeed that is always running but never makes progress. A changefeed that's never made progress isn't worth saving -- the rare case where you get a transient error right away isn't worth it.

Reviewed 4 of 4 files at r1, 5 of 23 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

				date := *d.(*tree.DDate)
				if !date.IsFinite() {
					return nil, changefeedbase.WithTerminalError(errors.Errorf(

It might be better to have all errors raised out of an encoder be marked terminal unless marked retryable; the only retryable error I can think of that could come out of an encoder is a schema registry connectivity failure.


pkg/ccl/changefeedccl/changefeed_processors.go line 1042 at r2 (raw file):

				if cf.frontier.boundaryType == jobspb.ResolvedSpan_EXIT {
					err = changefeedbase.WithTerminalError(errors.Wrapf(err,
						"shut down due to schema change and %s=%q",

Is this always a schema change or can it be end time/initial scan only?


pkg/ccl/changefeedccl/retry.go line 51 at r2 (raw file):

const resetRetryAfter = 10 * time.Minute

// Retry is a think wrapper around retry.Retry which

thin?
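
Judging from the quoted constant, the wrapper appears to reset its backoff state once a run has stayed healthy for `resetRetryAfter` before failing. The following is a rough, self-contained sketch of that pattern under that assumption, not the actual changefeedccl `Retry` implementation:

```go
package main

import (
	"fmt"
	"time"
)

const resetRetryAfter = 10 * time.Minute

// runWithRetry retries work with exponential backoff. If a run stays healthy
// for longer than resetRetryAfter before failing, the backoff is reset, so a
// long-running changefeed that hits a transient error starts over from the
// short initial backoff instead of the fully grown one.
func runWithRetry(work func() error) {
	const (
		initialBackoff = time.Second
		maxBackoff     = time.Minute
	)
	backoff := initialBackoff
	for {
		started := time.Now()
		err := work()
		if err == nil {
			return
		}
		if time.Since(started) > resetRetryAfter {
			// The previous run made progress for a while before failing,
			// so restart the backoff schedule instead of growing it further.
			backoff = initialBackoff
		}
		fmt.Printf("retrying in %s: %v\n", backoff, err)
		time.Sleep(backoff)
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	attempts := 0
	runWithRetry(func() error {
		if attempts++; attempts < 3 {
			return fmt.Errorf("transient failure %d", attempts)
		}
		return nil
	})
}
```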


pkg/sql/catalog/lease/lease.go line 1019 at r2 (raw file):

}

// IsDraining returns true if this nodes lease manager is draining.

node's

Contributor Author

@miretskiy miretskiy left a comment


Hmm... I'm not sure I agree with this. I'm all for detecting errors and doing sanity checks during
changefeed creation. The reality is that a changefeed may run for a while without a checkpoint -- e.g. during the initial scan.
Logic that effectively says "we will retry, but only if the changefeed has run for 10 or more minutes"
seems rather arbitrary and not that great either. Lots of things can happen in the cluster during that time.

None of those heuristics would replace the need for proper monitoring and observability,
so I'd rather improve those areas if needed instead.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

It might be better to have all errors raised out of an encoder be marked terminal unless marked retryable; the only retryable error I can think of that could come out of an encoder is a schema registry connectivity failure.

These are the only two cases that exist today where this happens. If we get into a situation where we are returning mostly those errors, then perhaps we can reconsider.
Do you feel strongly?


pkg/ccl/changefeedccl/changefeed_processors.go line 1042 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

Is this always a schema change or can it be end time/initial scan only?

I believe so; above we check whether endTime is empty, or whether we haven't reached the end time yet.


pkg/ccl/changefeedccl/retry.go line 51 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

thin?

Done.


pkg/sql/catalog/lease/lease.go line 1019 at r2 (raw file):

Previously, HonoreDB (Aaron Zinger) wrote…

node's

Done.

@miretskiy miretskiy force-pushed the err_retry branch 5 times, most recently from f743aea to 1a93051 on November 2, 2022 12:30
Contributor

@HonoreDB HonoreDB left a comment


Should we really go for 10 minutes without an initial scan checkpoint? It feels like there's some kind of missing communication, anyway, if it's ever possible for a healthy changefeed to make progress for 10 minutes without reporting it. Maybe we should, say, have special-case logic so that if you're a changefeed aggregator on the same node as the changefeed frontier/coordinator, you immediately flush and checkpoint (or initial-scan-checkpoint) after your first successful emit.

Reviewed 1 of 16 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/avro.go line 412 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

These are the only two cases that exist today where this happens. If we get into a situation where we are returning mostly those errors, then perhaps we can reconsider.
Do you feel strongly?

I guess if it's just these two and we have the possibility of retryable schema registry errors, this seems fine as is.

@miretskiy
Contributor Author

Should we really go for 10 minutes without an initial scan checkpoint?

Yeah -- absolutely; initial scan checkpoints are expensive and, by default, are set to happen every 10 minutes.

@miretskiy miretskiy force-pushed the err_retry branch 4 times, most recently from 5156fa4 to 1853e41 on November 3, 2022 21:32
@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed:

@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented Nov 4, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 6, 2022

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@craig
Contributor

craig bot commented Nov 6, 2022

Build succeeded:

@samiskin
Contributor

@miretskiy do we plan on backporting this change to 22.1?

@miretskiy
Contributor Author

22.1? Probably not. 22.2.x -- maybe?

@amruss
Contributor

amruss commented Nov 29, 2022

Let me know if we change our minds on backporting this to 22.2.1; I'm going to include it in the docs as a breaking change.

@miretskiy
Contributor Author

blathers backport 22.2

@blathers-crl

blathers-crl bot commented Nov 30, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 86fffa9 to blathers/backport-release-22.2-90810: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Dec 7, 2022
Prior PR cockroachdb#90810 changed the error handling approach
used by CDC to "retry by default".

The above PR failed to account for some of the
situations that may arise in a mixed-version state.
In particular, when upgrading, if the aggregator node
gets restarted with the new version while the coordinator
still runs the old binary, *and* the new aggregator encounters
a retryable error, that error would not be explicitly marked
as retryable, causing the old coordinator binary to treat
the error as a terminal error.

The other combination (new coordinator, old aggregator) is not
susceptible to this situation.

This PR partially reverts the changes in cockroachdb#90810 so that
previously retryable errors continue to be explicitly marked
as retryable.

There is no need to introduce version gates since the error
handling should be backward compatible and, with this PR, operate
correctly in a mixed-version state.

Epic: None

Release note: none
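
A tiny illustrative sketch of why the explicit marking matters in a mixed-version cluster: an old-binary coordinator only retries errors that carry the retryable marker, so the new binary keeps attaching it even though it no longer needs it for its own decisions. The types and helpers below are stand-ins, not the changefeedbase originals.

```go
package main

import (
	"errors"
	"fmt"
)

// retryableError explicitly marks an error as retryable. With the
// "retry by default" model a new-binary node no longer needs this marker for
// its own decisions, but an old-binary coordinator only retries errors that
// carry it, so the fix keeps attaching it during the mixed-version window.
type retryableError struct{ cause error }

func (e *retryableError) Error() string { return "retryable: " + e.cause.Error() }
func (e *retryableError) Unwrap() error { return e.cause }

func markRetryable(err error) error {
	if err == nil {
		return nil
	}
	return &retryableError{cause: err}
}

// oldCoordinatorIsRetryable models the pre-#90810 allowlist behavior:
// only explicitly marked errors are retried; everything else is terminal.
func oldCoordinatorIsRetryable(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

func main() {
	marked := markRetryable(fmt.Errorf("node is draining"))
	unmarked := fmt.Errorf("node is draining")

	fmt.Println(oldCoordinatorIsRetryable(marked))   // true: old coordinator retries
	fmt.Println(oldCoordinatorIsRetryable(unmarked)) // false: old coordinator treats it as terminal
}
```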
craig bot pushed a commit that referenced this pull request Dec 7, 2022
93072: backupccl: speed up listing in some backup callsites r=msbutler a=adityamaru

There were a couple of calls to `List` that would list all the files in the backup destination bucket. In the presence of a large number of files, usually SST files in the `data/` folder, this listing could be very slow. This could manifest as a RESTORE command hanging during the planning phase when we go to resolve encryption information from the base backup bucket.

To fix this, we add a `delimiter` to the `List` call that will make skipping over all files in the `data/` directory much cheaper.

Informs: https://github.com/cockroachlabs/support/issues/1937

Release note (bug fix): Speed up slow listing calls that could manifest as restore queries hanging during execution, in the presence of many backup files.

93155: changefeedccl: Fix error handling in mixed version state r=miretskiy a=miretskiy

Prior PR #90810 changed the error handling approach used by CDC to "retry by default".

The above PR failed to account for some of the situations that may arise in a mixed-version state. In particular, when upgrading, if the aggregator node gets restarted with the new version while the coordinator still runs the old binary, *and* the new aggregator encounters a retryable error, that error would not be explicitly marked as retryable, causing the old coordinator binary to treat the error as a terminal error.

The other combination (new coordinator, old aggregator) is not susceptible to this situation.

This PR partially reverts the changes in #90810 so that previously retryable errors continue to be explicitly marked as retryable.

There is no need to introduce version gates since the error handling should be backward compatible and, with this PR, operate correctly in a mixed-version state.

Epic: None

Release note: none

Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 17, 2023
Changefeeds utilize protected timestamps (PTS) to ensure
that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that
hold on to stale PTS records.  This PR expands this functionality
to apply to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will
retry almost every error -- and that means that if a running changefeed
job fails to make progress for a very long time, it is possible that a
PTS record will block garbage collection for many days, weeks, or even
months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause`
option. This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, the
`protect_data_from_gc_on_pause` option is deprecated; it is no longer
needed since changefeed jobs always protect data.
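
As a rough illustration of the fail-safe described in this commit message, the decision boils down to comparing the age of the changefeed's PTS record against the configured maximum. The function below is a hypothetical sketch under that reading, not the actual jobs/changefeed code.

```go
package main

import (
	"fmt"
	"time"
)

// shouldCancelForStalePTS sketches the fail-safe described above: if the
// changefeed's protected timestamp record is older than the configured
// maximum age (changefeed.protect_timestamp.max_age, default 4 days), the
// job is failed rather than allowed to block garbage collection forever.
// A zero max age disables the fail-safe entirely.
func shouldCancelForStalePTS(ptsRecordTS time.Time, maxAge time.Duration, now time.Time) bool {
	if maxAge == 0 {
		return false
	}
	return now.Sub(ptsRecordTS) > maxAge
}

func main() {
	const defaultMaxAge = 4 * 24 * time.Hour
	stalled := time.Now().Add(-5 * 24 * time.Hour) // no progress for 5 days

	fmt.Println(shouldCancelForStalePTS(stalled, defaultMaxAge, time.Now())) // true: cancel the job
	fmt.Println(shouldCancelForStalePTS(stalled, 0, time.Now()))             // false: fail-safe disabled
}
```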
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 19, 2023
Changefeeds utilize protected timestamps (PTS) to ensure
that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that
hold on to stale PTS records.  This PR expands this functionality
to apply to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will
retry almost every error -- and that means that if a running changefeed
job fails to make progress for a very long time, it is possible that a
PTS record will block garbage collection for many days, weeks, or even
months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.
This setting only applies to newly created changefeeds.
Use the `ALTER CHANGEFEED` statement to set the `gc_protect_expires_after`
option for existing changefeeds to enable PTS expiration.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause`
option. This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, the
`protect_data_from_gc_on_pause` option is deprecated; it is no longer
needed since changefeed jobs always protect data.
craig bot pushed a commit that referenced this pull request May 19, 2023
103528: server: Revert server.shutdown.jobs_wait to 0 r=miretskiy a=miretskiy

Revert the default setting for `server.shutdown.jobs_wait` to 0
to ensure that shutdown does not wait for active jobs.

Issues: none
Epic: None

Release note: None

103539: changefeedccl: Improve protected timestamp handling r=miretskiy a=miretskiy

Changefeeds utilize protected timestamps (PTS) to ensure that data is not garbage collected (GC) prematurely.

This subsystem has gone through many rounds of changes, resulting in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows. PR #97148 introduced the capability to cancel jobs that hold on to stale PTS records.  This PR expands this functionality to apply to all jobs -- not just paused jobs.

This is necessary because, due to #90810, changefeeds will retry almost every error -- and that means that if a running changefeed job fails to make progress for a very long time, it is possible that a PTS record will block garbage collection for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to a generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing so could result in stability issues once a stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs.

Fixes #103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for too long.  This setting is used if the explicit `gc_protect_expires_after` option was not set.  In addition, the `protect_data_from_gc_on_pause` option is deprecated; it is no longer needed since changefeed jobs always protect data.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>