Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: transient schemafeed errors are not retried #63317

Closed
stevendanna opened this issue Apr 8, 2021 · 5 comments · Fixed by #90810
Closed

changefeedccl: transient schemafeed errors are not retried #63317

stevendanna opened this issue Apr 8, 2021 · 5 comments · Fixed by #90810
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@stevendanna
Copy link
Collaborator

stevendanna commented Apr 8, 2021

Describe the problem

In #63282 we saw that a clock drift issue led to one node being terminated. This then resulted in a changefeed failure because the node termination led to a failure to fetch a table descriptor during schema polling.

It appears that all errors from the schemafeed are treated as fatal, even though they might be caused by transient communication issues.

We may want identify which of the many error returns in fetchDescriptorVersions may be retriable. It seems likely that at least this ought to be:

	res, pErr := kv.SendWrappedWith(ctx, db.NonTransactionalSender(), header, req)
	if log.V(2) {
		log.Infof(ctx, `fetched table descs (%s,%s] took %s`, startTS, endTS, timeutil.Since(start))
	}
	if pErr != nil {
		err := pErr.GoError()
		return nil, errors.Wrapf(err, `fetching changes for %s`, span)
	}

Note thatpErr.GoError() will return an UnhandledRetriableError in some cases, but the rest of the changefeed code doesn't look for that particular type. And, it is isn't clear to me without further verification that that will cover all of the transient errors the function could return.

Jira issue: CRDB-6524

@stevendanna stevendanna added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 8, 2021
@miretskiy
Copy link
Contributor

I honestly think that for sink-based changefeeds, we should consider switching error handling model from "errors are fatal, unless marked as retryable" to "errors are retryable, unless marked fatal". We should also augment retries with addressing #62556 (do not retry forever without making progress).

@stevendanna
Copy link
Collaborator Author

stevendanna commented Apr 8, 2021

I honestly think that for sink-based changefeeds, we should consider switching error handling model from "errors are fatal, unless marked as retryable" to "errors are retryable, unless marked fatal".

I could get behind this, especially if we do a good job on #62556

@knz
Copy link
Contributor

knz commented Apr 9, 2021

Retryable errors are already reliably marked as such, see how the pgerror package detects them.

@ajwerner
Copy link
Contributor

ajwerner commented Apr 9, 2021

My feeling for defaulting to retrying for jobs on errors is that #44594 should be a prerequisite. Having a job spin on failure seems very bad.

@stevendanna
Copy link
Collaborator Author

Retryable errors are already reliably marked as such, see how the pgerror package detects them.

Assuming it correctly marks all retriable errors, there is still work here on the consumer side to make sure the changefeed machinery looks for and responds to the correct retriable types.

My feeling for defaulting to retrying for jobs on errors is that #44594 should be a prerequisite. Having a job spin on failure seems very bad.

Seems reasonable.

@stevendanna stevendanna changed the title transient schemafeed errors are not retried changefeedccl: transient schemafeed errors are not retried Aug 31, 2021
@stevendanna stevendanna removed their assignment Oct 14, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 27, 2022
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targetted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788

Release note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Nov 2, 2022
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targetted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788
Informs CRDB-7581

Release note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Nov 3, 2022
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targetted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788
Informs CRDB-7581

Release note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.
HonoreDB pushed a commit to HonoreDB/cockroach that referenced this issue Nov 4, 2022
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targetted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788
Informs CRDB-7581

Release note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.
craig bot pushed a commit that referenced this issue Nov 6, 2022
90810: changefeedccl: Rework error handling r=miretskiy a=miretskiy

Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targeted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes #90320
Fixes #77549
Fixes #63317
Fixes #71341
Fixes #73016
Informs CRDB-6788
Informs CRDB-7581

Release Note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.


Co-authored-by: Yevgeniy Miretskiy <[email protected]>
@craig craig bot closed this as completed in 86fffa9 Nov 6, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Dec 6, 2022
Prior to this PR, changefeeds would rely on a white list
approach in order to determine which errors were retryable.
All other errors would be deemed terminal, causing changefeed
to fail.

The above approach is brittle, and causes unwanted
changefeed termination.

This PR changes this approach to treat all errors as retryable,
unless otherwise indicated.  Errors that are known by changefeed
to be fatal are handled explicitly, by marking such errors
as terminal.  For example, changefeeds would exit
if the targetted table is dropped.  On the other hand, inability
to read this table for any reason would not be treated as
terminal.

Fixes cockroachdb#90320
Fixes cockroachdb#77549
Fixes cockroachdb#63317
Fixes cockroachdb#71341
Fixes cockroachdb#73016
Informs CRDB-6788
Informs CRDB-7581

Release note (enterprise change): Changefeed will now treat
all errors, unless otherwise indicated, as retryable errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants