changefeedccl: transient schemafeed errors are not retried #63317

stevendanna · 2021-04-08T16:15:35Z

Describe the problem

In #63282 we saw that a clock drift issue led to one node being terminated. This then resulted in a changefeed failure because the node termination led to a failure to fetch a table descriptor during schema polling.

It appears that all errors from the schemafeed are treated as fatal, even though they might be caused by transient communication issues.

We may want identify which of the many error returns in fetchDescriptorVersions may be retriable. It seems likely that at least this ought to be:

	res, pErr := kv.SendWrappedWith(ctx, db.NonTransactionalSender(), header, req)
	if log.V(2) {
		log.Infof(ctx, `fetched table descs (%s,%s] took %s`, startTS, endTS, timeutil.Since(start))
	}
	if pErr != nil {
		err := pErr.GoError()
		return nil, errors.Wrapf(err, `fetching changes for %s`, span)
	}

Note thatpErr.GoError() will return an UnhandledRetriableError in some cases, but the rest of the changefeed code doesn't look for that particular type. And, it is isn't clear to me without further verification that that will cover all of the transient errors the function could return.

Jira issue: CRDB-6524

The text was updated successfully, but these errors were encountered:

miretskiy · 2021-04-08T16:33:51Z

I honestly think that for sink-based changefeeds, we should consider switching error handling model from "errors are fatal, unless marked as retryable" to "errors are retryable, unless marked fatal". We should also augment retries with addressing #62556 (do not retry forever without making progress).

stevendanna · 2021-04-08T16:48:00Z

I honestly think that for sink-based changefeeds, we should consider switching error handling model from "errors are fatal, unless marked as retryable" to "errors are retryable, unless marked fatal".

I could get behind this, especially if we do a good job on #62556

knz · 2021-04-09T05:54:12Z

Retryable errors are already reliably marked as such, see how the pgerror package detects them.

ajwerner · 2021-04-09T06:00:46Z

My feeling for defaulting to retrying for jobs on errors is that #44594 should be a prerequisite. Having a job spin on failure seems very bad.

stevendanna · 2021-04-09T07:21:12Z

Retryable errors are already reliably marked as such, see how the pgerror package detects them.

Assuming it correctly marks all retriable errors, there is still work here on the consumer side to make sure the changefeed machinery looks for and responds to the correct retriable types.

My feeling for defaulting to retrying for jobs on errors is that #44594 should be a prerequisite. Having a job spin on failure seems very bad.

Seems reasonable.

Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing changefeed to fail. The above approach is brittle, and causes unwanted changefeed termination. This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targetted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal. Fixes cockroachdb#90320 Fixes cockroachdb#77549 Fixes cockroachdb#63317 Fixes cockroachdb#71341 Fixes cockroachdb#73016 Informs CRDB-6788 Release note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors.

Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing changefeed to fail. The above approach is brittle, and causes unwanted changefeed termination. This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targetted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal. Fixes cockroachdb#90320 Fixes cockroachdb#77549 Fixes cockroachdb#63317 Fixes cockroachdb#71341 Fixes cockroachdb#73016 Informs CRDB-6788 Informs CRDB-7581 Release note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors.

90810: changefeedccl: Rework error handling r=miretskiy a=miretskiy Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing changefeed to fail. The above approach is brittle, and causes unwanted changefeed termination. This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targeted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal. Fixes #90320 Fixes #77549 Fixes #63317 Fixes #71341 Fixes #73016 Informs CRDB-6788 Informs CRDB-7581 Release Note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors. Co-authored-by: Yevgeniy Miretskiy <[email protected]>

Prior to this PR, changefeeds would rely on a white list approach in order to determine which errors were retryable. All other errors would be deemed terminal, causing changefeed to fail. The above approach is brittle, and causes unwanted changefeed termination. This PR changes this approach to treat all errors as retryable, unless otherwise indicated. Errors that are known by changefeed to be fatal are handled explicitly, by marking such errors as terminal. For example, changefeeds would exit if the targetted table is dropped. On the other hand, inability to read this table for any reason would not be treated as terminal. Fixes cockroachdb#90320 Fixes cockroachdb#77549 Fixes cockroachdb#63317 Fixes cockroachdb#71341 Fixes cockroachdb#73016 Informs CRDB-6788 Informs CRDB-7581 Release note (enterprise change): Changefeed will now treat all errors, unless otherwise indicated, as retryable errors.

stevendanna added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 8, 2021

stevendanna mentioned this issue Apr 8, 2021

roachtest: cdc/bank failed #63282

Closed

amruss assigned stevendanna Apr 19, 2021

stevendanna mentioned this issue May 4, 2021

changefeedccl: clarify changefeed retry behavior #64646

Closed

stevendanna changed the title ~~transient schemafeed errors are not retried~~ changefeedccl: transient schemafeed errors are not retried Aug 31, 2021

stevendanna removed their assignment Oct 14, 2022

miretskiy mentioned this issue Oct 27, 2022

changefeedccl: Rework error handling #90810

Merged

craig bot closed this as completed in 86fffa9 Nov 6, 2022

miretskiy mentioned this issue Dec 1, 2022

changefeedccl: Rework error handling #92824

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

changefeedccl: transient schemafeed errors are not retried #63317

changefeedccl: transient schemafeed errors are not retried #63317

stevendanna commented Apr 8, 2021 •

edited by cockroach-jira-scripts

Loading

miretskiy commented Apr 8, 2021

stevendanna commented Apr 8, 2021 •

edited by shermanCRL

Loading

knz commented Apr 9, 2021

ajwerner commented Apr 9, 2021

stevendanna commented Apr 9, 2021

changefeedccl: transient schemafeed errors are not retried #63317

changefeedccl: transient schemafeed errors are not retried #63317

Comments

stevendanna commented Apr 8, 2021 • edited by cockroach-jira-scripts Loading

miretskiy commented Apr 8, 2021

stevendanna commented Apr 8, 2021 • edited by shermanCRL Loading

knz commented Apr 9, 2021

ajwerner commented Apr 9, 2021

stevendanna commented Apr 9, 2021

stevendanna commented Apr 8, 2021 •

edited by cockroach-jira-scripts

Loading

stevendanna commented Apr 8, 2021 •

edited by shermanCRL

Loading