cdc: copy request body when registering schemas, add schema registry retry metric #98135
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
force-pushed from cac061b to 865b983
@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)
I think I will let this PR cook for a few days before backporting. I'm worried that we'll get random roachtest failures bc of this.
@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
-	return changefeedbase.MarkRetryableError(err)
+	return changefeedbase.WithTerminalError(err)
I'm not sure we should be doing that terminal error stuff. We retry all errors that come up due to downstream failures. Failure to contact an external schema registry should be treated the same way: retry a few times, then fail with a retryable error.
What we should do instead is (a) change the retry count to something more reasonable than 2: make it 10, and make it back off more aggressively. Better yet, use retry.go.
Also, we should add a metric for the schema registry.
I know it's more of an addition, but I think the lack of any observability was detrimental here.
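For reference, a minimal sketch of what the body of doWithRetry could look like on top of pkg/util/retry; the backoff values and the MarkRetryableError fallback are illustrative assumptions, not the exact code that landed:

// Sketch only: assumes pkg/util/retry, pkg/util/log, and this PR's
// changefeedbase package; backoff values are illustrative.
opts := retry.Options{
	InitialBackoff: 100 * time.Millisecond,
	Multiplier:     2,
	MaxBackoff:     10 * time.Second,
	MaxRetries:     10,
}
var err error
for retrier := retry.StartWithCtx(ctx, opts); retrier.Next(); {
	if err = fn(); err == nil {
		return nil
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)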
I'm ok with using retry.go here and adding observability, but I think this should be a permanent error. Changefeeds will retry forever if they see retryable errors, right? If this code retries 10 times and fails, the changefeed should probably fail. For example, what if they get a 401 due to incorrect SSL parameters?
Retrying forever is a customer concern: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1678197741759369?thread_ts=1677783523.194569&cid=C9TGBJB44
A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook?
We retry forever in those cases. I think consistency here is better.
Furthermore, customer concerns about retrying forever could be better resolved by: 1. adding better tools to validate the sink and the schema registry (by the way, do we validate the schema registry when we create a changefeed? That would be a good place to fail), and 2. having customers monitor stats, including retry rates, and alert on those.
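(Illustrative only: a hypothetical creation-time check could ping the registry's REST API, for example the Confluent Schema Registry's GET /subjects endpoint. This helper is not code from this PR.)

import (
	"context"
	"fmt"
	"net/http"
)

// Hypothetical helper, not part of this PR: fail fast at CREATE CHANGEFEED
// time if the schema registry is unreachable or rejects the request.
func pingSchemaRegistry(ctx context.Context, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/subjects", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("schema registry returned HTTP %d", resp.StatusCode)
	}
	return nil
}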
force-pushed from 2392cb4 to b38eb17
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/schema_registry.go
line 228 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook? We retry forever in those cases. I think consistency here is better. Furthermore, customer concerns about retrying forever could be better resolved by adding better tools to validate the sink and schema registry, and by having customers monitor stats, including retry rates, and alert on those.
Okay, we can keep it retryable. I think a small number of attempts (5) for this loop is fine. We retry at a higher level (retry.go) with a max backoff of 10 minutes anyway.
pkg/ccl/changefeedccl/schema_registry.go
line 234 at r2 (raw file):
	}
	r.sliMetrics.SchemaRegistryRetries.Inc(1)
	log.VInfof(ctx, 1, "retrying schema registry operation: %s", err.Error())
Turns out we had logging the whole time.
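For context, a rough sketch of how such a counter can be wired up with CockroachDB's pkg/util/metric; the metadata strings below are inferred from the release note rather than copied from the PR:

// import "github.com/cockroachdb/cockroach/pkg/util/metric"
// Sketch: metadata values are assumptions based on the release note.
metaSchemaRegistryRetries := metric.Metadata{
	Name:        "changefeed.schema_registry.retry_count",
	Help:        "Number of retries encountered when sending requests to the schema registry",
	Measurement: "Retries",
	Unit:        metric.Unit_COUNT,
}
schemaRegistryRetries := metric.NewCounter(metaSchemaRegistryRetries)
schemaRegistryRetries.Inc(1) // bumped on each retry, as in the snippet above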
force-pushed from b38eb17 to 43be34b
force-pushed from 43be34b to ea1a275
force-pushed from ea1a275 to 972b85e
bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 972b85e to blathers/backport-release-21.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.

error creating merge commit from 972b85e to blathers/backport-release-22.1-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.

error creating merge commit from 972b85e to blathers/backport-release-22.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This reverts commit 972b85e from cockroachdb#98135. The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None
96321: sql: vectorized data encoding package r=yuzefovich a=cucaroach

colenc is a new package that allows kv.Batch's to be produced to encode tables using coldata.Batch's as the input. Every attempt was made to avoid code duplication and delegate to row encoding code where possible.

Epic: CRDB-18892
Informs: #91831
Release note: None

98312: ui: make upgrade error message encompass more tables r=xinhaoz a=xinhaoz

Previously, we only showed the upgrade error message when querying using the sql api in cases where the server error message said either that a column does not exist or named the specific table `crdb_internal.txn_cluster_insights`. This commit expands that to any table.

Epic: none
Release note (ui change): users will see the upgrade error message when a response from the sql api says that a relation or column does not exist

98338: cdc: reapply commit with malformed message r=samiskin a=jayshrivastava

### cdc: revert commit with malformed message

This reverts commit 972b85e from #98135. The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None

---

### cdc: copy request body when registering schemas

Note: This commit message reapplies the commit 972b85e, which was previously reverted for having an incorrect commit message.

Previously, when the schema registry encountered an error when registering a schema, it would retry the request. The problem is that upon hitting an error, we clear the body before retrying. Retrying with an empty body results in an obscure error message. With this change, we now retry with the original request body so the original error is surfaced.

This change also adds the metric changefeed.schema_registry.retry_count, which is a counter for the number of retries performed by the schema registry. Seeing nonzero values indicates that there is an issue with contacting the schema registry and/or registering schemas.

Release note (ops change): A new metric changefeed.schema_registry.retry_count is added. This measures the number of request retries performed when sending requests to the schema registry. Observing a nonzero value may indicate improper configuration of the schema registry or changefeed parameters.

Epic: None
Informs: https://github.com/cockroachlabs/support/issues/2131
Informs: https://github.com/cockroachlabs/support/issues/2129

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
cdc: copy request body when registering schemas
Previously, when the schema registry encountered an error when
registering a schema, it would retry the request. The problem
is that upon hitting an error, we clear the body before retrying.
Retrying with an empty body results in an obscure error message.
With this change, we now retry with the original request body
so the original error is surfaced.
This change also adds the metric
changefeed.schema_registry.retry_count,
which is a counter for the number of retries performed by the schema
registry. Seeing nonzero values indicates that there is an issue
with contacting the schema registry and/or registering schemas.
Release note (ops change): A new metric
changefeed.schema_registry.retry_count
is added. This measures the number of request retries performed when
sending requests to the schema registry. Observing a nonzero value may indicate
improper configuration of the schema registry or changefeed parameters.
Epic: None
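The underlying Go pitfall in the fix described above: an *http.Request body is an io.Reader that is drained on the first attempt, so a naive retry sends an empty payload. A minimal sketch under those assumptions (the helper name and content-type handling are illustrative, not the PR's exact code):

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
)

// Illustrative sketch: buffer the request body once so every retry sends
// the original payload instead of an already-drained reader.
func postWithRetry(ctx context.Context, url string, body io.Reader, maxRetries int) error {
	payload, err := io.ReadAll(body)
	if err != nil {
		return err
	}
	for attempt := 0; ; attempt++ {
		// bytes.NewReader gives each attempt a fresh, unconsumed body.
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/vnd.schemaregistry.v1+json")
		resp, reqErr := http.DefaultClient.Do(req)
		if reqErr == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
			reqErr = fmt.Errorf("schema registry returned HTTP %d", resp.StatusCode)
		}
		if attempt >= maxRetries {
			return reqErr
		}
	}
}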