cdc: copy request body when registering schemas, add schema registry retry metric #98135
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
force-pushed from cac061b to 865b983
@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)
I think I will let this PR cook for a few days before backporting. I'm worried that we'll get random roachtest failures bc of this.
@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
-	return changefeedbase.MarkRetryableError(err)
+	return changefeedbase.WithTerminalError(err)
I'm not sure we should be doing that terminal error stuff. We retry all errors that come up due to downstream failures. Failure to contact an external schema registry should be treated the same way: retry a few times, then fail with a retryable error.
What we should do instead is (a) change the retry count to something more reasonable than 2: make it 10, and make it back off more aggressively. Better yet, use retry.go.
Also, we should add a metric for the schema registry.
I know it's more of an addition, but I think the lack of any observability was detrimental here.
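For reference, a minimal sketch of what the body of doWithRetry could look like on top of pkg/util/retry; the backoff values and the MarkRetryableError fallback are illustrative assumptions, not the exact code that landed:

// Sketch only: assumes pkg/util/retry, pkg/util/log, and this PR's
// changefeedbase package; backoff values are illustrative.
opts := retry.Options{
	InitialBackoff: 100 * time.Millisecond,
	Multiplier:     2,
	MaxBackoff:     10 * time.Second,
	MaxRetries:     10,
}
var err error
for retrier := retry.StartWithCtx(ctx, opts); retrier.Next(); {
	if err = fn(); err == nil {
		return nil
	}
	log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)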
I'm ok with using retry.go here and adding observability, but I think this should be a permanent error. Changefeeds will retry forever if they see retryable errors, right? If this code retries 10 times and fails, the changefeed should probably fail. For example, what if they get a 401 due to incorrect SSL parameters?
Retrying forever is a customer concern: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1678197741759369?thread_ts=1677783523.194569&cid=C9TGBJB44
A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook?
We retry forever in those cases. I think consistency here is better.
Furthermore, customer concerns about retrying forever could be better resolved by: 1. adding better tools to validate the sink and the schema registry (by the way, do we validate the schema registry when we create a changefeed? That would be a good place to fail), and 2. having customers monitor stats, including retry rates, and alert on those.
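(Illustrative only: a hypothetical creation-time check could ping the registry's REST API, for example the Confluent Schema Registry's GET /subjects endpoint. This helper is not code from this PR.)

import (
	"context"
	"fmt"
	"net/http"
)

// Hypothetical helper, not part of this PR: fail fast at CREATE CHANGEFEED
// time if the schema registry is unreachable or rejects the request.
func pingSchemaRegistry(ctx context.Context, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/subjects", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("schema registry returned HTTP %d", resp.StatusCode)
	}
	return nil
}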
force-pushed from 2392cb4 to b38eb17
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/ccl/changefeedccl/schema_registry.go
line 228 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook? We retry forever in those cases. I think consistency here is better. Furthermore, customer concerns about retrying forever could be better resolved by adding better tools to validate the sink and schema registry, and by having customers monitor stats, including retry rates, and alert on those.
Okay, we can keep it retryable. I think a small number of attempts (5) for this loop is fine. We retry at a higher level (retry.go) with a max backoff of 10 minutes anyway.
pkg/ccl/changefeedccl/schema_registry.go
line 234 at r2 (raw file):
	}
	r.sliMetrics.SchemaRegistryRetries.Inc(1)
	log.VInfof(ctx, 1, "retrying schema registry operation: %s", err.Error())
Turns out we had logging the whole time.
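For context, a rough sketch of how such a counter can be wired up with CockroachDB's pkg/util/metric; the metadata strings below are inferred from the release note rather than copied from the PR:

// import "github.com/cockroachdb/cockroach/pkg/util/metric"
// Sketch: metadata values are assumptions based on the release note.
metaSchemaRegistryRetries := metric.Metadata{
	Name:        "changefeed.schema_registry.retry_count",
	Help:        "Number of retries encountered when sending requests to the schema registry",
	Measurement: "Retries",
	Unit:        metric.Unit_COUNT,
}
schemaRegistryRetries := metric.NewCounter(metaSchemaRegistryRetries)
schemaRegistryRetries.Inc(1) // bumped on each retry, as in the snippet above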
force-pushed from b38eb17 to 43be34b
force-pushed from 43be34b to ea1a275
force-pushed from ea1a275 to 972b85e
bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.

error creating merge commit from 972b85e to blathers/backport-release-21.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.

error creating merge commit from 972b85e to blathers/backport-release-22.1-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.x failed. See errors above.

error creating merge commit from 972b85e to blathers/backport-release-22.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.2.x failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This reverts commit 972b85e from cockroachdb#98135. The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None
96321: sql: vectorized data encoding package r=yuzefovich a=cucaroach

colenc is a new package that allows kv.Batch's to be produced to encode tables using coldata.Batch's as the input. Every attempt was made to avoid code duplication and delegate to row encoding code where possible.

Epic: CRDB-18892
Informs: #91831
Release note: None

98312: ui: make upgrade error message encompass more tables r=xinhaoz a=xinhaoz

Previously, we only showed the upgrade error message when querying using the sql api in cases where the server error message said either that a column does not exist or named the specific table `crdb_internal.txn_cluster_insights`. This commit expands that to any table.

Epic: none
Release note (ui change): users will see the upgrade error message when a response from the sql api says that a relation or column does not exist

98338: cdc: reapply commit with malformed message r=samiskin a=jayshrivastava

### cdc: revert commit with malformed message

This reverts commit 972b85e from #98135. The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None

---

### cdc: copy request body when registering schemas

Note: This commit message reapplies the commit 972b85e, which was previously reverted for having an incorrect commit message.

Previously, when the schema registry encountered an error when registering a schema, it would retry the request. The problem is that upon hitting an error, we clear the body before retrying. Retrying with an empty body results in an obscure error message. With this change, we now retry with the original request body so the original error is surfaced.

This change also adds the metric changefeed.schema_registry.retry_count, which is a counter for the number of retries performed by the schema registry. Seeing nonzero values indicates that there is an issue with contacting the schema registry and/or registering schemas.

Release note (ops change): A new metric changefeed.schema_registry.retry_count is added. This measures the number of request retries performed when sending requests to the schema registry. Observing a nonzero value may indicate improper configuration of the schema registry or changefeed parameters.

Epic: None
Informs: https://github.com/cockroachlabs/support/issues/2131
Informs: https://github.com/cockroachlabs/support/issues/2129

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
cdc: copy request body when registering schemas
Previously, when the schema registry encountered an error when
registering a schema, it would retry the request. The problem
is that upon hitting an error, we clear the body before retrying.
Retrying with an empty body results in an obscure error message.
With this change, we now retry with the original request body
so the original error is surfaced.
This change also adds the metric
changefeed.schema_registry.retry_count,
which is a counter for the number of retries performed by the schema
registry. Seeing nonzero values indicates that there is an issue
with contacting the schema registry and/or registering schemas.
Release note (ops change): A new metric
changefeed.schema_registry.retry_count
is added. This measures the number of request retries performed when
sending requests to the schema registry. Observing a nonzero value may indicate
improper configuration of the schema registry or changefeed parameters.
Epic: None
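The underlying Go pitfall in the fix described above: an *http.Request body is an io.Reader that is drained on the first attempt, so a naive retry sends an empty payload. A minimal sketch under those assumptions (the helper name and content-type handling are illustrative, not the PR's exact code):

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
)

// Illustrative sketch: buffer the request body once so every retry sends
// the original payload instead of an already-drained reader.
func postWithRetry(ctx context.Context, url string, body io.Reader, maxRetries int) error {
	payload, err := io.ReadAll(body)
	if err != nil {
		return err
	}
	for attempt := 0; ; attempt++ {
		// bytes.NewReader gives each attempt a fresh, unconsumed body.
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/vnd.schemaregistry.v1+json")
		resp, reqErr := http.DefaultClient.Do(req)
		if reqErr == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
			reqErr = fmt.Errorf("schema registry returned HTTP %d", resp.StatusCode)
		}
		if attempt >= maxRetries {
			return reqErr
		}
	}
}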