
cdc: copy request body when registering schemas, add schema registry retry metric #98135

Merged
merged 1 commit into cockroachdb:master Mar 9, 2023

Conversation

jayshrivastava
Contributor

@jayshrivastava jayshrivastava commented Mar 7, 2023

cdc: copy request body when registering schemas

Previously, when the schema registry client encountered an error while
registering a schema, it would retry the request. The problem is that
upon hitting an error, we clear the request body before retrying.
Retrying with an empty body results in an obscure error message.
With this change, we now retry with a copy of the original request body,
so the original error is surfaced.

This change also adds the metric changefeed.schema_registry.retry_count,
a counter of the retries performed by the schema registry client.
Nonzero values indicate an issue contacting the schema registry
and/or registering schemas.

Release note (ops change): A new metric changefeed.schema_registry.retry_count
is added. This measures the number of request retries performed when
sending requests to the schema registry. Observing a nonzero value may indicate
improper configuration of the schema registry or changefeed parameters.
Epic: None

@blathers-crl

blathers-crl bot commented Mar 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the schemareg branch 2 times, most recently from cac061b to 865b983 Compare March 7, 2023 16:08
@jayshrivastava jayshrivastava marked this pull request as ready for review March 7, 2023 16:17
@jayshrivastava jayshrivastava requested a review from a team as a code owner March 7, 2023 16:17
@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
}
log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)
Contributor Author


I think I will let this PR cook for a few days before backporting. I'm worried that we'll get random roachtest failures because of this.

@@ -228,7 +225,7 @@ func (r *confluentSchemaRegistry) doWithRetry(ctx context.Context, fn func() err
}
log.VInfof(ctx, 2, "retrying schema registry operation: %s", err.Error())
}
return changefeedbase.MarkRetryableError(err)
return changefeedbase.WithTerminalError(err)
Contributor


I'm not sure we should be doing that terminal-error stuff. We retry all errors that come up due to downstream failures; failure to contact an external schema registry should be treated the same way: retry a few times, then fail with a retryable error.

Contributor


What we should do instead is (a) change the retry count to something more reasonable than 2: make it 10, and make it back off more aggressively. Better yet, use retry.go.

Also, we should add a metric for the schema registry.
I know it's more of an addition, but I think the lack of any observability was detrimental here.

Contributor Author


I'm ok with using retry.go here and adding observability, but I think this should be a permanent error. Changefeeds will retry forever if they see retryable errors, right? If this code retries 10 times and fails, the changefeed should probably fail. For example, what if they get a 401 due to incorrect SSL parameters?

Retrying forever is a customer concern: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1678197741759369?thread_ts=1677783523.194569&cid=C9TGBJB44

Contributor


I'm ok with using retry.go here and adding observability, but I think this should be a permanent error. Changefeeds will retry forever if they see retryable errors, right? If this code retries 10 times and fails, the changefeed should probably fail. For example, what if they get a 401 due to incorrect SSL parameters?

Retrying forever is a customer concern: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1678197741759369?thread_ts=1677783523.194569&cid=C9TGBJB44

A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook?
We retry forever in those cases. I think consistency here is better.
Furthermore, customer concerns around retrying forever could be better addressed by 1. adding better tools to validate the sink and schema registry configuration (by the way, do we validate the schema registry when we create a changefeed? That would be a good place to fail), and 2. having customers monitor stats, including retry rates, and alert on those.

@jayshrivastava jayshrivastava force-pushed the schemareg branch 3 times, most recently from 2392cb4 to b38eb17 Compare March 8, 2023 20:39
Contributor Author

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/ccl/changefeedccl/schema_registry.go line 228 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I'm ok with using retry.go here and adding observability, but I think this should be a permanent error. Changefeeds will retry forever if they see retryable errors right? If this code retries 10 times and fails, the changefeed should probably fail. For example, what if they get a 401 due to incorrect SSL parameters?

Retrying forever is a customer concern: https://cockroachlabs.slack.com/archives/C9TGBJB44/p1678197741759369?thread_ts=1677783523.194569&cid=C9TGBJB44

A 401 may very well happen, but why should the schema registry be treated differently from, say, a kafka:// URI or a webhook?
We retry forever in those cases. I think consistency here is better.
Furthermore, customer concerns around retrying forever could be better addressed by 1. adding better tools to validate the sink and schema registry configuration (by the way, do we validate the schema registry when we create a changefeed? That would be a good place to fail), and 2. having customers monitor stats, including retry rates, and alert on those.

Okay, we can keep it retryable. I think a small number of attempts (5) for this loop is fine. We retry at a higher level (retry.go) with a max backoff of 10 minutes anyway.


pkg/ccl/changefeedccl/schema_registry.go line 234 at r2 (raw file):

		}
		r.sliMetrics.SchemaRegistryRetries.Inc(1)
		log.VInfof(ctx, 1, "retrying schema registry operation: %s", err.Error())

Turns out we had logging the whole time.

@jayshrivastava
Contributor Author

bors r+

@craig craig bot merged commit 6f444d4 into cockroachdb:master Mar 9, 2023
@craig
Contributor

craig bot commented Mar 9, 2023

Build succeeded:

@blathers-crl

blathers-crl bot commented Mar 9, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 972b85e to blathers/backport-release-21.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 972b85e to blathers/backport-release-22.1-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


error creating merge commit from 972b85e to blathers/backport-release-22.2-98135: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

jayshrivastava added a commit to jayshrivastava/cockroach that referenced this pull request Mar 9, 2023
This reverts commit 972b85e from cockroachdb#98135.
The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None
craig bot pushed a commit that referenced this pull request Mar 9, 2023
96321:  sql: vectorized data encoding package r=yuzefovich a=cucaroach

colenc is a new package that allows kv.Batch's to be produced to encode
tables using coldata.Batch's as the input. Every attempt was made to
avoid code duplication and delegate to row encoding code where possible.

Epic: CRDB-18892
Informs: #91831
Release note: None

98312: ui: make upgrade error message encompass more tables r=xinhaoz a=xinhaoz

Previously, we only showed the upgrade error message when querying via the SQL API in cases where the server error message said either that a column does not exist or referenced the specific table `crdb_internal.txn_cluster_insights`. This commit expands that to any table.

Epic: none

Release note (ui change): users will see the upgrade error message when a response from the sql api says that a relation or column does not exist

98338: cdc: reapply commit with malformed message r=samiskin a=jayshrivastava

### cdc: revert commit with malformed message
This reverts commit 972b85e from #98135.
The above commit has a malformed commit message. This commit will be reapplied subsequently.

Epic: None
Release Note: None

---

### cdc: copy request body when registering schemas

Note: This commit message reapplies the commit 972b85e,
which was previously reverted for having an incorrect commit message.

Previously, when the schema registry client encountered an error while
registering a schema, it would retry the request. The problem is that
upon hitting an error, we clear the request body before retrying.
Retrying with an empty body results in an obscure error message.
With this change, we now retry with a copy of the original request body,
so the original error is surfaced.

This change also adds the metric changefeed.schema_registry.retry_count,
a counter of the retries performed by the schema registry client.
Nonzero values indicate an issue contacting the schema registry
and/or registering schemas.

Release note (ops change): A new metric changefeed.schema_registry.retry_count
is added. This measures the number of request retries performed when
sending requests to the schema registry. Observing a nonzero value may indicate
improper configuration of the schema registry or changefeed parameters.
Epic: None

Informs: https://github.com/cockroachlabs/support/issues/2131
Informs: https://github.com/cockroachlabs/support/issues/2129

Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
@shermanCRL shermanCRL changed the title cdc: copy request body when registering schemas cdc: copy request body when registering schemas, add schema registry retry metric Apr 4, 2023