Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: Improve protected timestamp handling #103539

Merged
merged 1 commit into from
May 19, 2023

Conversation

miretskiy
Copy link
Contributor

Changefeeds utilize protected timestamp (PTS) in order to ensure that the data is not garbage collected (GC) prematurely.

This subsystem underwent through many rounds of changes, resulting in an unintuitive, and potentially dangerous behavior.

This PR updates and improves PTS handling as follows. PR #97148 introduce capability to cancel jobs that hold on to stale PTS records. This PR expands this functionality to apply to all jobs -- not just paused jobs.

This is necessary because due to #90810, changefeeds will retry almost every error -- and that means that if a running changefeed jobs fails to make progress for very long time, it is possible that a PTS record will protect GC collection for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting changefeed.protect_timestamp.max_age, which defaults to generous 4 days, to make sure that even if the explicit changefeed option gc_protect_expires_after was not specified, the changefeed will fail after changefeed.protect_timestamp.max_age if no progress is made.

The fail safe can be disabled by setting
changefeed.protect_timestamp.max_age to 0; Note, however, that doing so could result in stability issues once stale PTS record released.

In addition, this PR deprecates protect_data_from_gc_on_pause option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs.

Fixes #103464

Release note (enterprise change): Introduce a new
changefeed.protect_timestamp.max_age setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for much time. This setting is used if the explicit gc_protect_expires_after option was not set. In addition, deprecate protect_data_from_gc_on_pause option. This option is no longer needed since changefeed jobs always protect data.

@miretskiy miretskiy requested review from a team as code owners May 17, 2023 19:42
@miretskiy miretskiy requested review from adityamaru and samiskin and removed request for a team May 17, 2023 19:42
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@miretskiy miretskiy force-pushed the metrics branch 2 times, most recently from d4a2504 to a90f15c Compare May 17, 2023 22:06
var MaxProtectedTimestampAge = settings.RegisterDurationSetting(
settings.TenantWritable,
"changefeed.protect_timestamp.max_age",
"fail the changefeed is the protected timestamp age exceeds this threshold; 0 disables expiration",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"fail the changefeed is the protected timestamp age exceeds this threshold; 0 disables expiration",
"fail the changefeed if the protected timestamp age exceeds this threshold; 0 disables expiration",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done; FYI: I also updated changefeed_stmt warning message to produce better warning when this setting is too low.

func (s StatementOptions) DeprecationWarnings() []string {
if newFormat, ok := NoLongerExperimental[s.m[OptFormat]]; ok {
return []string{fmt.Sprintf(`%[1]s is no longer experimental, use %[2]s=%[1]s`,
newFormat, OptFormat)}
}
for retiredOpt := range RetiredOptions {
if _, isSet := s.m[retiredOpt]; isSet {
return []string{fmt.Sprintf("%s option is no longer needed", retiredOpt)}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice if we explained what the current behaviour is now that the option is not needed, in this case it always being enabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will have to be done on the docs side; based on the PR description.
(otherwise, we'll have to add more configurations for each option we decide to retire.

@miretskiy miretskiy force-pushed the metrics branch 2 times, most recently from f5eb1bf to 32ea568 Compare May 19, 2023 14:05
Changefeeds utilize protected timestamp (PTS) in order to ensure
that the data is not garbage collected (GC) prematurely.

This subsystem underwent through many rounds of changes, resulting
in an unintuitive, and potentially dangerous behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduce capability to cancel jobs that hold on to stale
PTS records.  This PR expands this functionality to apply to all
jobs -- not just paused jobs.

This is necessary because due to cockroachdb#90810, changefeeds will retry almost
every error -- and that means that if a running changefeed jobs fails
to make progress for very long time, it is possible that a PTS record
will protect GC collection for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.
This setting only applies to newly created changefeeds.
Use `ALTER CHANGEFEED` statement to set `gc_protect_expires_after` option
for existing changefeeds to enable PTS expiration.

The fail safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; Note, however, that doing
so could result in stability issues once stale PTS record released.

In addition, this PR deprecates `protect_data_from_gc_on_pause` option.
This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for much time.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, deprecate
`protect_data_from_gc_on_pause` option.  This option is no longer needed
since changefeed jobs always protect data.
@miretskiy
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented May 19, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: notify that data will not be protected on pause
3 participants