
changefeedccl: Expire protected timestamps #97148

Merged · 3 commits merged into cockroachdb:master on Feb 24, 2023

Conversation

@miretskiy (Contributor) commented on Feb 14, 2023

Changefeeds use the protected timestamp system (PTS)
to ensure that the data targeted by a changefeed is not
garbage collected prematurely. A running changefeed manages
its PTS record by periodically advancing the record's timestamp,
so that data older than that timestamp may be GCed. However, if the
changefeed stops running because it is paused (either due
to operator action, or due to the on_error=pause option),
the PTS record remains in place so that the changefeed can
be resumed at a later time. It is also possible that the
operator may not notice that the job has been paused for
too long, causing a buildup of garbage data.

Excessive buildup of GC work is undesirable: it
impacts overall cluster performance, and, once GC can resume,
its cost is proportional to how much GC work has accumulated.
This PR introduces a new changefeed option,
gc_protect_expires_after, to automatically expire PTS records that
are too old. This automatic expiration is a safety mechanism
for the case where a changefeed job is paused by an operator or due to
an error while holding onto a PTS record due to the
protect_data_from_gc_on_pause option.
The operator is still expected to monitor changefeed jobs,
and to restart paused changefeeds expediently. If the changefeed
job remains paused, and the underlying PTS record expires, then
the changefeed job will be canceled to prevent a buildup of GC work.

Epic: CRDB-21953
Informs #84598

Release note (enterprise change): Changefeeds will automatically
expire PTS records for paused jobs if the changefeed is configured
with the gc_protect_expires_after option.
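
For illustration, a changefeed using the new option might be created as follows. This is only a sketch: the table name and sink URI are placeholders, and the duration-string value follows the pattern of other changefeed options.

    -- Hypothetical example: keep protecting data while the feed is paused, but
    -- expire that protection (and cancel the job) if the PTS record becomes
    -- older than 24 hours.
    CREATE CHANGEFEED FOR TABLE movr.rides
      INTO 'kafka://localhost:9092'
      WITH protect_data_from_gc_on_pause, gc_protect_expires_after = '24h';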

@miretskiy requested reviews from several teams (as code owners) on February 14, 2023 at 22:07
@cockroach-teamcity (Member): This change is Reviewable

@jayshrivastava (Contributor) left a comment

Reviewed 3 of 4 files at r1, 1 of 1 files at r2, 15 of 15 files at r3, 12 of 12 files at r4, 3 of 7 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)


-- commits line 8 at r2:
typo: processor


-- commits line 16 at r2:
typo: arise


pkg/ccl/changefeedccl/changefeed_processors.go line 1371 at r5 (raw file):

	recordID := progress.ProtectedTimestampRecord
	expiration := changefeedbase.PTSExpiresAfter.Get(&cf.flowCtx.Cfg.Settings.SV)

Maybe add an extra disable if zero check here. This will help in case the downstream code in reconciler.go changes.


pkg/jobs/registry.go line 1472 at r2 (raw file):

// WithJobMetrics returns a RegisterOption which will configure the job
// to use specified job metrics.

"Will configure jobs of this type to use specified metrics" is a bit more clear.


pkg/jobs/registry.go line 1900 at r3 (raw file):

func init() {
	//metricspoller.RegisterPeriodicClusterStatsCollector("paused-jobs", pollMetricsTask)

This init function should be deleted.


pkg/ts/catalog/chart_catalog.go line 3885 at r3 (raw file):

}

func jobTypeCharts(title string, varName string) chartDescription {

Nice.


pkg/jobs/jobs_test.go line 3507 at r3 (raw file):

	ctx := context.Background()
	// Make sure we set polling intervale before we start the server.

typo: interval


pkg/jobs/jobs_test.go line 3509 at r3 (raw file):

	// Make sure we set polling intervale before we start the server.
	// Otherwise, we might pick up the default value (30 second), which would make
	// this test slow.

good catch

@miretskiy (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @jayshrivastava)


pkg/ccl/changefeedccl/changefeed_processors.go line 1371 at r5 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Maybe add an extra disable if zero check here. This will help in case the downstream code in reconciler.go changes.

Not sure it's needed -- default is 0; and reconciler ignores 0 expiration...


pkg/jobs/registry.go line 1472 at r2 (raw file):

Previously, jayshrivastava (Jayant) wrote…

"Will configure jobs of this type to use specified metrics" is a bit more clear.

Done.


pkg/jobs/registry.go line 1900 at r3 (raw file):

Previously, jayshrivastava (Jayant) wrote…

This init function should be deleted.

Ooops; nice catch.


pkg/ts/catalog/chart_catalog.go line 3885 at r3 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Nice.

Done.


pkg/jobs/jobs_test.go line 3507 at r3 (raw file):

Previously, jayshrivastava (Jayant) wrote…

typo: interval

it's Italian. Done.


pkg/jobs/jobs_test.go line 3509 at r3 (raw file):

Previously, jayshrivastava (Jayant) wrote…

good catch

Done.

@miretskiy (Contributor, Author)

@jayshrivastava @adityamaru -- comments addressed, and CI is green. Would love to get another pass.

@adityamaru (Contributor) left a comment

PTS and jobs changes LGTM! I'll let Jay give the final stamp though.

shouldRemove, err := task(ctx, txn, rec.Meta)
if err != nil {
return err
}
if !shouldRemove && rec.Target != nil &&
rec.Target.Expiration > 0 &&
rec.Timestamp.Add(int64(rec.Target.Expiration), 0).Less(r.clock.Now()) {
Contributor:

do you think this is an interesting enough event that we should log something here? Might be useful when we're looking at debug zips in the future

Member:

+1

I'd say we should at least warn here, since the expectation is that this shouldn't happen: the expectation with PTS is that the owner chooses, explicitly, when to release it.

I'm still a little unsure if this is even the layer where we should be doing expirations at all, but I'll make a top-level comment on this instead.

continue
}
p := j.Payload()
stats := ptsStats[p.Type()]
Member:

Type() will panic if it sees a job type it doesn't know about. We should get rid of it, as it just isn't a pattern that handles mixed-version systems at all gracefully. In the meantime, use CheckType.

@dt (Member) commented on Feb 19, 2023

If a foo job wants to protect a span, but expire that protection at time X if nothing else happens before X, it could also just... do that? i.e. run a little poller loop or fire off a background timer that just waits around and if the other thing doesn't happen, it just calls Release() to remove the protection when time X rolls around.

Having the job or at least a job-aware-level task drive the expiration would allow us to e.g. update the foo job to say "I'm giving up my protected timestamp" and go ahead and move to failed or something, which could also give a more useful error like "resumption window expired at X", rather than just getting opaque GC errors that also look like bugs / failures to protect things that we thought were protected.

In short, we don't need to bake the expiration into the protection itself, unless we're specifically worried about the failure mode where the task that is charged with releasing it goes away/gets stuck/etc. But we actually hashed a bunch of this out in the original PTS RFC discussion: we originally started with a lease-like, short-ish expiration and expected the owner to continually heartbeat their PTS record to push it forward and keep it alive, to ensure that we handled orphaned records well. But then we convinced ourselves that that wasn't the failure mode we wanted: we wanted to "fail safe" instead, so that if some job was slow/stuck and a little too late in updating its expiration, we didn't start GC and potentially cause data loss in the meantime.

Stepping back a little, there is also a UX question here: basically, the behavior is going to be "if a changefeed falls too far behind -- either because it was paused or just slow/stuck -- it will fail and stop protecting history from GC". As it stands in this PR, it would do so by dropping protection of the timestamp and then subsequently failing the job when it can't read that timestamp (with a not very informative error about GC). Presenting this control to the user in this way is sorta weird and relies on them having an understanding of the internals of changefeeds to be able to predict what will happen / interpret the errors: you need to know about timestamps and that not protecting a timestamp leads to failing. Also, it won't be consistent or predictable! If your backup schedule is still protecting it, then it might not fail, but then once your backup runs it will fail. I think this will end up with confused users. Instead, I wonder if we should be describing / presenting the behavior to users not in terms of protection, but in terms of "maximum lag" or something: i.e. make a changefeed that will fail, and release associated resources, if it falls further than X behind, including if it is paused for too long. The behavior is (mostly) the same, but I think the mental model is a little easier for users.

Additionally, I think we should be consistent and have the changefeed fail (and release its PTS) if it falls more than X behind even if a backup is still protecting that timestamp since otherwise you have a weird situation where most of the time it works, but then some of the time it fails, and that's bound to cause confusion.

Separately: I don't think a cluster setting is the right UX here: this seems like something that should be per-feed, not cluster-wide, since we should assume a single cluster holds different databases, for different apps, each with different tables where some may be critical and others not, so a cluster-wide setting isn't great. Also, cluster settings require admin, so they're not great from a granular access control / least-privilege standpoint: now your CDC admin needs cluster-setting admin, which is a big step up in privs.

@dt (Member) commented on Feb 19, 2023

As a concrete counter-suggestion, I'd use the first two commits here but change the third/fourth:

  1. Add a new timestamp field on Payload (not job-specific) called MaximumPTSAge or something.
  2. Add a WITH maximum_allowed_lag or WITH maximum_retained_history or similar option on changefeed stmt, and store the duration if provided in that Payload field.
  3. Extend the metrics poller loop you already are using here, that already loads the PTS record and the Payload, to check if the payload has a max age and if so, if the PTS age exceeds it. Accumulate the set of IDs for which it does.
  4. For each job that had a MaximumPTSAge that was too old, call CANCEL JOB x WITH reason = 'lag / preserved history exceeded configured limit'

Cancelling the job will release the PTS, but do so along with a clear error on the job, and a clear chain of events -- job fell behind -> pts was too old -> job cancelled -> pts was released.
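
To make the suggested UX concrete, here is a sketch of what it might look like from the user's side. The option name was not settled (both maximum_allowed_lag and maximum_retained_history are floated above), the values are placeholders, and the cancellation reason in the last step would be attached by the jobs registry rather than typed by a user.

    -- Sketch of the proposed per-feed option (name and values hypothetical):
    CREATE CHANGEFEED FOR TABLE orders
      INTO 'kafka://localhost:9092'
      WITH maximum_retained_history = '24h';
    -- If the feed then lags (or stays paused) for more than 24 hours, the
    -- metrics poller would cancel it with a reason along the lines of
    -- 'lag / preserved history exceeded configured limit'.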

@miretskiy (Contributor, Author)

If a foo job wants to protect a span, but expire that protection at time X if nothing else happens before X, it could also just... do that? i.e. run a little poller loop or fire off a background timer that just waits around and if the other thing doesn't happen, it just calls Release() to remove the protection when time X rolls around

That's sort of the problem I'm trying to solve. A job, with a PTS record, gets paused. There is no longer an active poller for that job.

@dt (Member) commented on Feb 19, 2023

Yeah, I get that -- see my follow-up suggestion: we just added a polling loop looking at exactly what we want anyway, so we can just use it?

@miretskiy (Contributor, Author)

I have a bone to pick with just going after paused jobs, since any job that is blocking GC for a week or more, paused or otherwise, is still going to create the same problem, but we can talk about that later -- I don't want to let perfect be the enemy of good, and paused jobs are certainly among the most common causes of old PTS records, so this is certainly a huge win.

I need to think about the problem of catch-up scans... Sometimes it's hard to determine what the right
threshold is... is 9 hours too much? Is 12? It all depends on the amount of data...
I think I just went with a more restrictive, but safer, approach. We can revisit if needed.

On the subject of backporting, what's the thinking? Just pull the poller loop from a job into a goroutine that every registry runs instead in the backported version?

I don't think we'll be backporting. Though, as you point out, there might be options.

@dt (Member) commented on Feb 23, 2023

PR/Merge-commit message needs to be updated to reflect the changes but otherwise LGTM

Changefeeds use the protected timestamp system (PTS)
to ensure that the data targeted by a changefeed is not
garbage collected prematurely.  A running changefeed manages
its PTS record by periodically advancing the record's timestamp,
so that data older than that timestamp may be GCed.  However, if the
changefeed stops running because it is paused (either due
to operator action, or due to the `on_error=pause` option),
the PTS record remains in place so that the changefeed can
be resumed at a later time. It is also possible that the
operator may not notice that the job has been paused for
too long, causing a buildup of garbage data.

Excessive buildup of GC work is undesirable: it
impacts overall cluster performance, and, once GC can resume,
its cost is proportional to how much GC work has accumulated.
This PR introduces a new changefeed option
`gc_protect_expires_after` to automatically expire PTS records that
are too old.  This automatic expiration is a safety mechanism
for the case where a changefeed job is paused by an operator or due to
an error while holding onto a PTS record due to the
`protect_data_from_gc_on_pause` option.
The operator is still expected to monitor changefeed jobs,
and to restart paused changefeeds expediently.  If the changefeed
job remains paused, and the underlying PTS record expires, then
the changefeed job will be canceled to prevent a buildup of GC work.

Epic: CRDB-21953
Informs cockroachdb#84598

Release note (enterprise change): Changefeeds will automatically
expire PTS records for paused jobs if the changefeed is configured
with the `gc_protect_expires_after` option.
@miretskiy (Contributor, Author)

PR/Merge-commit message needs to be updated to reflect the changes but otherwise LGTM

Done; a few more cleanups around failing tests (chart catalog, etc.) are also included.

@miretskiy (Contributor, Author)

bors r+

@craig (craig bot) commented on Feb 23, 2023

👎 Rejected by code reviews

@miretskiy (Contributor, Author)

@dt @jayshrivastava seems like bors insists on another vote from you....

@miretskiy (Contributor, Author)

Reviewed 3 of 4 files at r1, 1 of 1 files at r2, 15 of 15 files at r3, 12 of 12 files at r4, 3 of 7 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @dt, and @miretskiy)

-- commits line 8 at r2: typo: processor

-- commits line 16 at r2: typo: arise

pkg/ccl/changefeedccl/changefeed_processors.go line 1371 at r5 (raw file):

	recordID := progress.ProtectedTimestampRecord
	expiration := changefeedbase.PTSExpiresAfter.Get(&cf.flowCtx.Cfg.Settings.SV)

Maybe add an extra disable if zero check here. This will help in case the downstream code in reconciler.go changes.

pkg/jobs/registry.go line 1472 at r2 (raw file):

// WithJobMetrics returns a RegisterOption which will configure the job
// to use specified job metrics.

"Will configure jobs of this type to use specified metrics" is a bit more clear.

pkg/jobs/registry.go line 1900 at r3 (raw file):

func init() {
	//metricspoller.RegisterPeriodicClusterStatsCollector("paused-jobs", pollMetricsTask)

This init function should be deleted.

pkg/ts/catalog/chart_catalog.go line 3885 at r3 (raw file):

}

func jobTypeCharts(title string, varName string) chartDescription {

Nice.

pkg/jobs/jobs_test.go line 3507 at r3 (raw file):

	ctx := context.Background()
	// Make sure we set polling intervale before we start the server.

typo: interval

pkg/jobs/jobs_test.go line 3509 at r3 (raw file):

	// Make sure we set polling intervale before we start the server.
	// Otherwise, we might pick up the default value (30 second), which would make
	// this test slow.

good catch

addressed

@jayshrivastava (Contributor) left a comment

Unblocking

@miretskiy (Contributor, Author)

bors r+

@craig (craig bot) commented on Feb 24, 2023

Build failed:

@miretskiy (Contributor, Author)

bors r+

@craig (craig bot) commented on Feb 24, 2023

Build succeeded:

@craig craig bot merged commit 72bb291 into cockroachdb:master Feb 24, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 17, 2023
Changefeeds utilize protected timestamps (PTS) in order to ensure
that data is not garbage collected (GCed) prematurely.

This subsystem underwent many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that hold
on to stale PTS records.  This PR expands this functionality to apply
to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will retry
almost every error -- and that means that if a running changefeed job
fails to make progress for a very long time, it is possible that a PTS
record will block GC for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once the stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause` option.
This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, deprecate
the `protect_data_from_gc_on_pause` option.  This option is no longer
needed since changefeed jobs always protect data.
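
As a rough usage sketch (the values are placeholders; the default is 4 days), the new fail-safe would be tuned or disabled cluster-wide like so:

    -- Hypothetical examples of adjusting the new cluster setting.
    SET CLUSTER SETTING changefeed.protect_timestamp.max_age = '96h';
    -- Setting it to 0 disables the fail-safe (not recommended; see above).
    SET CLUSTER SETTING changefeed.protect_timestamp.max_age = '0s';
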
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request May 19, 2023
Changefeeds utilize protected timestamps (PTS) in order to ensure
that data is not garbage collected (GCed) prematurely.

This subsystem underwent many rounds of changes, resulting
in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows.
PR cockroachdb#97148 introduced the capability to cancel jobs that hold
on to stale PTS records.  This PR expands this functionality to apply
to all jobs -- not just paused jobs.

This is necessary because, due to cockroachdb#90810, changefeeds will retry
almost every error -- and that means that if a running changefeed job
fails to make progress for a very long time, it is possible that a PTS
record will block GC for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting
`changefeed.protect_timestamp.max_age`, which defaults to a generous 4
days, to make sure that even if the explicit changefeed option
`gc_protect_expires_after` was not specified, the changefeed will fail
after `changefeed.protect_timestamp.max_age` if no progress is made.
This setting only applies to newly created changefeeds.
Use the `ALTER CHANGEFEED` statement to set the `gc_protect_expires_after`
option for existing changefeeds to enable PTS expiration.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing
so could result in stability issues once the stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause` option.
This option is not needed since we now employ "active protected
timestamp" management (meaning: there is always a PTS record when
running changefeed jobs), and the handling of this record is consistent
for both running and paused jobs.

Fixes cockroachdb#103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which
will cancel running changefeed jobs if they fail to make forward
progress for too long.  This setting is used if the explicit
`gc_protect_expires_after` option was not set.  In addition, deprecate
the `protect_data_from_gc_on_pause` option.  This option is no longer
needed since changefeed jobs always protect data.
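
Because the setting applies only to newly created changefeeds, an existing feed would need the per-job option set explicitly. A sketch, with a placeholder job ID:

    -- Hypothetical example: opt an existing changefeed into PTS expiration.
    -- (ALTER CHANGEFEED requires the changefeed to be paused first.)
    PAUSE JOB 842290357776633857;
    ALTER CHANGEFEED 842290357776633857 SET gc_protect_expires_after = '24h';
    RESUME JOB 842290357776633857;
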
craig bot pushed a commit that referenced this pull request May 19, 2023
103528: server: Revert server.shutdown.jobs_wait to 0 r=miretskiy a=miretskiy

Revert default setting for `server.shutdown.jobs_wait` to 0
to ensure that shutdown does not wait for active jobs.

Issues: none
Epic: None

Release note: None

103539: changefeedccl: Improve protected timestamp handling r=miretskiy a=miretskiy

Changefeeds utilize protected timestamps (PTS) in order to ensure that data is not garbage collected (GCed) prematurely.

This subsystem underwent many rounds of changes, resulting in unintuitive, and potentially dangerous, behavior.

This PR updates and improves PTS handling as follows. PR #97148 introduced the capability to cancel jobs that hold on to stale PTS records.  This PR expands this functionality to apply to all jobs -- not just paused jobs.

This is necessary because, due to #90810, changefeeds will retry almost every error -- and that means that if a running changefeed job fails to make progress for a very long time, it is possible that a PTS record will block GC for many days, weeks, or even months.

To guard against this case, introduce a new cluster setting `changefeed.protect_timestamp.max_age`, which defaults to a generous 4 days, to make sure that even if the explicit changefeed option `gc_protect_expires_after` was not specified, the changefeed will fail after `changefeed.protect_timestamp.max_age` if no progress is made.

The fail-safe can be disabled by setting
`changefeed.protect_timestamp.max_age` to 0; note, however, that doing so could result in stability issues once the stale PTS record is released.

In addition, this PR deprecates the `protect_data_from_gc_on_pause` option. This option is not needed since we now employ "active protected timestamp" management (meaning: there is always a PTS record when running changefeed jobs), and the handling of this record is consistent for both running and paused jobs.

Fixes #103464

Release note (enterprise change): Introduce a new
`changefeed.protect_timestamp.max_age` setting (default 4 days), which will cancel running changefeed jobs if they fail to make forward progress for too long.  This setting is used if the explicit `gc_protect_expires_after` option was not set.  In addition, deprecate the `protect_data_from_gc_on_pause` option.  This option is no longer needed since changefeed jobs always protect data.

Co-authored-by: Yevgeniy Miretskiy <[email protected]>