
jobs: retry jobs with exponential backoff #66889

Merged (2 commits) on Aug 14, 2021

Conversation

@sajjadrizvi commented Jun 25, 2021

This commit adds a mechanism to retry jobs with exponentially increasing
delays. This is achieved through two new columns in the system.jobs table,
last_run and num_runs. In addition, this commit adds cluster settings to
control the exponential-backoff parameters, the initial delay and the max
delay: jobs.registry.retry.initial_delay and jobs.registry.retry.max_delay,
respectively. Finally, this commit adds a new partial index on the jobs
table that improves the performance of the periodic queries that the
registry runs on each node.

Release note (general change): Retries of jobs that fail due to a retryable
error or due to job coordinator failure are now delayed using exponential
backoff. Before this change, jobs that failed in a retryable manner were
resumed immediately on a different coordinator. This change reduces the
impact of recurrently failing jobs on the cluster. It adds two new cluster
settings that control this behavior: "jobs.registry.retry.initial_delay" and
"jobs.registry.retry.max_delay", which control the initial delay and the
maximum delay between resumptions, respectively.

Fixes #44594
Fixes #65080
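
For intuition, here is a minimal, self-contained sketch of the kind of delay calculation these settings control. The function and variable names (nextRunAfter, initialDelay, maxDelay) and the default values used below are illustrative assumptions, not the PR's actual code: the delay starts at the initial-delay setting, doubles after every failed run, and is capped at the max-delay setting.

package main

import (
	"fmt"
	"math"
	"time"
)

// nextRunAfter sketches the backoff: initialDelay * 2^(numRuns-1), capped at maxDelay.
func nextRunAfter(numRuns int, initialDelay, maxDelay time.Duration) time.Duration {
	if numRuns < 1 {
		numRuns = 1
	}
	d := float64(initialDelay) * math.Pow(2, float64(numRuns-1))
	if d > float64(maxDelay) {
		return maxDelay
	}
	return time.Duration(d)
}

func main() {
	// With the values assumed here (30s initial, 10m cap), delays grow
	// 30s, 1m, 2m, 4m, 8m, 10m, 10m, ...
	for runs := 1; runs <= 7; runs++ {
		fmt.Printf("run %d -> %s\n", runs, nextRunAfter(runs, 30*time.Second, 10*time.Minute))
	}
}

Operators would tune the actual behavior through the cluster settings named above, e.g. SET CLUSTER SETTING jobs.registry.retry.max_delay = '10m'; (the concrete default values in the sketch are assumptions for illustration only).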

@cockroach-teamcity (Member)

This change is Reviewable

@sajjadrizvi force-pushed the job_retry_backoff branch 3 times, most recently from d405c37 to 3747e06, on June 29, 2021 13:23
@sajjadrizvi changed the title from "[DNM] jobs: retry failed jobs with exponential backoff" to "jobs: retry failed jobs with exponential backoff" on Jun 29, 2021
@sajjadrizvi requested review from ajwerner and a team on June 29, 2021 13:26
@sajjadrizvi marked this pull request as ready for review on June 29, 2021 13:26
@sajjadrizvi changed the title from "jobs: retry failed jobs with exponential backoff" to "[WIP] jobs: retry failed jobs with exponential backoff" on Jun 29, 2021
@sajjadrizvi marked this pull request as draft on June 29, 2021 15:09
@sajjadrizvi changed the title from "[WIP] jobs: retry failed jobs with exponential backoff" to "[DNM] jobs: retry failed jobs with exponential backoff" on Jun 29, 2021
@sajjadrizvi force-pushed the job_retry_backoff branch 6 times, most recently from 315bcea to 6111368, on June 29, 2021 21:27
@ajwerner (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sajjadrizvi)


pkg/jobs/adopt.go, line 51 at r3 (raw file):

      SET claim_session_id = $1, claim_instance_id = $2
    WHERE ((claim_session_id IS NULL)
      AND (status IN ` + claimableStatusTupleString + `))

Why these extra parens?


pkg/jobs/adopt.go, line 86 at r3 (raw file):

	// Select only those jobs that have their next retry time before the current time.
	// 2^62 is the max positive number.
	// $5 is the base in exponential backoff calculation, and $6 is the max retry delay.

I think your placeholders are off now. Also, $6 isn't the "base" the way we usually talk about bases in exponentials but rather the "starting value".
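
For context, here is a rough sketch (not the PR's actual query text) of the kind of predicate being discussed, written as a Go constant in the style of adopt.go; the placeholder numbers, the fallback to created, and the exact SQL expressions are assumptions:

// backoffClauseSketch expresses "the next retry time is before now": the
// delay starts at the initial (starting) delay, doubles with num_runs, and
// is capped both at 2^62 doublings and at the max-delay placeholder.
// $3 = initial delay in seconds, $4 = maximum delay in seconds (assumed).
const backoffClauseSketch = `
  COALESCE(last_run, created) +
    '1 second'::INTERVAL * least($4, $3 * power(2, least(62, COALESCE(num_runs, 1) - 1)))
  < now()`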


pkg/jobs/config.go, line 112 at r3 (raw file):

	retryBaseSetting = settings.RegisterDurationSetting(
		retryBaseSettingKey,
		"the base multiplier to calculate the exponential-backoff delay "+

nit: starting value. You could also supply a base if you wanted (it should be positive). Right now it's 2.


pkg/migration/migrations/retry_jobs_with_exponential_backoff.go, line 24 at r3 (raw file):

	ctx context.Context, _ clusterversion.ClusterVersion, d migration.TenantDeps,
) error {
	const query = `

The problem with this approach is that it might return before the schema change is done. Imagine that we start this migration and issue this transaction. It will create the job to do the schema change. Now, while the schema change is running, we crash. When we come back around, the schema change will still be running but the below commands will no-op and we'll return happily. This is not a good place to be. We need to make sure that the table really is changed.

There are a bunch of ways to do this. One way is to check to make sure that the jobs table does not have any running jobs and if it does, to wait for them. You could do that at the top of the loop. There are two basic approaches you could take to that task:

  1. Use a descs.Txn to read the jobs table and check for the new columns or for a pending job
  2. Use sql to do the same thing by using crdb_internal.pb_to_json

I don't have a strong preference.
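
To make option 1 concrete, here is a minimal, hedged sketch of waiting until the schema change has really taken effect before the migration returns. The names (waitForJobsTableMigration, hasBackoffColumns) and the polling loop are illustrative assumptions, not the PR's implementation; hasBackoffColumns stands in for whatever reads the jobs table descriptor (e.g. inside a descs.Txn) and reports whether num_runs and last_run exist or a schema-change job is still pending.

package migrations

import (
	"context"
	"time"
)

// waitForJobsTableMigration polls until hasBackoffColumns reports that the
// jobs table exposes the new columns, so the migration cannot report success
// while the underlying schema change is still running.
func waitForJobsTableMigration(
	ctx context.Context, hasBackoffColumns func(context.Context) (bool, error),
) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		done, err := hasBackoffColumns(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}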


pkg/migration/migrations/retry_jobs_with_exponential_backoff.go, line 26 at r3 (raw file):

	const query = `
BEGIN;
	ALTER TABLE IF EXISTS system.jobs

The table will definitely exist so nix the first IF EXISTS.


pkg/sql/catalog/systemschema/system.go, line 897 at r3 (raw file):

				StoreColumnNames:    []string{"claim_instance_id", "num_runs", "last_run"},
				StoreColumnIDs:      []descpb.ColumnID{9, 10, 11},
				KeySuffixColumnIDs:  []descpb.ColumnID{1}, //TODO(sajjad): To discuss: What should be the values here?

The values should be all of the columns of the primary key which are not in the KeyColumnIDs. In this case, 1.

@sajjadrizvi force-pushed the job_retry_backoff branch 3 times, most recently from 0472f20 to 1745705, on July 7, 2021 15:56
@@ -954,8 +958,21 @@ var (
KeySuffixColumnIDs: []descpb.ColumnID{1},
Version: descpb.StrictIndexColumnIDGuaranteesVersion,
},
// TODO(sajjad): To discuss: Why are we creating another index instead of storing num_runs and last_runs in jobs_status_created_idx?

We wanted to create another index for two reasons. The main one is to make it a partial index. The other reason is to allow us to order it on claim_session_id


Speaking of which, this is missing the predicate.

Comment on lines 35 to 40
const addColsQuery = `ALTER TABLE system.jobs
ADD COLUMN IF NOT EXISTS num_runs INT8 FAMILY claim,
ADD COLUMN IF NOT EXISTS last_run TIMESTAMP FAMILY claim;`

const addIndexQuery = `
CREATE INDEX IF NOT EXISTS jobs_status_claim

I'm not so sure about these IF NOT EXISTS

ADD COLUMN IF NOT EXISTS last_run TIMESTAMP FAMILY claim;`

const addIndexQuery = `
CREATE INDEX IF NOT EXISTS jobs_status_claim

Be careful about aligning the naming on the index.

@sajjadrizvi (Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/jobs/adopt.go, line 51 at r3 (raw file):

Previously, ajwerner wrote…

Why these extra parens?

This is because when we intercept the query in the StatementFilter testing knob, the statement has these extra parens even if we don't add them. Since we compare the received statement with this const in TestRegistrySettingUpdate, the statements don't match unless we add these extra parens.


pkg/jobs/adopt.go, line 86 at r3 (raw file):

Previously, ajwerner wrote…

I think your placeholders are off now. Also, $6 isn't the "base" the way we usually talk about bases in exponentials but rather the "starting value".

Done.


pkg/migration/migrations/retry_jobs_with_exponential_backoff.go, line 24 at r3 (raw file):

Previously, ajwerner wrote…

The problem with this approach is that it might return before the schema change is done. Imagine that we start this migration and issue this transaction. It will create the job to do the schema change. Now, while the schema change is running, we crash. When we come back around, the schema change will still be running but the below commands will no-op and we'll return happily. This is not a good place to be. We need to make sure that the table really is changed.

There are a bunch of ways to do this. One way is to check to make sure that the jobs table does not have any running jobs and if it does, to wait for them. You could do that at the top of the loop. There are two basic approaches you could take to that task:

  1. Use a descs.Txn to read the jobs table and check for the new columns or for a pending job
  2. Use sql to do the same thing by using crdb_internal.pb_to_json

I don't have a strong preference.

As we sketched out together, we have taken the first approach.


pkg/migration/migrations/retry_jobs_with_exponential_backoff.go, line 40 at r5 (raw file):

Previously, ajwerner wrote…

Be careful about aligning the naming on the index.

Do you mean the ordering of the columns based on the schema in system.go? I didn't order them correctly.
I added IF NOT EXISTS to cater for transient failures that may happen after the index has been created but before completing the migration.


pkg/sql/catalog/systemschema/system.go, line 961 at r5 (raw file):

Previously, ajwerner wrote…

Speaking of which, this is missing the predicate.

Thanks for pointing out. I have added the predicate.
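
For reference, here is a hedged sketch of what a predicated (partial) index creation could look like, in the style of the addIndexQuery constant quoted above; the index name, stored columns, and exact predicate are assumptions, not necessarily what the PR finally landed:

// addIndexQuerySketch is illustrative only: a partial index restricted to
// claimable rows so the registry's periodic scans touch fewer entries.
const addIndexQuerySketch = `
CREATE INDEX IF NOT EXISTS jobs_run_stats_idx
    ON system.jobs (claim_session_id, status, created)
    STORING (last_run, num_runs, claim_instance_id)
    WHERE status IN ('running', 'reverting', 'pending', 'pause-requested', 'cancel-requested');`

The predicate keeps the index small by covering only the rows the registry actually scans, which is the main reason for creating a new index rather than widening jobs_status_created_idx.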

}

func hasBackoffIndex(jobsTable catalog.TableDescriptor) bool {
const (

func equalMessage(a, b proto.Message) error {
	aBytes, err := protoutil.Marshal(a)
	if err != nil {
		return err
	}
	bBytes, err := protoutil.Marshal(b)
	if err != nil {
		return err
	}
	if bytes.Equal(aBytes, bBytes) {
		return nil
	}
	return errors.Errorf("mismatch for messages of types %T & %T: %s",
		a, b, strings.Join(pretty.Diff(a, b), "\n"))
}

@sajjadrizvi force-pushed the job_retry_backoff branch 6 times, most recently from 3e0a028 to d65f99c, on July 15, 2021 02:14
@sajjadrizvi (Author)

TFTR.

bors r=ajwerner


craig bot commented Aug 13, 2021

This PR was included in a batch that was canceled, it will be automatically retried

craig bot pushed a commit that referenced this pull request Aug 14, 2021
66889: jobs: retry jobs with exponential backoff r=ajwerner a=sajjadrizvi

This commit adds a mechanism to retry jobs with exponentially increasing
delays. This is achieved through two new columns in the system.jobs table,
last_run and num_runs. In addition, this commit adds cluster settings to
control the exponential-backoff parameters, the initial delay and the max
delay: `jobs.registry.retry.initial_delay` and `jobs.registry.retry.max_delay`,
respectively. Finally, this commit adds a new partial index on the jobs table
that improves the performance of the periodic queries that the registry runs
on each node.

Release note (general change): Retries of jobs that fail due to a retryable
error or due to job coordinator failure are now delayed using exponential
backoff. Before this change, jobs that failed in a retryable manner were
resumed immediately on a different coordinator. This change reduces the
impact of recurrently failing jobs on the cluster. It adds two new cluster
settings that control this behavior: "jobs.registry.retry.initial_delay" and
"jobs.registry.retry.max_delay", which control the initial delay and the
maximum delay between resumptions, respectively.

Fixes #44594
Fixes #65080

68212: colexec: add optimized versions of aggregate window functions r=DrewKimball a=DrewKimball

**colexecwindow: add sliding window functionality to window framer**

This commit adds a method `slidingWindowIntervals` to `windowFramer`
operators that returns a set of `toAdd` intervals and a set of
`toRemove` intervals, which indicate the rows that should be added
to the current aggregation and those that should be removed, respectively.
This will be used to implement the sliding window optimization for
aggregate window functions such as `sum`.

**colexecwindow: implement sliding window aggregator**

This commit supplies a new operator, `slidingWindowAggregator`, which
is used for any window aggregate functions that implement the
`slidingWindowAggregateFunc` interface. Rather than aggregating over
the entire window frame for each row, the `slidingWindowAggregator`
operator aggregates over the rows that are in the current window
frame but were not in the previous, and removes from the aggregation
the rows that were in the previous window frame but not the current.
This allows window aggregate functions to be evaluated in linear rather
than quadratic time.
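
As a rough illustration of the sliding-window idea described above (simplified, assumed types; not the actual colexecwindow operators): only the rows entering and leaving the frame are touched per output row, so total work is linear in the input.

package main

import "fmt"

// slidingSum keeps a running sum over a frame [start, end) that only moves
// forward; rows entering the frame are added and rows leaving it are removed,
// instead of re-summing the whole frame for every output row.
type slidingSum struct {
	vals       []float64
	start, end int
	sum        float64
}

// advance moves the frame to [newStart, newEnd) and returns the new sum.
// Both bounds are assumed to be monotonically non-decreasing.
func (s *slidingSum) advance(newStart, newEnd int) float64 {
	for ; s.end < newEnd; s.end++ {
		s.sum += s.vals[s.end] // rows entering the frame
	}
	for ; s.start < newStart; s.start++ {
		s.sum -= s.vals[s.start] // rows leaving the frame
	}
	return s.sum
}

func main() {
	w := &slidingSum{vals: []float64{1, 2, 3, 4, 5}}
	// A frame of "3 PRECEDING to CURRENT ROW" per output row.
	for i := range w.vals {
		start := i - 3
		if start < 0 {
			start = 0
		}
		fmt.Println(w.advance(start, i+1)) // 1 3 6 10 14
	}
}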

**colexec: implement sliding window optimization for sum window function**

This commit modifies the `sum` aggregate window function to implement
the `slidingWindowAggregateFunc`, which allows it to be used in a
sliding window context. This yields linear rather than quadratic scaling
in the worst case, and allows the vectorized engine to meet or exceed
parity with the row engine for `sum` window functions.

**colexec: implement sliding window optimization for count window function**

This commit modifies the count aggregate operator to implement the
`slidingWindowAggregateFunc` interface so that it can be used with
the sliding window optimization.

**colexec: implement sliding window optimization for average window function**

This commit modifies the `average` aggregate operator to implement the
`slidingWindowAggregateFunc` interface so that it can be used with the
sliding window optimization.

**colexec: optimize count_rows window function**

This commit implements an optimized version of `count_rows` that
calculates the size of the window frame as soon as the window frame
is calculated. This means that most of the overhead for `count_rows`
now comes from calculating the window frame, which is worst-case
linear time (previously, the step to retrieve the size of the frame
was quadratic, though with a small constant).

**colexec: optimize min and max window functions with default exclusion**

This commit modifies the 'min' and 'max' aggregate window functions
to implement the `slidingWindowAggregateFunc` interface, which allows
them to be used in a sliding window context. However, this is only
usable when the window frame never shrinks, i.e. the frame always contains
all rows from the previous frame.

This commit also provides implementations of `min` and `max` for use
when the window frame can shrink. The indices of the 'next best'
minimum or maximum values are stored in a priority queue that is
updated for each row. Using the priority queue allows the `min` and
`max` operators to avoid fully aggregating over the window frame
even when the previous best value goes out of scope. Note that this
implementation currently does not handle the case of non-default
exclusion clause, in which case we must fall back to the quadratic
approach.
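
As a rough illustration of the priority-queue approach for min/max described above (simplified, assumed types; not the actual vectorized operators): each row is pushed once, and stale candidates are discarded lazily only when they surface at the top of the heap, so the previous best value going out of scope never forces a full re-aggregation of the frame.

package main

import (
	"container/heap"
	"fmt"
)

// candidate pairs a row index with its value; the heap keeps the largest
// value on top so the current frame maximum can be read off lazily.
type candidate struct {
	idx int
	val float64
}

type maxHeap []candidate

func (h maxHeap) Len() int            { return len(h) }
func (h maxHeap) Less(i, j int) bool  { return h[i].val > h[j].val }
func (h maxHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *maxHeap) Push(x interface{}) { *h = append(*h, x.(candidate)) }
func (h *maxHeap) Pop() interface{} {
	old := *h
	x := old[len(old)-1]
	*h = old[:len(old)-1]
	return x
}

// slidingMax returns, for each row i, the maximum of vals over the frame
// [i-frameSize+1, i], discarding out-of-frame candidates only at the top.
func slidingMax(vals []float64, frameSize int) []float64 {
	h := &maxHeap{}
	out := make([]float64, len(vals))
	for i, v := range vals {
		heap.Push(h, candidate{idx: i, val: v})
		for (*h)[0].idx <= i-frameSize {
			heap.Pop(h)
		}
		out[i] = (*h)[0].val
	}
	return out
}

func main() {
	fmt.Println(slidingMax([]float64{3, 1, 4, 1, 5, 9, 2, 6}, 3)) // [3 3 4 4 5 9 9 9]
}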

Fixes: #37039

Release note (performance improvement): The vectorized engine can now
use the sliding-window approach to execute common aggregate functions 
as window functions. This allows aggregate window functions to be evaluated
in linear rather than quadratic time. Currently, sum, count, average, min, and 
max are executed using this approach.

68433: sql: implemented placement restricted syntax for domiciling r=pawalt a=pawalt

This PR combines the existing restricted placement zone config logic
with the stubbed syntax to create an end-to-end PLACEMENT RESTRICTED
implementation.

Release note: None

Note that the cluster setting for domiciling and telemetry will be added in a later PR.

68818: changefeedccl: mark avro format as no longer experimental r=[miretskiy,spiffyeng] a=HonoreDB

The avro format for changefeeds now supports all column types
and has been in production use for several releases.
We'll now allow format=avro rather than format=experimental_avro.
The old string will remain supported because job payloads can
persist across upgrades and downgrades.

Release note (enterprise change): changefeed avro format no longer marked experimental

Co-authored-by: Sajjad Rizvi <[email protected]>
Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Peyton Walters <[email protected]>
Co-authored-by: Aaron Zinger <[email protected]>

craig bot commented Aug 14, 2021

Build failed (retrying...):


craig bot commented Aug 14, 2021

Build succeeded:

@andreimatei (Contributor)

This patch made TestFailureToMarkCanceledReversalLeadsToCanceledStatus take 2m (from being very fast). Perhaps other tests slowed down too?
Do you mind seeing if that's expected / if anything can be done to the test and, if not, at the very least skip.UnderShort() the test?

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 18, 2021
@ajwerner (Contributor)

This patch made TestFailureToMarkCanceledReversalLeadsToCanceledStatus take 2m (from being very fast). Perhaps other tests slowed down too?
Do you mind seeing if that's expected / if anything can be done to the test and, if not, at the very least skip.UnderShort() the test?

Thanks for the heads up! #69103

craig bot pushed a commit that referenced this pull request Aug 19, 2021
68595: ui,admission: observability improvements for admission control r=sumeerbhola a=sumeerbhola

- Trace statements for latency incurred in admission queues.
- Certain admission control metrics are now included in the
  overload dashboard. Specifically,
  - Resource bottlenecks can be identified using the
    "KV Admission Slots" and "KV Admission IO Tokens Exhausted
    Duration Per Second" graphs.
  - The rate at which admission control is admitting requests
    is in the "Admission Work Rate" graphs and the corresponding
    latency rate (for all requests) is in
    "Admission Latency Rate". Dividing the latter by the former
    gives the mean admission latency.
  - The 75th percentile latency for those requests that actually
    waited for admission is in the
    "Admission Latency: 75th percentile" graph.
  When admission control is off most of these graphs will be
  empty or zero, and the total KV admission slots will be 1.

Informs #65955

Release note (ui change): admission control metrics are added to
Overload dashboard.

69103: sql: make TestFailureToMarkCanceledReversalLeadsToCanceledStatus faster r=sajjadrizvi a=ajwerner

This test was made slow by #66889.

Release note: None

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@andreimatei (Contributor)

Thanks! Maybe check that no other tests slowed similarly by comparing package run times; I haven't done that.

ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 6, 2022
…ation

Schema change upgrade migrations to system tables are made idempotent by
checking that the descriptor reaches some expected state. In order to ensure
that it is in that expected state, some volatile fields need to be masked.
We forgot to mask CreatedAtNanos. We also lost the testing which came with
these helper functions we use.

The vast majority of this PR is reviving testing from cockroachdb#66889.

Fixes cockroachdb#85228.

Release justification: Import bug fix for backport

Release note (bug fix): Some upgrade migrations perform schema changes on
system tables. Those upgrades which added indexes could, in some cases, get
caught retrying because they failed to detect that the migration had already
occurred due to the existence of a populated field. When that happens, the
finalization of the new version can hang indefinitely and require manual
intervention. This bug has been fixed.
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 7, 2022
craig bot pushed a commit that referenced this pull request Sep 8, 2022
87027: streamingccl: reduce server count in multinode tests r=samiskin a=samiskin

While these tests would pass under stress locally, they would fail under CI
stress, which may be because we were starting more server processes than
ever before: 4 source nodes, 4 source tenant pods, and 4 destination nodes.

This PR reduces the node count to 3 (any lower and scatter doesn't
correctly distribute ranges) and only starts a single tenant pod for the
source cluster.

Release justification: test-only change
Release note: None

87412: cli,server: fix --sql-advertise-addr when --sql-addr is not specified r=a-robinson,ajwerner a=knz

Fixes #87040.
Informs #52266.

cc `@a-robinson` 

Release justification: bug fix

Release note (bug fix): The flag `--sql-advertise-addr` now properly
works even when the SQL and RPC ports are shared (because `--sql-addr`
was not specified). Note that this port sharing is a deprecated
feature in v22.2.

87440: ui: update txn contention insights to use waiting txns as event r=ericharmeling a=ericharmeling

This commit updates the transaction workload insights pages to use the waiting
contended transaction as the primary contention event, rather than the blocking
transaction.

Fixes #87284.

https://www.loom.com/share/383fec4297a74ec79d90e46f11def792

Release justification: bug fixes and low-risk updates to new functionality
Release note: None

87462: upgrade/upgrades: allow CreatedAtNanos to be set when validating migration r=ajwerner a=ajwerner

Schema change upgrade migrations to system tables are made idempotent by checking that the descriptor reaches some expected state. In order to ensure that it is in that expected state, some volatile fields need to be masked. We forgot to mask CreatedAtNanos. We also lost the testing which came with these helper functions we use.

The vast majority of this PR is reviving testing from #66889.

Fixes #85228.

Release justification: Import bug fix for backport

Release note (bug fix): Some upgrade migrations perform schema changes on system tables. Those upgrades which added indexes could, in some cases, get caught retrying because they failed to detect that the migration had already occurred due to the existence of a populated field. When that happens, the finalization of the new version can hang indefinitely and require manual intervention. This bug has been fixed.

Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Sep 8, 2022
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Sep 8, 2022