Add command line flag that causes job system to not pick up any jobs #44595

Closed
joshimhoff opened this issue Jan 31, 2020 · 6 comments
Labels: A-jobs, O-sre (for issues SRE opened or otherwise cares about tracking)

Comments

@joshimhoff
Collaborator

joshimhoff commented Jan 31, 2020

Is your feature request related to a problem? Please describe.
This bug leads to panics when users run the IMPORT INTO job on 19.2.2: #44252.

The impact can be very high. See this graph of the SQL prober error rate:

[image: graph of the SQL prober error rate during the incident]

50-100% error rate for 1hr!

The nodes crash at a fast enough rate that (a) the cluster is more or less entirely unavailable to the customer for the duration of the incident, and (b) it is hard for an operator to get a SQL connection that lives long enough to cancel the problematic jobs (which is why mitigation takes around 1hr).

How can we reduce impact / make it easier to mitigate this issue?

  1. If a job fails, the job system could do an exponential backoff.
  2. If a job fails repeatedly and the job system detects that the failures are caused by dying CRDB nodes, the job system could mark the job as a "job of death" and not retry it.
  3. If an operator passes a command line flag to CRDB, the job system could decline to pick up any jobs.

This issue tracks option 3 only.

I'm suggesting concrete solutions to get a conversation started, but I'm more interested in reducing this failure mode's very high impact than in any particular solution!

Describe the solution you'd like
If an operator passes a command line flag to CRDB, the job system should not run any jobs. You could imagine a cluster setting for this functionality instead, but the issue with that is that if a job is causing panics, it can be very hard to get a SQL connection that lives long enough to issue any SQL statements. @carloruiz can chime in on the difficulty of cancelling jobs during such an incident; this difficulty contributes directly to the incident's length.
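
For concreteness, here's a minimal sketch of how such a flag could gate the job adoption loop. This is hypothetical code, not CRDB's actual jobs registry; the flag name `--disable-job-adoption`, the 30s adoption interval, and the function names are all made up:

```go
// Hypothetical sketch only; names and structure do not match
// CockroachDB's actual jobs registry.
package main

import (
	"flag"
	"log"
	"time"
)

// A node started with --disable-job-adoption never claims jobs, so a
// "job of death" cannot be re-adopted and crash the node again.
var disableJobAdoption = flag.Bool("disable-job-adoption", false,
	"if set, this node will not adopt or resume any jobs")

// jobAdoptionLoop periodically claims and runs unowned jobs, unless
// adoption has been disabled via the flag.
func jobAdoptionLoop(claimAndRunJobs func() error) {
	for range time.Tick(30 * time.Second) {
		if *disableJobAdoption {
			continue // skip this adoption cycle entirely
		}
		if err := claimAndRunJobs(); err != nil {
			log.Printf("job adoption failed: %v", err)
		}
	}
}

func main() {
	flag.Parse()
	jobAdoptionLoop(func() error {
		// Placeholder: claim unowned rows in system.jobs and resume them.
		return nil
	})
}
```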

Describe alternatives you've considered
See 1, 2, and 3 from the above list.

@ajwerner @pbardea @spaskob @carloruiz @DuskEagle @chrisseto @vilterp @vladdy

@joshimhoff added the O-sre (for issues SRE opened or otherwise cares about tracking) and A-jobs labels on Jan 31, 2020
@dt
Member

dt commented Jan 31, 2020

I like having this, and having it live outside SQL / not requiring online interaction, but I might vote for a file that, if present, blocks the job adoption system (and potentially cancels the contexts of locally running job functions?)

I'd prefer the file over a CLI flag since you could touch or remove said file without a restart -- if you're in the weeds already (which you probably are if you need to use this), adding a forced restart could further complicate things.

@ajwerner
Contributor

I like the file idea! I’d be okay with a flag that leads to the file being written at startup - that might make things easier in a k8s deployment.

@joshimhoff
Collaborator Author

File sounds nice! File + send a signal to tell CRDB to re-read the file? We could easily script writing the file and sending the signal on the CC platform, so I don't feel a flag is needed. @vladdy's operator could even eventually support this op.

@ajwerner
Contributor

We don’t even need a signal; the job adoption loop could just check the file every time it goes around.
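
To illustrate, a sketch of that polling approach (the sentinel path and all names here are hypothetical, not real CRDB code): since the loop stats the file each time around, touching or removing it takes effect within one cycle, with no restart or signal required.

```go
// Hypothetical sketch only; the sentinel path and names are made up.
package main

import (
	"log"
	"os"
	"time"
)

const sentinelPath = "cockroach-data/DISABLE_JOB_ADOPTION"

// adoptionDisabled reports whether the sentinel file currently exists.
func adoptionDisabled() bool {
	_, err := os.Stat(sentinelPath)
	return err == nil // file present => skip adoption
}

func jobAdoptionLoop(claimAndRunJobs func() error) {
	for range time.Tick(30 * time.Second) {
		// Re-check the file on every iteration, so an operator can
		// toggle adoption with touch/rm while the node keeps running.
		if adoptionDisabled() {
			log.Printf("%s present; skipping job adoption", sentinelPath)
			continue
		}
		if err := claimAndRunJobs(); err != nil {
			log.Printf("job adoption failed: %v", err)
		}
	}
}

func main() {
	jobAdoptionLoop(func() error {
		return nil // placeholder for claiming and resuming jobs
	})
}
```

With something like this, `touch cockroach-data/DISABLE_JOB_ADOPTION` pauses adoption and `rm` resumes it, and it composes with the earlier suggestion of a startup flag that simply writes the file for k8s deployments.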

@joshimhoff
Collaborator Author

Gotcha.

@dt
Member

dt commented Mar 31, 2020

#44786

@dt dt closed this as completed Mar 31, 2020