Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: allow blocking job adoption via sentinel file #44786

Merged
merged 1 commit into from
Feb 6, 2020

Conversation

dt
Copy link
Member

@dt dt commented Feb 6, 2020

This change adds a check for a sentinel file, DISABLE_STARTING_BACKGROUND_JOBS,
in the first on-disk store directory to the job adoption loop. If it is found,
jobs will not be adopted by that node and current job executions are cancelled.

Operators can thus touch this file if a misbehaving job is causing problems
and otherwise preventing traditional job control that requires being able to
write to the database.

Release note (general change): background job execution is disabled on nodes where the file DISABLE_STARTING_BACKGROUND_JOBS exists in the first store directory providing an emergency tool for disabling job execution.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the store directory better than the external io directory? I can see arguments both ways.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @joshimhoff, and @spaskob)

@dt
Copy link
Member Author

dt commented Feb 6, 2020

I’m pretty sure it should not be in the external IO dir:
a) namespace: users can create files in that dir via upload but can’t delete them so it’d be bad if they uploaded this file
b) existence: external io dir can be disabled entirely
c) shared state: external io dir is supposed to be backed by nfs but I like that this as it is here let’s me take an individual node out / back in if needed

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @joshimhoff and @spaskob)

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modulo the broken test

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @joshimhoff and @spaskob)

Copy link
Contributor

@spaskob spaskob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks David!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @joshimhoff and @spaskob)

@joshimhoff
Copy link
Collaborator

This is awesome! Thanks, David. Can't wait to try it out.

This change adds a check for a sentinel file, DISABLE_STARTING_BACKGROUND_JOBS,
in the first on-disk store directory to the job adoption loop. If it is found,
jobs will not be adopted by that node and current job executions are cancelled.

Operators can thus touch this file if a misbehaving job is causing problems
and otherwise preventing traditional job control that requires being able to
write to the database.

Release note (general change): background job execution is disabled on nodes where the file DISABLE_STARTING_BACKGROUND_JOBS exists in the first store directory providing an emergency tool for disabling job execution.
@dt
Copy link
Member Author

dt commented Feb 6, 2020

bors r+

craig bot pushed a commit that referenced this pull request Feb 6, 2020
44458: storage/engine: MVCC Metamorphic test suite, first phase r=itsbilal a=itsbilal

This PR adds a new test-only sub-package to engine, metamorphic,
which has one test, TestMeta, that generates and runs random MVCC
operations on rocksdb and pebble instances with default settings.

Future additions to this test suite could include:

- [x]  A "check" mode that takes an output file as input, parses it, runs the operations in that sequence, and compares output strings.
- [ ]  Diffing test output between rocksdb and pebble and failing if there's a difference
- [ ]  Adding support for more operations
- [ ]  Adding a "restart" operation that closes the engine and restarts a different kind of engine in the store directory, then confirming operations after that point generate the same output.

First-but-biggest part of #43762 .

Release note: None

44730: storage: fix some tests that were fooling themselves r=andreimatei a=andreimatei

A couple of tests wanted multiple write too old errors, but they were
running at the wrong timestamp and so they were really getting a single
one.
Also add a test showing a funky scenario where 1PC batches behave
differently from non-1PC.

Release note: None

44786: jobs: allow blocking job adoption via sentinel file r=dt a=dt

This change adds a check for a sentinel file, DISABLE_STARTING_BACKGROUND_JOBS,
in the first on-disk store directory to the job adoption loop. If it is found,
jobs will not be adopted by that node and current job executions are cancelled.

Operators can thus touch this file if a misbehaving job is causing problems
and otherwise preventing traditional job control that requires being able to
write to the database.

Release note (general change): background job execution is disabled on nodes where the file DISABLE_STARTING_BACKGROUND_JOBS exists in the first store directory providing an emergency tool for disabling job execution.

Co-authored-by: Bilal Akhtar <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: David Taylor <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 6, 2020

Build succeeded

@craig craig bot merged commit d19f12a into cockroachdb:master Feb 6, 2020
@dt dt deleted the block-adopt branch February 7, 2020 16:03
@pbardea
Copy link
Contributor

pbardea commented Mar 30, 2020

@dt does this close #44595?

craig bot pushed a commit that referenced this pull request Oct 31, 2022
90176: cli/start: unify code between `cockroach start` and `cockroach mt start-sql` r=andreimatei a=knz

Fixes #89974.
Fixes #90831.
Fixes #90524.

This PR merges the server initialization code between `cockroach start` and `cockroach mt start-sql`.

In doing so, it brings `cockroach mt start-sql` closer to what we expect from proper CockroachDB server processes:

- it fixes support for the (test-only)
  `COCKROACH_EXPERIMENTAL_LINEARIZABLE` env var. (from #4754)

- it adds a tracing span for the startup code. (from #8712!!)

- it properly supports `--listening-url-file`. (from #15468)

- it properly does sanitization of `--external-io-dir`. (from #19725)

- it sets the proper log severity level for gRPC. (from #20308)

- it reports the command-line and env config to logs. (from #21344)

- it prevents startup if there is a `_CRITICAL_ALERT.txt` file
  in the store directory. (from #42401)

- sets the umask for newly created file to remove "other" permission
  bits. This was a security team request originally. (from #44043)

- it recovers support for `DISABLE_STARTING_BACKGROUND_JOBS`. (from #44786)

- it sets `GOMAXPROCS` from current cgroup limits. (from #57390)

- it stops the server early if the storage disk is full. (from #66893)

- it fixes support for the `COCKROACH_DISABLE_SPAN_CONFIGS` config env
  var. (from #73876)

See the individual commit for details.



90660: sql: add contention_events to cluster_execution_insights r=j82w a=j82w

The original contention column will remain
to make query operations faster. The events
are being put into a json column because it's
possible there could be multiple blocking events
for a single statement. The json column avoids the complexity of adding another table and keeping it
in sync with the insights table.

The table can be joined with index_columns and tables to get the database name, table name, and index name that contention occurred on. This does not contain the blocking statement information, and the blocking fingerprint id.

closes: #88561

Release note (sql change): Adds contention_events
to cluster_execution_insights. This is used
to see which transaction is blocking the specific
statement.

90719: opgen: added a bool field in struct opgen.transition r=Xiang-Gu a=Xiang-Gu

This PR adds a bool field in struct opgen.transition that indicates whether it results from a `equiv(xx)` transition spec in the opgen file. It will be useful for a test where we need to find the inital status on a adding/dropping path. Without such a change, it can be problematic if we have a `equiv(xx)` spec as the first transition. E.g.
```
  ToAbsent(
    PUBLIC,
    equiv(VALIDATED),
    to(WRITE_ONLY),
    to(ABSENT),
  )
```
Without this change, the inital status will confusingly be `VALIDATED`, and the next status will be `PUBLIC`.
With this change, the initial status will be `PUBLIC`, and the next status will be `WRITE_ONLY`.

We also added some comments when we make transitions from the specs.

Epic: None
Release note: None

90865: sql: use bare name string of new pk  to compare with pk name when altering primary key r=chengxiong-ruan a=chengxiong-ruan

Fixes #90836

Release note (sql change): previously, the `DROP CONSTRAINT, ADD CONSTRAINT` in one trick to have a new primary key without moving old primary key to be a secondary index didn't work if the primary key name is a reserved SQL keyword. A `constraint already exists` error was returned. This patch fixed the bug, the trick now also works with primary key named as reserved keywords.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants