
kvserver: prevent build-up of abandoned consistency checks #76855

Closed
tbg wants to merge 5 commits

Conversation


@tbg tbg commented Feb 21, 2022

We've seen in the events leading up to #75448 that a build-up of
consistency check computations on a node can severely impact node
performance. This commit attempts to address the main source of
that, while re-working the code for easier maintainability.

The way the consistency checker works is by replicating a command through
Raft that, on each Replica, triggers an async checksum computation, the
results of which the caller collects via CollectChecksum requests
addressed to each Replica.
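
A minimal Go sketch may make the flow concrete. This is not the actual
CockroachDB code; the names (inflightChecks, startComputation, etc.) are
hypothetical, but the shape matches the description above: the applied
command kicks off an async computation keyed by a checksum ID, and
CollectChecksum blocks on the result.

```go
package consistency

import (
	"context"
	"sync"
)

// inflightChecks tracks the async checksum computations on a Replica,
// keyed by a checksum ID carried in the replicated command.
// (Hypothetical sketch, not the actual implementation.)
type inflightChecks struct {
	mu sync.Mutex
	m  map[string]chan []byte // checksum ID -> result
}

func newInflightChecks() *inflightChecks {
	return &inflightChecks{m: make(map[string]chan []byte)}
}

// startComputation runs when the replicated command applies on a Replica:
// it kicks off the expensive checksum computation asynchronously.
func (ic *inflightChecks) startComputation(
	ctx context.Context, id string, compute func(context.Context) []byte,
) {
	ch := make(chan []byte, 1)
	ic.mu.Lock()
	ic.m[id] = ch
	ic.mu.Unlock()
	go func() { ch <- compute(ctx) }()
}

// collectChecksum serves a CollectChecksum request: it blocks until the
// computation finishes or the caller's context is canceled.
func (ic *inflightChecks) collectChecksum(ctx context.Context, id string) ([]byte, error) {
	ic.mu.Lock()
	ch := ic.m[id]
	ic.mu.Unlock()
	select {
	case res := <-ch:
		return res, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```

Note that in this sketch nothing stops the computation if the caller never
issues CollectChecksum, which is exactly the build-up described next.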

If, for any reason, the caller does not wait to collect the checksums
but instead moves on to run another consistency check (perhaps on
another Range), these inflight computations can build up over time.

This was the main issue in #75448: we were accidentally canceling the
context on the leaseholder "right away", failing the consistency check
(but leaving it running on all other replicas), and moving on to the
next Range.
As a result, some (and, with spread-out leaseholders, ultimately all)
Replicas ended up with dozens of consistency check computations,
starving I/O and CPU. We "addressed" this by avoiding the errant ctx
cancellation (#75448; longer-term, #75656), but that isn't a holistic
fix yet.

In this commit, we make three main changes:

  • give the inflight consistency check computations a clean API, which
    makes it much easier to understand "how it works".
  • when returning from CollectChecksum (either on success or error,
    notably including context cancellation), cancel the corresponding
    consistency check. This solves the problem, assuming that
    CollectChecksum is reliably issued to each Replica.
  • reliably issue CollectChecksum to each Replica on which a computation
    may have been triggered. When the caller's context is canceled, still
    make the call, using a one-off Context with a one-second timeout, which
    should be enough to reach the Replica and short-circuit the call there
    (see the sketch after this list).
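
Here is the promised sketch of the latter two changes, reusing the
hypothetical names from the sketch above (the actual PR code differs): the
inflight entry now carries the cancel func of the computation's context,
collectChecksum cancels it no matter how it returns, and the caller
substitutes a short one-off Context when its own is already canceled.

```go
package consistency

import (
	"context"
	"time"
)

// inflightCheck pairs the result channel with the cancel func of the
// computation's context. (Hypothetical sketch, not the PR's actual code.)
type inflightCheck struct {
	result chan []byte
	cancel context.CancelFunc
}

// collectChecksum serves CollectChecksum. However it returns -- result,
// error, or caller-context cancellation -- it cancels the underlying
// computation, so an abandoned check cannot keep running.
func collectChecksum(ctx context.Context, c *inflightCheck) ([]byte, error) {
	defer c.cancel()
	select {
	case res := <-c.result:
		return res, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// collectCtx returns the caller's ctx or, if it is already canceled, a
// one-second one-off Context: enough for the CollectChecksum call to
// reach the Replica and short-circuit the computation there.
func collectCtx(ctx context.Context) (context.Context, context.CancelFunc) {
	if ctx.Err() != nil {
		return context.WithTimeout(context.Background(), time.Second)
	}
	return ctx, func() {}
}
```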

Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg requested review from erikgrinaker and a team February 21, 2022 15:35
@tbg tbg (Author) commented Feb 21, 2022

Please don't review yet (not sure if requesting reviews on a draft sends out notifications; I assume it does)

@erikgrinaker erikgrinaker (Contributor) commented

> not sure if requesting reviews on a draft sends out notifications; I assume it does

It does. Will hold off.

Commits:

  • Purely mechanical. (Release note: None)
  • Adding a comment while I'm there. (Release note: None)
  • The main commit, whose message repeats the PR description above. (Release note: None)
@tbg tbg (Author) commented Sep 20, 2022

Properly fixed in #86883

@tbg tbg closed this Sep 20, 2022