Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: replace circuit breaker cancel goroutine with per-Replica registry #74707

Closed
tbg opened this issue Jan 11, 2022 · 0 comments · Fixed by #75224
Closed

kvserver: replace circuit breaker cancel goroutine with per-Replica registry #74707

tbg opened this issue Jan 11, 2022 · 0 comments · Fixed by #75224
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Copy link
Member

tbg commented Jan 11, 2022

Touches #33007.

As of #71806, replica circuit breakers default to off since they use a simple
but inefficient mechanism to fail-fast requests that are inflight while the
breaker trips.

We want to improve on that by adding a registry of the cancel functions on Replica
and by letting the probe cancel everyone. This saves one goroutine per request and
should have a small perf impact once suitably optimized.

Prerequisite for #74705.

Describe the solution you'd like

type Replica struct {
  // ...
  breakerCancelsMu {
    syncutil.Mutex
    sl []func()
  }
}

func (r *Replica) sendWithRangeID(...) {
  ctx, cancel := ...
  r.breakerCancelsMu.Lock()
  r.breakerCancelsMu.sl = append(..., cancel)
  r.breakerCancelsMu.Unlock()
}
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 11, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the "slowness"
existing detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is off via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
craig bot pushed a commit that referenced this issue Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007.
Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address
issue #33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in #72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see #74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); #74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg self-assigned this Jan 14, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 25, 2022
…egistry

Total WIP for cockroachdb#74707. Don't review until out of draft.

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 31, 2022
…egistry

PR cockroachdb#71806 originally landed with a very simple solution for the problem
of joining the breaker's cancellation chain with that of the incoming
context: spawning a goroutine. This was never meant to be the solution
in prod (spawning goroutines is not cheap) and this PR switches to a
real solution (which may still have soom room for improvement).

Replica circuit breakers are still off (cockroachdb#74705) and so this change
should not change any behavior, unless breakers are turned on. The
provided benchmarks show both the "end-to-end" `Replica.Send` as well as
the micro-bench `breaker.{R,Un}egister` perf and contrast it with that
of having the breakers disabled.

This sets the stage for evaluating and, if necessary, minimizing the
overhead, which (along with some additional end-to-end testing) then
allows us to turn breakers on by default (cockroachdb#74705).

`BenchmarkReplicaCircuitBreakerSendOverhead` measures `Replica.Send` of
an no-result single-point read. We see that we lose a few percent,
though this benchmark is pretty micro already.

`BenchmarkReplicaCircuitBreaker_Register` is even more micro, measuring
the overhead of making a cancel, and calling `Register()` followed by
`Unregister()`. The "old" measurement is essentially just creating a
cancelable context (so the +XXXX% numbers don't mean a lot; we're doing
work for which there was previously no analog). Here we can clearly see
that sharding the mutex map can help, though note that this would
already only be visible at the `Replica.Send` level at much higher
concurrencies than in this test (16).

I also ran the kv95 roachtest with what corresponds to the `mutexmap-1`
strategy above (i.e. a single mutex); [raw data here]. In this
configuration (relative to disabling breakers) `kv95/enc=false/nodes=1`
and `kv95/enc=false/nodes=1/cpu=32` sustained a -1.62% hit (in qps,
median over five runs). Once we've chosen a suitable shard count, I
expect that to drop a bit as well, and think we can turn on breakers in
production, at least from a performance point of view.

[raw data here]: https://gist.github.com/tbg/44b53a10d18490a6dabd57e1d1d3225e

old = disabled, new = enabled:

```
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.63µs ± 4%    1.77µs ± 5%     +8.38%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.66µs ± 4%    1.75µs ± 5%     +5.74%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.63µs ± 6%    1.75µs ± 4%     +7.48%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.58µs ± 7%    1.73µs ± 4%     +9.47%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.57µs ± 4%    1.62µs ± 6%     +3.19%  (p=0.046 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.56µs ± 5%    1.63µs ± 7%     +4.20%  (p=0.007 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.53µs ± 5%    1.62µs ± 8%     +6.01%  (p=0.001 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.55µs ± 7%    1.61µs ± 5%     +3.50%  (p=0.049 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.57µs ± 7%    1.62µs ± 6%     +3.53%  (p=0.042 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.53µs ± 7%    1.64µs ± 5%     +6.84%  (p=0.000 n=11+12)

name                                                old alloc/op   new alloc/op   delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.42kB ± 0%    1.44kB ± 0%     +1.25%  (p=0.000 n=7+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.42kB ± 0%    1.44kB ± 0%     +1.19%  (p=0.000 n=9+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.42kB ± 0%    1.43kB ± 0%     +1.14%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.42kB ± 0%    1.43kB ± 0%     +1.24%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.42kB ± 0%    1.43kB ± 0%     +1.08%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.42kB ± 0%    1.43kB ± 0%     +1.14%  (p=0.000 n=10+10)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.42kB ± 0%    1.43kB ± 0%     +1.12%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.42kB ± 0%    1.43kB ± 0%     +1.09%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.41kB ± 0%    1.43kB ± 0%     +1.13%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.41kB ± 0%    1.43kB ± 0%     +1.23%  (p=0.000 n=9+12)

name                                                old allocs/op  new allocs/op  delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
```

```
ReplicaCircuitBreaker_Register/X/mutexmap-1-16        30.8ns ±21%   787.6ns ± 2%  +2457.74%  (p=0.000 n=11+11)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16        31.6ns ± 3%   782.0ns ± 2%  +2376.38%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16        31.0ns ± 0%   778.8ns ± 2%  +2409.61%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16        31.0ns ± 0%   775.8ns ± 1%  +2403.11%  (p=0.000 n=10+10)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16       31.0ns ± 1%   288.8ns ± 2%   +833.06%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16       31.1ns ± 1%   324.6ns ± 4%   +945.05%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16       31.1ns ± 0%   193.3ns ± 2%   +522.26%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16       31.0ns ± 0%   286.1ns ± 1%   +822.80%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16       31.0ns ± 0%   194.4ns ± 2%   +527.12%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16       31.0ns ± 1%   121.3ns ± 2%   +291.01%  (p=0.000 n=10+12)

name                                                old alloc/op   new alloc/op   delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)

name                                                old allocs/op  new allocs/op  delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
```

Created via:

```
$ ./dev bench ./pkg/kv/kvserver/ -f BenchmarkReplicaCircuitBreaker --count 10 | tee out.txt
$ for f in enabled=true enabled=false; do grep -F "${f}" out.txt | sed "s/${f}/X/" > "${f}"; done && benchstat 'enabled=false' 'enabled=true'
```

For the record, I also tried a `sync.Map`, but it was vastly inferior
for this use case as it is allocation-heavy and its writes are fairly
slow.

Touches cockroachdb#33007.
Fixes cockroachdb#74707.

Release note: None
craig bot pushed a commit that referenced this issue Feb 4, 2022
75224: kvserver: replace circuit breaker cancel goroutine with per-Replica registry r=erikgrinaker a=tbg

PR #71806 originally landed with a very simple solution for the problem
of joining the breaker's cancellation chain with that of the incoming
context: spawning a goroutine. This was never meant to be the solution
in prod (spawning goroutines is not cheap) and this PR switches to a
real solution (which may still have soom room for improvement).

Replica circuit breakers are still off (#74705) and so this change
should not change any behavior, unless breakers are turned on. The
provided benchmarks show both the "end-to-end" `Replica.Send` as well as
the micro-bench `breaker.{R,Un}egister` perf and contrast it with that
of having the breakers disabled.

This sets the stage for evaluating and, if necessary, minimizing the
overhead, which (along with some additional end-to-end testing) then
allows us to turn breakers on by default (#74705).

`BenchmarkReplicaCircuitBreakerSendOverhead` measures `Replica.Send` of
an no-result single-point read. We see that we lose a few percent,
though this benchmark is pretty micro already.

`BenchmarkReplicaCircuitBreaker_Register` is even more micro, measuring
the overhead of making a cancel, and calling `Register()` followed by
`Unregister()`. The "old" measurement is essentially just creating a
cancelable context (so the +XXXX% numbers don't mean a lot; we're doing
work for which there was previously no analog). Here we can clearly see
that sharding the mutex map can help, though note that this would
already only be visible at the `Replica.Send` level at much higher
concurrencies than in this test (16).

I also ran the kv95 roachtest with what corresponds to the `mutexmap-1`
strategy above (i.e. a single mutex); [raw data here]. In this
configuration (relative to disabling breakers) `kv95/enc=false/nodes=1`
and `kv95/enc=false/nodes=1/cpu=32` sustained a -1.62% hit (in qps,
median over five runs). Once we've chosen a suitable shard count, I
expect that to drop a bit as well, and think we can turn on breakers in
production, at least from a performance point of view.

[raw data here]: https://gist.github.com/tbg/44b53a10d18490a6dabd57e1d1d3225e

old = disabled, new = enabled:

```
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.63µs ± 4%    1.77µs ± 5%     +8.38%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.66µs ± 4%    1.75µs ± 5%     +5.74%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.63µs ± 6%    1.75µs ± 4%     +7.48%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.58µs ± 7%    1.73µs ± 4%     +9.47%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.57µs ± 4%    1.62µs ± 6%     +3.19%  (p=0.046 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.56µs ± 5%    1.63µs ± 7%     +4.20%  (p=0.007 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.53µs ± 5%    1.62µs ± 8%     +6.01%  (p=0.001 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.55µs ± 7%    1.61µs ± 5%     +3.50%  (p=0.049 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.57µs ± 7%    1.62µs ± 6%     +3.53%  (p=0.042 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.53µs ± 7%    1.64µs ± 5%     +6.84%  (p=0.000 n=11+12)

name                                                old alloc/op   new alloc/op   delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16     1.42kB ± 0%    1.44kB ± 0%     +1.25%  (p=0.000 n=7+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16     1.42kB ± 0%    1.44kB ± 0%     +1.19%  (p=0.000 n=9+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16     1.42kB ± 0%    1.43kB ± 0%     +1.14%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16     1.42kB ± 0%    1.43kB ± 0%     +1.24%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16    1.42kB ± 0%    1.43kB ± 0%     +1.08%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16    1.42kB ± 0%    1.43kB ± 0%     +1.14%  (p=0.000 n=10+10)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16    1.42kB ± 0%    1.43kB ± 0%     +1.12%  (p=0.000 n=10+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16    1.42kB ± 0%    1.43kB ± 0%     +1.09%  (p=0.000 n=11+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16    1.41kB ± 0%    1.43kB ± 0%     +1.13%  (p=0.000 n=9+11)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16    1.41kB ± 0%    1.43kB ± 0%     +1.23%  (p=0.000 n=9+12)

name                                                old allocs/op  new allocs/op  delta
ReplicaCircuitBreakerSendOverhead/X/mutexmap-1-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-2-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-4-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-8-16       11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-12-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-16-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-20-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-24-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-32-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
ReplicaCircuitBreakerSendOverhead/X/mutexmap-64-16      11.0 ± 0%      12.0 ± 0%     +9.09%  (p=0.000 n=11+12)
```

```
ReplicaCircuitBreaker_Register/X/mutexmap-1-16        30.8ns ±21%   787.6ns ± 2%  +2457.74%  (p=0.000 n=11+11)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16        31.6ns ± 3%   782.0ns ± 2%  +2376.38%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16        31.0ns ± 0%   778.8ns ± 2%  +2409.61%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16        31.0ns ± 0%   775.8ns ± 1%  +2403.11%  (p=0.000 n=10+10)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16       31.0ns ± 1%   288.8ns ± 2%   +833.06%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16       31.1ns ± 1%   324.6ns ± 4%   +945.05%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16       31.1ns ± 0%   193.3ns ± 2%   +522.26%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16       31.0ns ± 0%   286.1ns ± 1%   +822.80%  (p=0.000 n=9+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16       31.0ns ± 0%   194.4ns ± 2%   +527.12%  (p=0.000 n=10+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16       31.0ns ± 1%   121.3ns ± 2%   +291.01%  (p=0.000 n=10+12)

name                                                old alloc/op   new alloc/op   delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16         80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16        80.0B ± 0%     96.0B ± 0%    +20.00%  (p=0.000 n=11+12)

name                                                old allocs/op  new allocs/op  delta
ReplicaCircuitBreaker_Register/X/mutexmap-1-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-2-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-4-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-8-16          2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-12-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-16-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-20-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-24-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-32-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
ReplicaCircuitBreaker_Register/X/mutexmap-64-16         2.00 ± 0%      3.00 ± 0%    +50.00%  (p=0.000 n=11+12)
```

Created via:

```
$ ./dev bench ./pkg/kv/kvserver/ -f BenchmarkReplicaCircuitBreaker --count 10 | tee out.txt
$ for f in enabled=true enabled=false; do grep -F "${f}" out.txt | sed "s/${f}/X/" > "${f}"; done && benchstat 'enabled=false' 'enabled=true'
```

For the record, I also tried a `sync.Map`, but it was vastly inferior
for this use case as it is allocation-heavy and its writes are fairly
slow.

Touches #33007.
Fixes #74707.

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
@craig craig bot closed this as completed in 5d16099 Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant