Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: remove special-casing for lease index 0 in refreshProposalsLocked #74711

Closed
tbg opened this issue Jan 11, 2022 · 0 comments · Fixed by #74810
Closed

kvserver: remove special-casing for lease index 0 in refreshProposalsLocked #74711

tbg opened this issue Jan 11, 2022 · 0 comments · Fixed by #74810
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Copy link
Member

tbg commented Jan 11, 2022

Touches #33007.

Is your feature request related to a problem? Please describe.

We have this code that essentially nukes any lease-applied-index-zero proposal that shows up in a refresh:

if p.command.MaxLeaseIndex == 0 {
// Commands without a MaxLeaseIndex cannot be reproposed, as they might
// apply twice. We also don't want to ask the proposer to retry these
// special commands.
r.cleanupFailedProposalLocked(p)
log.VEventf(p.ctx, 2, "refresh (reason: %s) returning AmbiguousResultError for command "+
"without MaxLeaseIndex: %v", reason, p.command)
p.finishApplication(ctx, proposalResult{Err: roachpb.NewError(
roachpb.NewAmbiguousResultError(
fmt.Sprintf("unknown status for command without MaxLeaseIndex "+
"at refreshProposalsLocked time (refresh reason: %s)", reason)))})
continue
}

The only such proposals are in fact leases:

// Request a new max lease applied index for any request that isn't itself
// a lease request. Lease requests don't need unique max lease index values
// because their max lease indexes are ignored. See checkForcedErr.
if !p.Request.IsLeaseRequest() {
b.assignedLAI++
}
lai := b.assignedLAI

Some problems with terminating these proposals early:

  • the proposal's age isn't checked, so this might nuke a perfectly good lease proposal that was literally just inserted
  • it's getting in the way in kvserver: circuit-break requests to unavailable ranges #71806: we are trying to use the existence of old proposals as a trigger for the per-replica circuit breaker, but if there isn't a lease, then the only proposal is going to be a lease, and it's going to be terminated every ~3s which is much more aggressive than the timeout we're setting for what it means to be "old". So the breaker never trips but it should. (This is why I am filing this issue)

Describe the solution you'd like

I believe that leases have their own replay protection through the lease
sequence number. There is a surprising amount of logic in the lease code and so
I can't be entirely sure. Perhaps there are issues where multiple leases might
be proposed with the same sequence number (due to being Equivalent) but get
reordered? Equivalent is not commutative and it checks that the "later" lease
has a longer expiration (in the case of expiration-based leases) so there it
seems fine. And for epoch-based leases, basically only the epoch and start time
matter, which are always the same, so there isn't a problem either.

If we believe lease-related requests can safely be reproposed, since they are
the only requests special-cased in this way in refreshProposalsLocked, the
simplest solution would be to remove the special casing, or to defer their
removal until the breaker is tripped.

@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 11, 2022
tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the "slowness"
existing detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not)
refreshed (cockroachdb#74711) and we need to remember the ticks at which a proposal
was first inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is off via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).
Primarily though, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are

- UI work (cockroachdb#74713)
- Metrics (cockroachdb#74505)

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 11, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see cockroachdb#74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); cockroachdb#74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
tbg added a commit to tbg/cockroach that referenced this issue Jan 13, 2022
Fixes cockroachdb#33007.
Closes cockroachdb#61311.

This PR uses the circuit breaker package introduced in cockroachdb#73641 to address
issue cockroachdb#33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within the timeout specified by the
`kv.replica_circuit_breaker.slow_replication_threshold` cluster setting,
the breaker is tripped, cancelling all inflight and future requests
until the breaker heals. Perhaps surprisingly, the existing "slowness"
detection (the informational "have been waiting ... for proposal"
warning in `executeWriteBatch`) was moved deeper into replication
(`refreshProposalsLocked`), where it now trips the breaker.  This has
the advantage of providing a unified path for lease requests (which
don't go through `executeWriteBatch`) and pipelined writes (which return
before waiting on the inflight replication process). To make this work,
we need to pick a little fight with how leases are (not) refreshed
(cockroachdb#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

Perhaps surprisingly, when the breaker is tripped, *all* traffic to
the Replica gets the fail-fast behavior, not just mutations. This is
because even though a read may look good to serve based on the lease,
we would also need to check for latches, and in particular we would
need to fail-fast if there is a transitive dependency on any write
(since that write is presumably stuck). This is not trivial and so
we don't do it in this first iteration (see cockroachdb#74799).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in cockroachdb#72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, the cluster setting defaults to zero (disabling the
entire mechanism) until some necessary follow-up items have been
addressed (see cockroachdb#74705). For example, the breaker-sensitive context
cancelation is a toy implementation that comes with too much of a
performance overhead (one extra goroutine per request); cockroachdb#74707 will
address that.  Other things not done here that we certainly want in the
22.1 release are UI work (cockroachdb#74713) and metrics (cockroachdb#74505).

The release note is deferred to cockroachdb#74705 (where breakers are turned on).

Release note: None

touchie
craig bot pushed a commit that referenced this issue Jan 14, 2022
71806: kvserver: circuit-break requests to unavailable ranges r=erikgrinaker a=tbg

Fixes #33007.
Closes #61311.

This PR uses the circuit breaker package introduced in #73641 to address
issue #33007: When a replica becomes unavailable, it should eagerly
refuse traffic that it believes would simply hang.

Concretely, every request to the Replica gets a cancelable context that
is sensitive to the circuit breaker tripping (relying on context
cancellation makes sure we don't end up with a second channel that needs
to be plumbed to everyone and their dog; Go makes it really difficult to
join two cancellation chains); if the breaker is tripped when the
request arrives, it is refused right away. In either case, the outgoing
error is augmented to carry information about the tripped breaker. This
isn't 100% trivial, since retreating from in-flight replication
typically causes an `AmbiguousResultError`, and while we could avoid it
in some cases we can't in all. A similar problem occurs when the
canceled request was waiting on a lease, in which case the result is a
NotLeaseholderError.

For any request that made it in, if it enters replication but does not
manage to succeed within `base.SlowRequestThreshold` (15s at time of
writing), the breaker is tripped, cancelling all inflight and future
requests until the breaker heals. Perhaps surprisingly, the existing
"slowness" detection (the informational "have been waiting ... for
proposal" warning in `executeWriteBatch`) was moved deeper into
replication (`refreshProposalsLocked`), where it now trips the breaker.
This has the advantage of providing a unified path for lease requests
(which don't go through `executeWriteBatch`) and pipelined writes (which
return before waiting on the inflight replication process). To make this
work, we need to pick a little fight with how leases are (not) refreshed
(#74711) and we need to remember the ticks at which a proposal was first
inserted (despite potential reproposals).

A tripped breaker deploys a background probe which sends a
`ProbeRequest` (introduced in #72972) to determine when to heal; this is
roughly the case whenever replicating a command through Raft is possible
for the Replica, either by appending to the log as the Raft leader, or
by forwarding to the Raft leader. A tiny bit of special casing allows
requests made by the probe to bypass the breaker.

As of this PR, all of this is opt-in via the
`kv.replica_circuit_breakers.enabled` cluster setting while we determine
which of the many follow-up items to target for 22.1 (see #74705).  For
example, the breaker-sensitive context cancelation is a toy
implementation that comes with too much of a performance overhead (one
extra goroutine per request); #74707 will address that.

Other things not done here that we certainly want in the 22.1 release
are UI work (#74713) and metrics (#74505).

The release note is deferred to #74705 (where breakers are turned on).

Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg self-assigned this Jan 14, 2022
craig bot pushed a commit that referenced this issue Jan 24, 2022
74810: kvserver: remove unnecessary special casing of lease for LAI r=nvanbenschoten a=tbg

Touches #33007 (this simplifies #71806).

We were previously "not" assigning a LeaseAppliedIndex to lease
request proposals. But we were doing so very half-heartedly: instead of
always assigning zero, we assigned "whatever was assigned to the
previous command". This meant that in practice, the first lease on a
range would get zero, and any subsequent leases would get some nonzero
numbers. This wasn't particularly principled and raises eyebrows. For
testing convenience and clarity it is helpful to assign a zero MLAI
to leases, which is what this commit does.

We then turn to the special-casing in refreshProposalsLocked.

`(*Replica).refreshProposalsLocked` previously refused to repropose
commands that had a zero `AppliedLeaseIndex` (LAI). This was done on
grounds of needing to provide replay protection for these requests.
Upon inspection it became clear that a zero MLAI could only ever apply
to lease requests; they are the only proposals with an exception, and as
we saw above it would not usually result in a zero, but could (and
especially in testing, where most leases are the first lease on their
respective range).

We [discussed] this internally and concluded that leases can in fact be
reproposed, since they have their own replay protection mediated through
the lease's sequence number. This is great since as stated above, we did
in fact repropose (some) leases and have for years.

This commit removes the special casing.

Fixes #74711.

As suggested by @nvanbenschoten I ran the `kvserver` tests with this
diff that reproposes every lease request:

```diff
diff --git a/pkg/kv/kvserver/replica_proposal_buf.go b/pkg/kv/kvserver/replica_proposal_buf.go
index 95765d8efd..1b66745866 100644
--- a/pkg/kv/kvserver/replica_proposal_buf.go
+++ b/pkg/kv/kvserver/replica_proposal_buf.go
@@ -576,6 +576,11 @@ func (b *propBuf) FlushLockedWithRaftGroup(
 			ents = append(ents, raftpb.Entry{
 				Data: p.encodedCommand,
 			})
+			if p.Request.IsLeaseRequest() {
+				ents = append(ents, raftpb.Entry{
+					Data: p.encodedCommand,
+				})
+			}
 		}
 	}
 	if firstErr != nil {

```

This inspired the subsequent commits.

[discussed]: https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1641559185021100

Release note: None

74923: sql: pre-split hash sharded indexes ranges before DELETE_AND_WRITE_ONLY r=chengxiong-ruan a=chengxiong-ruan

Fixes #74558

Pre-split ranges on shard boundaries before SchemaChanger move
new hash sharded indexes from DELETE_ONLY to DELETE_AND_WRITE_ONLY
state.

Release note (bug fix): None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
@craig craig bot closed this as completed in 852a80c Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant