
storage: restarted node in need of snapshots can be wedged for long time #37906

Closed

tbg opened this issue May 29, 2019 · 47 comments · Fixed by #55148
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-done docs-known-limitation S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@tbg
Copy link
Member

tbg commented May 29, 2019

repro steps:

  • set up a three node tpcc-2k cluster (any warehouse count will do, but with more data the problem is more obvious)
  • run tpcc with --tolerate-errors and --wait=false
  • kill one node, but leave everything else running
  • wait a few minutes (maybe 10? more if it doesn't repro)

Restart the dead node and run a SELECT COUNT(*) on the tpcc tables through it

Chances are this will hang for minutes before returning. The reason is that none of the replicas will have been moved off that node while it was down (since there are only three nodes in the cluster), but now all of them will need Raft snapshots. There is no particular ordering in which these snapshots are requested. Some of our COUNT(*) requests (or even worse, liveness requests etc) may hit a replica on the local node in need of a snapshot. The lease request will hang until the snapshot is applied, which can take a long time.

We should do something here, like short-circuiting the lease request (returning a blank NotLeaseholderError) if we can detect with reasonable accuracy that the node is in need of a snapshot. This isn't trivial to detect on a follower (it would be easier on the leader, but alas). We could also give snapshots for system ranges higher priority to resolve the somewhat orthogonal problem that such hung requests on the system ranges can prevent the node from becoming live in the first place, which causes hard-to-diagnose problems.
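
For illustration, the short-circuit could be shaped roughly like the following sketch. Every type, field, and threshold here is invented (nothing in the codebase carries these names), since the hard part is precisely that a follower has no reliable signal:

```go
package sketch

import "errors"

// errNotLeaseholder stands in for a blank NotLeaseholderError that makes
// DistSender try another replica.
var errNotLeaseholder = errors.New("replica likely awaiting snapshot; not serving lease request")

// replicaState is the minimal (made-up) state the check needs.
type replicaState struct {
	appliedIndex   uint64 // highest log index applied locally
	leaderCommit   uint64 // last commit index heard from the leader, if any
	waitingForSnap bool   // set when Raft has signalled an outstanding snapshot
}

// maybeShortCircuitLease refuses to start a lease acquisition when the local
// replica is probably waiting on a snapshot. Detection on a follower is
// approximate, which is exactly the hard part noted above.
func maybeShortCircuitLease(s replicaState) error {
	const maxTolerableLag = 1000 // arbitrary illustration value
	if s.waitingForSnap || s.leaderCommit > s.appliedIndex+maxTolerableLag {
		return errNotLeaseholder
	}
	return nil // proceed with the normal lease acquisition path
}
```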

@tbg tbg added S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. A-kv-replication Relating to Raft, consensus, and coordination. A-kv-distribution Relating to rebalancing and leasing. labels May 29, 2019
@awoods187 awoods187 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jun 21, 2019
@andy-kimball
Copy link
Contributor

Can the leader detect if one or more of the Raft followers is behind, and if so, delay Raft log truncation until the dead node timeout expires (i.e. 5 minutes)?

@tbg
Copy link
Member Author

tbg commented Jun 26, 2019

Raft log truncation at the time of writing takes node liveness into account in the sense that it immediately becomes willing to "sacrifice" followers that are not live (i.e. after ~4-5s of downtime). As we discussed in the weekly, this amounts to a historical artifact (we had to be very careful not to let the Raft log grow large because it used to be included in snapshots) that isn't a concern any more; as is apparent from this issue, the current behavior is undesirable.

I suggest the following heuristics:

  • if the Raft log is excessively large (say > 2x larger than a snapshot would be, and at least a few MB - maybe much more; tuning this is TBD), sacrifice the down followers (i.e. truncate the log). Also do this if the store is low on disk. Otherwise,
  • if there is any down follower, don't truncate (regardless of how long the node has been down).

The second step may seem unexpectedly simple, but I think it checks out. Even if we sacrificed followers the moment their node is marked as dead, that node would still hold most of its replicas (removing them requires sending snapshots elsewhere, so that proceeds at a leisurely pace compared to how fast the log queue can truncate). A dead node's replicas will either be transferred away completely over time OR the node will restart, at which point log truncation becomes possible again (since no dead follower exists any more). The first criterion protects us from amassing a prohibitively long Raft log. At the same time, this decouples the log queue from the store-dead timeout, which is a nice property.
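
A minimal sketch of the heuristic above, with invented types and thresholds (the real raftLogQueue logic is more involved):

```go
package sketch

// follower is a made-up stand-in for Raft progress information.
type follower struct {
	live     bool   // currently responding to heartbeats
	matchIdx uint64 // highest log index known to be replicated to it
}

// truncationIndex returns the index up to which the log could be truncated
// under the two-step heuristic: only sacrifice laggards when the log is
// excessively large (the low-on-disk case is omitted here); otherwise never
// truncate past any follower, live or not.
func truncationIndex(logSizeBytes, approxSnapshotBytes int64, lastIndex uint64, followers []follower) uint64 {
	const minSacrificeBytes = 4 << 20 // "at least a few MB"; tuning TBD

	if logSizeBytes > 2*approxSnapshotBytes && logSizeBytes > minSacrificeBytes {
		// Step 1: log is excessively large; sacrifice the down followers by
		// only holding the log back for the live ones.
		trunc := lastIndex
		for _, f := range followers {
			if f.live && f.matchIdx < trunc {
				trunc = f.matchIdx
			}
		}
		return trunc
	}

	// Step 2: otherwise never truncate past any follower, live or not.
	trunc := lastIndex
	for _, f := range followers {
		if f.matchIdx < trunc {
			trunc = f.matchIdx
		}
	}
	return trunc
}
```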

@tbg
Copy link
Member Author

tbg commented Jun 26, 2019

The other problem we have here is that replicas in need of a snapshot translate the latency of queueing for and finally receiving the snapshot to any clients waiting on the replica. If we can detect the situation in which a replica is waiting for a snapshot (without resorting to a timeout), we're in a good place to prevent this from becoming a problem, because we can refuse the requests with a NotLeaseholderError, which should prompt DistSender to try another replica (eventually discovering and caching the true leaseholder). Ideally we can achieve this without digging into the Raft internals too much, though it's unclear what the Raft internals even provide. On a follower, we don't even receive the latest commit index, and we're also not really sure whether we will receive a snapshot or not. But there should be some way we can reliably figure this out.

However, one spicy problem is that when a request first arrives at such a range, the result will be a lease request proposed to Raft. This lease request's outcome will only be known after the snapshot arrives (and the replica can catch up all the way to the result of the lease request). As a result, if the replica actually succeeds in acquiring the lease (it may well succeed if no other lease has been active since the replica went down), no other replica can serve traffic until the stuck replica receives its snapshot, effectively stalling that range for that duration (which can be many hours depending on the amount of data that needs to be transferred). We could "probe" Raft (with a timeout) before issuing any lease requests and could build a small state machine around the probes that translates into a circuit breaker to solve the first problem above. "Probing" could either mean submitting a noop through Raft, or sending a one-off RPC to a quorum of replicas and noting the highest committed index, which the local replica would then be required to surpass before declaring the probe a success.
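
A rough sketch of the probe-before-lease idea, again with invented names; how the quorum commit index is actually obtained (noop proposal vs. one-off RPC) is left behind the probeFn callback:

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

var errProbeFailed = errors.New("raft probe failed or timed out; redirect via NotLeaseholderError")

// probeFn asks a quorum for its highest committed index, however that ends
// up being implemented (a noop proposal or a one-off RPC).
type probeFn func(ctx context.Context) (quorumCommit uint64, err error)

// probeBeforeLease succeeds only if the local replica has caught up to the
// commit index reported by a quorum; otherwise the caller should return a
// NotLeaseholderError instead of proposing a lease request.
func probeBeforeLease(ctx context.Context, localApplied uint64, probe probeFn, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	quorumCommit, err := probe(ctx)
	if err != nil {
		return errProbeFailed
	}
	if localApplied < quorumCommit {
		// We're behind; a lease request proposed now could wedge the range
		// until our snapshot arrives.
		return errProbeFailed
	}
	return nil
}
```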

I think to defuse this issue to 80% we need to fix the log truncation and come up with a relatively reliable indicator for returning a NotLeaseholderError, falling back to a timeout on the lease request to prevent the worst case of requests getting stuck indefinitely.

The situation in which the down node gets a lease is hopefully rare because it needing a snapshot sort of implies that there was another leader in the meantime that wrote more log entries.

Recording the outcome of what @danhhz and I just discussed: we're prioritizing getting something in place for 19.1.3, which is going to be cut in ~a week, so time is of the essence. We're conjecturing that we'll get the most bang for our buck by relaxing the raft log queue truncation heuristics to avoid sacrificing dead followers until the size of the raft log gets bigger than a few MB (a threshold that already exists today). At worst this will lead to few-MB Raft logs, which aren't a concern (for example, during IMPORT/RESTORE we regularly see dozens of MBs in the Raft log, though concentrated in only a handful of large entries). This will achieve the absence of snapshots for any but the "hot" ranges (those that write more than X MB during the downtime), where previously even moderately lukewarm ranges would require snapshots.

We'll get this landed on master ASAP and let it bake, and additionally get some more manual testing in using the repro steps at the top of this issue. Additionally, we'll return a NotLeaseholderError after a timeout on the lease request as a failsafe.

@bdarnell
Copy link
Contributor

Could the quota pool cause this to escalate into even more severe failures? If the node is up and responding to heartbeats but can't accept new log entries until it gets a snapshot, it will prevent the release of proposal quota and then block writes even on the live nodes.

@tbg
Copy link
Member Author

tbg commented Jun 26, 2019

@bdarnell the quota pool handles this case, though we'll need to check that it actually works as advertised, I don't think it ever mattered before.

// Only consider followers who are in advance of the quota base
// index. This prevents a follower from coming back online and
// preventing throughput to the range until it has caught up.
if progress.Match < r.mu.proposalQuotaBaseIndex {
	return
}
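
As a follow-up illustration (not the actual quota pool code; the function and its inputs are invented), the quoted check roughly amounts to the following when deciding how far quota can be released: followers still below the base index are skipped, so a replica stuck waiting for a snapshot shouldn't hold quota, and thus writes, hostage.

```go
package sketch

// releasableUpTo is a toy model of the behavior implied by the snippet
// above: followers whose Match index is still below the quota base index
// are ignored when computing the release point.
func releasableUpTo(quotaBaseIndex uint64, matchIndexes []uint64) uint64 {
	release := uint64(0)
	found := false
	for _, m := range matchIndexes {
		if m < quotaBaseIndex {
			continue // e.g. a follower that came back and needs a snapshot
		}
		if !found || m < release {
			release = m
			found = true
		}
	}
	if !found {
		return quotaBaseIndex // nobody ahead of the base index yet
	}
	return release
}
```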

@danhhz
Copy link
Contributor

danhhz commented Jun 27, 2019

I can reproduce this readily with a ~10m roachtest that's basically Tobi's original instructions. In addition to absurdly long response times on SQL queries, I sometimes can't even get a SQL connection to the just restarted node (getting user info?). (I'm also curious if this effect is the underlying cause of what I was seeing in #36879.)

I've looked at this today and I think we need two mitigations. First, when a node is down for a bit (~minutes), we make it less likely for it to need a snapshot when it is restarted. Second, we lessen the impact if it does need snapshots. I've focused first on things that are hopefully safe enough to backport into 19.1.3, to unblock our affected customer. 19.1.3 is cut a week from Monday, so I'm hoping to get fixes in soon so they can bake for a week and shake out any unforeseen consequences.

To avoid snapshots when a node comes back, we'll make the log truncation less aggressive. I think something like what's described in #37906 (comment) is a good long term solution, but to me it feels too risky to backport without more baking time than we have. A good compromise is to simply raise the threshold for what the raft log queue considers "recently active". (The raft log queue doesn't truncate when a follower is missing but has been recently active.) "Recently" currently has a threshold of 10s and shares the same MaxQuotaReplicaLivenessDuration constant as the quota pool. This sharing seems to have been introduced in #32439, but not for any particularly principled reason. A natural place to raise the threshold to is the server.time_until_store_dead cluster setting, which is the time after which we give up on a node and start transferring away its replicas.

Experimentally this seems to work decently well. There are still a small number of snapshots, but it's small enough that the node recovers quickly when it comes back up. We should investigate why there is still this small number of snapshots, but in the meantime the second mitigation (lessening the impact of snapshots on a node that's just come up) will cover the rest. Note that this only helps if the node is down for less than server.time_until_store_dead, which defaults to 5m. If I leave the node down for 7m, there is a rush of raft log truncations at 5m, and when it comes back up, it needs tons of snapshots. However, our customer can temporarily raise the server.time_until_store_dead setting while doing the rolling migration and lower it when the migration is done.

For the second mitigation, I think we need to have some sort of timeout on redirectOnOrAcquireLease, which will return NotLeaseholderError when it fires. redirectOnOrAcquireLease currently blocks on getting the snapshot, which can be hours if every replica is getting one and this replica is at the end of the line. Relying on this timeout will have an effect on tail latencies, but so far, it's the least risky thing I've thought of. Some more sophisticated ideas have been tossed around to do this without the tail latency, but it's nice to have a low-risk, obvious-how-it-works CYA here. I have yet to run experiments for this. The two big open questions are what threshold to use for the timeout and what happens if the timeout fires on the leaseholder.

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, when we replicated the raft log truncation, we'd have to include
it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10
seconds (think a couple minutes). It might come back up to find that it
needs a snapshot for most ranges. We rate limit snapshots fairly
aggressively because they've been disruptive in the past, but this means
that it could potentially take hours for a node to recover from a 2
minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent
activity to `server.time_until_store_dead`, which is already the
threshold where we give up on a down node and start transferring away
its replicas. It defaults to 5 minutes, which is still short, but it's
more reasonable than 10 seconds. Crucially, it also is a cluster
setting, so it can be overridden.

We'd like to move even further in the direction of leniency about raft
log truncation while a replica is missing, but this is a compromise
that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
@tbg
Copy link
Member Author

tbg commented Jun 27, 2019

I posted some discussion of the log truncation approach on the review in #38484. The TL;DR is that the most intuitive safe code change seems to not care about the 5 minute distinction, which is nice. We leave nodes alone when they're up, and when they're not up we also leave them alone unless the log is large, in which case we sacrifice them. (Today the "leave them alone unless" isn't there, we go straight to sacrificing by pretending the followers aren't even there).

For the timeout in redirectOnOrAcquireLease, I think we'll want to start with something conservative (i.e. at least a few seconds; 5?). This is owed to the fact that it'll go into a patch release, and while a shorter timeout would solve the problem "better", it may also "do things" in unexpected places.

For implementing this, do you really have to do anything more than change this code

case <-slowTimer.C:
	slowTimer.Read = true
	log.Warningf(ctx, "have been waiting %s attempting to acquire lease",
		base.SlowRequestThreshold)
	r.store.metrics.SlowLeaseRequests.Inc(1)

to return early with the NotLeaseholderError, plus lowering the duration until that branch fires?
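
As a stand-alone sketch of the change being asked about (channel and error names are invented; the real branch lives inside redirectOnOrAcquireLease):

```go
package sketch

import (
	"errors"
	"time"
)

// errNotLeaseholder stands in for the real NotLeaseholderError.
var errNotLeaseholder = errors.New("lease acquisition too slow; try another replica")

// waitForLease illustrates the idea: when the slow timer fires, return a
// redirect instead of only logging a warning and continuing to wait.
func waitForLease(leaseDone <-chan error, leaseTimeout time.Duration) error {
	slowTimer := time.NewTimer(leaseTimeout)
	defer slowTimer.Stop()
	select {
	case err := <-leaseDone:
		return err
	case <-slowTimer.C:
		// Previously: warn (and bump a metric) but keep waiting.
		return errNotLeaseholder
	}
}
```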

@bdarnell
Copy link
Contributor

A long-term solution to the redirectOnOrAcquireLease problem would be to replace raft-level proposal forwarding with an RPC that could return the index of the new proposal. Then we could know to stop waiting if our index is way behind the index of our proposed lease. (or we could forward the proposal with a maxIndex so it could be dropped if the leader is ahead). Replacing the proposal flow is also contemplated in #21849.
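
For illustration only, such an RPC might be shaped like this; all names and fields are invented, and nothing here is an existing CockroachDB or etcd/raft API:

```go
package sketch

import "errors"

// forwardProposalRequest/Response are invented message shapes for the idea
// of replacing raft-level proposal forwarding with an explicit RPC.
type forwardProposalRequest struct {
	command  []byte
	maxIndex uint64 // leader drops the proposal if its log is already past this
}

type forwardProposalResponse struct {
	proposedAt uint64 // index the leader assigned to the proposal
}

var errDropped = errors.New("proposal dropped: leader is ahead of maxIndex")

// leaderHandleForward shows the leader side: either drop the proposal or
// report the index it was proposed at, so the follower can tell when it is
// hopelessly far behind its own (lease) proposal and stop waiting.
func leaderHandleForward(req forwardProposalRequest, leaderLastIndex uint64, propose func([]byte) uint64) (forwardProposalResponse, error) {
	if req.maxIndex != 0 && leaderLastIndex > req.maxIndex {
		return forwardProposalResponse{}, errDropped
	}
	return forwardProposalResponse{proposedAt: propose(req.command)}, nil
}
```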

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 27, 2019
Previously, we'd hold off on truncating the raft log if a replica was
missing but contactable in the last 10 seconds. This meant that if a
node was down for *more* than 10 seconds, there was a good chance that
we'd truncate logs for some of its replicas (especially for its
high-traffic ones) and it would need snapshots for them when it came
back up.

This was for two reasons. First, we've historically assumed that it's
cheaper to catch a far-behind node up with a snapshot than with entries.
Second, snapshots historically had to include the Raft log which implied
a need to keep the size of the Raft log tightly controlled due to being
pulled into memory at the snapshot receiver, but that's changed
recently.

The problem is when a node is down for longer than 10 seconds but
shorter than the time it takes to upreplicate all of its ranges onto new
nodes. It might come back up to find that it needs a snapshot for most
ranges. We rate limit snapshots fairly aggressively because they've been
disruptive in the past, but this means that it could potentially take
hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there wasn't a second compounding
issue. A recently restarted node has a cold leaseholder cache. When it
gets traffic for one of its replicas, it first tries itself as the
leaseholder (maybe it will get lucky and won't need the network hop).
Unfortunately, if the replica needs a snapshot, this decision currently
blocks on it. This means that requests sent to the recently started node
could block for as long as the heavily-throttled snapshots take, hours
or even days.

Short outages of more than 10 seconds are reasonably common with routine
maintenance (rolling to a new version, swapping hardware, etc), so it's
likely that customers will hit this (and one did).

This commit avoids truncating the log past any follower's position when
all replicas have recently been active (the quota pool keeps it from
growing without bound in this case). If at least one replica hasn't
recently been active, it holds off any truncation until the log reaches
a size threshold.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than
`server.time_until_store_dead` now recover more quickly when they
rejoin.
@tbg
Copy link
Member Author

tbg commented Jun 27, 2019

I proposed something similar in #37906 (comment), but I like the idea of making this about Raft proposal forwarding. If the follower includes its commit index in the proposal, the leader will be in a good position to make a reasonable call via heuristics, though this necessarily involves some magic values, which would be good to avoid. Instead, the leader could refuse proposals unless the follower is in StateReplicate. There are lots of details to figure out with this, but it seems promising.
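
Sketched leader-side, the StateReplicate idea could look like this; the state names mirror etcd/raft's follower states, everything else is invented:

```go
package sketch

import "errors"

// followerState mimics etcd/raft's per-follower states on the leader.
type followerState int

const (
	stateProbe followerState = iota
	stateReplicate
	stateSnapshot
)

var errRefuseLease = errors.New("refusing forwarded lease request: follower not in StateReplicate")

// admitForwardedProposal is the leader-side gate: ordinary proposals pass,
// but a lease request from a follower that is probing or waiting for a
// snapshot is refused, since that follower couldn't learn the outcome until
// it catches up anyway.
func admitForwardedProposal(isLeaseRequest bool, fs followerState) error {
	if isLeaseRequest && fs != stateReplicate {
		return errRefuseLease
	}
	return nil
}
```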

danhhz added a commit to danhhz/cockroach that referenced this issue Jun 28, 2019
tbg pushed a commit to danhhz/cockroach that referenced this issue Jul 2, 2019
craig bot pushed a commit that referenced this issue Jul 2, 2019
38484: storage: truncate raft log less aggressively when replica is missing r=tbg a=danhhz

Partial mitigation for #37906

Potentially also helps with #36879

Release note (bug fix): Nodes that have been down now recover more
quickly when they rejoin, assuming they weren't down for much more
than the value of the `server.time_until_store_dead` cluster setting
(which defaults to 5 minutes).

Co-authored-by: Daniel Harrison <[email protected]>
Co-authored-by: Tobias Schottdorf <[email protected]>
tbg pushed a commit to tbg/cockroach that referenced this issue Jul 3, 2019
@tbg
Copy link
Member Author

tbg commented Jul 3, 2019

A first mitigation has landed on master and is very likely to be included in the next 19.1.x patch release. While this first mitigation has solved the problem in our testing and should greatly reduce it in most real-world scenarios too, we are working on additional safeguards related to this issue:

  • prevent lease requests from hanging
  • prevent leadership from being attained by a node that is far behind.

@bdarnell
Copy link
Contributor

Should we backport #38484 to 2.1 as well?

@danhhz
Copy link
Contributor

danhhz commented Jul 15, 2019

Does 2.1 have the work that removes the raft log from snapshots? If not, we might want to do 2.1-specific stability testing of that PR before backporting.

@tbg
Copy link
Member Author

tbg commented Jul 16, 2019

No, that's 19.1. With 2.1 we run a higher risk of seeing that snapshot refusal loop again (I think snapshots with 8mb are refused, and truncation would kick in at 4mb). I believe the refusal code was often triggered by SSTs, which clock in at ~22mb each from what I remember, so changing the truncation heuristics likely wouldn't make a big difference for the worse (but it should make one for the better from what we can tell). But we'd need to do our due diligence there: I'm thinking the new roachtest run 5x, and a reasonably sized import (1+ hour) that passes reliably today (are there any of them?), should continue to pass after the cherry-pick.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Dec 10, 2020
This patch aims to improve the behavior in scenarios where a follower
replica is behind, unaware of the latest lease, and it tries to acquire
a lease in its ignorant state. That lease acquisition request is bound
to fail (because the lease that it's based on is stale), but while it
fails (or, rather, until the behind replica finds out that it failed)
local requests are blocked. This blocking can last for a very long time
in situations where a snapshot is needed to catch up the follower, and
the snapshot is queued up behind many other snapshots (e.g. after a node
has been down for a while and gets restarted).

This patch tries an opinionated solution: never allow followers to
acquire leases. If there is a leader, it's a better idea for the leader
to acquire the lease. The leader might have a lease anyway or, even if
it doesn't, having the leader acquire it saves a leadership transfer
(leadership follows the lease).
We change the proposal path to recognize lease requests and reject them
early if the current replica is a follower and the leader is known. The
rejection points to the leader, which causes the request that triggered
the lease acquisition to make its way to the leader and attempt to
acquire a lease over there.

Fixes cockroachdb#37906

As described in cockroachdb#37906, the badness caused by requests blocking behind a
doomed lease acq request could be reproduced with a 100-warehouse tpcc
workload (--no-wait) on a 3 node cluster. Turning off a node for 10
minutes and then turning it back on would result in the cluster being
wedged for a few minutes until all the snapshots are transferred. I've
verified that this patch fixes it.

Release note (bug fix): A bug causing queries sent to a
freshly-restarted node to sometimes hang for a long time while the node
catches up with replication has been fixed.
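
A stripped-down illustration of the rejection described in this commit message; the real change (cockroachdb#55148) lives on the proposal path and points the rejection at the leader via the proper error type, while everything below is a stand-in:

```go
package sketch

import "fmt"

type replicaID uint64

// rejectLeaseIfFollower rejects a lease-request proposal on a replica that
// knows of a raft leader other than itself, pointing the caller at that
// leader so the original request is retried there.
func rejectLeaseIfFollower(isLeaseRequest bool, selfID, leaderID replicaID) error {
	if !isLeaseRequest {
		return nil // only lease acquisitions are special-cased
	}
	if leaderID == 0 || leaderID == selfID {
		return nil // no known leader, or we are the leader: propose as usual
	}
	return fmt.Errorf("lease request rejected: retry at replica %d (raft leader)", leaderID)
}
```
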
andreimatei added a commit to andreimatei/cockroach that referenced this issue Jan 6, 2021
This reverts commit 8f98ade.

This is a revert of the main commit in cockroachdb#57789 (which was a backport of cockroachdb#55148).
It turns out that the commit in question introduced a deadlock: if
replicas that are not the leader refuse to take the lease (because they're
not the leader) and the leader is a VOTER_INCOMING, VOTER_DEMOTING or
VOTER_OUTGOING replica which also refuses to take the lease because it's
not a VOTER_FULL [0] => deadlock.
This deadlock was found in cockroachdb#57798.

The patch will return with some massaging.

This revert is original work on the 20.2 branch; I'm not reverting it on
master in the hope of just fixing the issue.

Touches cockroachdb#37906

Release note: This voids the previous release note reading "A bug
causing queries sent to a freshly-restarted node to sometimes hang for a
long time while the node catches up with replication has been fixed."
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 1, 2022
…iveness

Fixes cockroachdb#84655.
Related to cockroachdb#49220.

This commit extends the logic introduced in 8aa1c14 to simultaneously campaign
for Raft leadership when rejecting a lease request on a Raft follower that
observes that the current Raft leader is not live according to node liveness.
These kinds of cases are often associated with asymmetric network partitions.

In such cases, the Raft leader will be unable to acquire an epoch-based lease
until it heartbeats its liveness record. As a result, the range can only regain
availability if a different replica acquires the lease. However, the protection
added in 8aa1c14 prevents followers from acquiring leases to protect against a
different form of unavailability.

After this commit, the followers will attempt to acquire Raft leadership when
detecting such cases by campaigning. This allows these ranges to recover
availability once Raft leadership moves off the partitioned Raft leader to one
of the followers that can reach node liveness and can subsequently acquire the
lease.

Campaigning for Raft leadership is safer than blindly allowing the lease request
to be proposed (through a redirected proposal). This is because the follower may
be arbitrarily far behind on its Raft log and acquiring the lease in such cases
could cause unavailability (the kind we saw in cockroachdb#37906). By instead calling a
Raft pre-vote election, the follower can determine whether it is behind on its
log without risking disruption. If so, we don't want it to acquire the lease —
one of the other followers that is caught up on its log can. If not, it will
eventually become leader and can proceed with a future attempt to acquire the
lease.

The commit adds a test that reproduces the failure mode described in cockroachdb#84655. It
creates an asymmetric network partition scenario that looks like:
```
        [0]       raft leader / initial leaseholder
         ^
        / \
       /   \
      v     v
    [1]<--->[2]   raft followers
      ^     ^
       \   /
        \ /
         v
        [3]       liveness range
```
It then waits for the raft leader's lease to expire and demonstrates that one of
the raft followers will now call a Raft election, which allows it to safely grab
Raft leadership, acquire the lease, and recover availability. Without the change,
the test failed.

Release justification: None. Too risky for the stability period. Potential
backport candidate after sufficient baking on master.

Release note (bug fix): A bug causing ranges to remain without a leaseholder in
cases of asymmetric network partitions has been resolved.
craig bot pushed a commit that referenced this issue Nov 1, 2022
87244: kv: campaign on rejected lease request when leader not live in node liveness r=erikgrinaker a=nvanbenschoten

88101: kv,sql: simplify the Txn API by removing 2 cleanup functions r=lidorcarmel a=lidorcarmel

Txn.CleanupOnError() basically does a rollback, and in addition takes an error only for the purpose of logging it.

Txn.CommitOrCleanup() tries to commit and if unsuccessful it tries a rollback. The error from the rollback is logged but not returned, the error from the commit is returned.

Removing these 2 functions means that the caller should call Commit and Rollback directly when needed, and handle the returned errors. For example, sql may need to log errors to a different channel from the one used by Txn, and tests may want to fail when a Rollback fails unexpectedly. This PR removes those functions.

Release note: None
Epic: None

91011: storage: adjust range tombstone language to encourage caution r=nicktrav a=jbowens

Adjust the MVCC range tombstone cluster setting description to highlight its experimental nature and to be appropriately cautious about the consequence of enabling range tombstones.

Cockroach 22.2.0 will ship with the cluster setting and likely at least one bug that may induce corruption if the setting is enabled (#90948).

Epic: None
Release note: None

Close #91001.

91077: sql/schemachange: revert TableZoneConfig field ID r=fqazi a=fqazi

Fixes: #91053

Backwards compatibility with 22.2 was accidentally regressed when the field ID for table_zone_config was changed for older declarative schema changer states. To address this, this patch reverts the field ID back to the same value as in 22.2.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Lidor Carmel <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 26, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 26, 2023
nathanstilwell pushed a commit that referenced this issue Feb 2, 2023