storage: truncate raft log less aggressively when replica is missing
Previously, we'd hold off on truncating the raft log if a replica was missing but contactable in the last 10 seconds. This meant that if a node was down for *more* than 10 seconds, there was a good chance that we'd truncate logs for some of its replicas (especially its high-traffic ones) and it would need snapshots for them when it came back up.

This was for two reasons. First, we've historically assumed that it's cheaper to catch a far-behind node up with a snapshot than with entries. Second, when we replicated the raft log truncation, we'd have to include it in snapshots anyway, but that's changed recently.

The problem is when a node is down for a short time but more than 10 seconds (think a couple of minutes). It might come back up to find that it needs a snapshot for most ranges. We rate limit snapshots fairly aggressively because they've been disruptive in the past, but this means that it could potentially take hours for a node to recover from a 2 minute outage.

This would be merely unfortunate if there weren't a second compounding issue. A recently restarted node has a cold leaseholder cache. When it gets traffic for one of its replicas, it first tries itself as the leaseholder (maybe it will get lucky and won't need the network hop). Unfortunately, if the replica needs a snapshot, this decision currently blocks on it. This means that requests sent to the recently started node could block for as long as the heavily-throttled snapshots take, hours or even days.

Short outages of more than 10 seconds are reasonably common with routine maintenance (rolling to a new version, swapping hardware, etc.), so it's likely that customers will hit this (and one did).

This commit ties the threshold that the raft log queue uses for recent activity to `server.time_until_store_dead`, which is already the threshold at which we give up on a down node and start transferring away its replicas. It defaults to 5 minutes, which is still short, but it's more reasonable than 10 seconds. Crucially, it is also a cluster setting, so it can be overridden.

We'd like to move even further in the direction of leniency about raft log truncation while a replica is missing, but this is a compromise that's intended to be less risky to backport to 19.1.x.

Partial mitigation for cockroachdb#37906

Potentially also helps with cockroachdb#36879

Release note (bug fix): Nodes that have been down for less than `server.time_until_store_dead` now recover more quickly when they rejoin.
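To make the mechanism concrete, here is a minimal, self-contained Go sketch of the behavioral change, with the cluster setting modeled as a plain value. All identifiers below are illustrative assumptions; the actual commit wires the threshold through CockroachDB's cluster settings machinery rather than a plain struct.

package main

import (
	"fmt"
	"time"
)

// clusterSettings models the one knob this sketch cares about. In
// CockroachDB, server.time_until_store_dead is a cluster setting that
// defaults to 5 minutes and can be overridden by operators.
type clusterSettings struct {
	timeUntilStoreDead time.Duration
}

// recentActiveThreshold is how recently a follower must have been
// contactable for the raft log queue to hold back truncation on its
// behalf. Before this change it was a hard-coded 10 seconds; after,
// it follows the cluster setting.
func recentActiveThreshold(s clusterSettings) time.Duration {
	return s.timeUntilStoreDead
}

// canTruncatePast reports whether the queue may truncate log entries
// still needed by a follower that was last heard from at lastActive.
func canTruncatePast(s clusterSettings, now, lastActive time.Time) bool {
	return now.Sub(lastActive) > recentActiveThreshold(s)
}

func main() {
	s := clusterSettings{timeUntilStoreDead: 5 * time.Minute}
	now := time.Now()
	// A node that went down 2 minutes ago is still inside the window, so
	// its entries are retained and it can catch up without a snapshot.
	fmt.Println(canTruncatePast(s, now, now.Add(-2*time.Minute))) // false
	// After 10 minutes the store is presumed dead; truncation may proceed.
	fmt.Println(canTruncatePast(s, now, now.Add(-10*time.Minute))) // true
}

Reusing `server.time_until_store_dead` as the truncation window means the raft log queue doesn't give up on a follower's log before the cluster gives up on the store itself and starts transferring its replicas away.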
Showing 8 changed files with 96 additions and 11 deletions. The listing below is the new roachtest added by this commit:
// Copyright 2019 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

func runRestart(ctx context.Context, t *test, c *cluster, downDuration time.Duration) {
	crdbNodes := c.Range(1, c.nodes)
	workloadNode := c.Node(1)
	const restartNode = 3

	t.Status("installing cockroach")
	c.Put(ctx, cockroach, "./cockroach", crdbNodes)
	c.Start(ctx, t, crdbNodes, startArgs(`--args=--vmodule=raft_log_queue=3`))

	// We don't really need tpcc; we just need a good amount of traffic and a
	// good amount of data.
	t.Status("importing tpcc fixture")
	c.Run(ctx, workloadNode,
		"./cockroach workload fixtures import tpcc --warehouses=100 --fks=false --checks=false")

	// Stop a node.
	c.Stop(ctx, c.Node(restartNode))

	// Wait for between 10s and `server.time_until_store_dead` while sending
	// traffic to one of the nodes that are not down. This used to cause lots
	// of raft log truncation, which caused node 3 to need lots of snapshots
	// when it came back up.
	c.Run(ctx, workloadNode, "./cockroach workload run tpcc --warehouses=100 "+
		fmt.Sprintf("--tolerate-errors --wait=false --duration=%s", downDuration))

	// Bring it back up and make sure it can serve a query within a reasonable
	// time limit. For now, less time than it was down for.
	c.Start(ctx, t, c.Node(restartNode))
	start := timeutil.Now()
	restartNodeDB := c.Conn(ctx, restartNode)
	if _, err := restartNodeDB.Exec(`SELECT count(*) FROM tpcc.order_line`); err != nil {
		t.Fatal(err)
	}
	if took := timeutil.Since(start); took > downDuration {
		t.Fatalf(`expected to recover within %s, took %s`, downDuration, took)
	} else {
		c.l.Printf(`connecting and query finished in %s`, took)
	}
}

func registerRestart(r *registry) {
	r.Add(testSpec{
		Name:    "restart/down-for-2m",
		Cluster: makeClusterSpec(3),
		// The "cockroach workload" subcommand is only available in 19.1+.
		MinVersion: "v19.1.0",
		Run: func(ctx context.Context, t *test, c *cluster) {
			runRestart(ctx, t, c, 2*time.Minute)
		},
	})
}
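Since the new window is tied to the existing `server.time_until_store_dead` cluster setting, an operator expecting a longer planned outage can raise it ahead of time. A hypothetical example using Go's standard database/sql package with the lib/pq driver (the connection string and the 15m value are placeholders, not part of this commit):

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol.
)

func main() {
	// Placeholder connection string; adjust host, port, and certs for your
	// cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raising the dead-store threshold also widens the raft log truncation
	// window introduced by this commit.
	if _, err := db.Exec(
		`SET CLUSTER SETTING server.time_until_store_dead = '15m'`,
	); err != nil {
		log.Fatal(err)
	}
}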