storage: don't truncate when probing active followers are present #32439

Merged (2 commits) on Nov 20, 2018

Conversation

@tbg tbg (Member) commented Nov 17, 2018

I need to clean up the first commit, putting the methods on replica.mu instead.


This is yet another root cause of Raft snapshots during import/restore. We
create a number of new ranges which may experience Raft leadership changes
early on and ingest SSTs at the same time. This leads to a situation in
which we have followers whose acknowledged log index isn't known, which in
turn leads to bad truncation decisions. Respect such followers (which are
usually only in this state for very brief amounts of time).

Release note: None
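
To make the failure mode concrete, here is a hedged sketch (simplified stand-in types and a hypothetical helper, not the queue's actual code) of the kind of truncation decision this fix is about: a recently active follower that is still probing, i.e. whose acknowledged log index is unknown, caps the truncation index instead of being ignored. The reason strings mirror the constants discussed in the review below.

```go
package raftlogexample

// followerState is a simplified stand-in for the per-follower Raft progress
// consulted by the truncation queue.
type followerState struct {
	Match        uint64 // highest log index acknowledged by the follower
	Probing      bool   // leader doesn't yet know the follower's Match
	RecentActive bool   // we heard from the follower recently
}

// chooseTruncateIndex picks the log index up to which we may truncate, along
// with a reason string describing which condition constrained the choice.
func chooseTruncateIndex(quorumIndex uint64, followers []followerState) (uint64, string) {
	index, reason := quorumIndex, "quorum"
	for _, f := range followers {
		if !f.RecentActive {
			// Inactive followers don't hold up truncation; they'll be caught
			// up via snapshot if they come back much later.
			continue
		}
		if f.Probing {
			// The follower's acknowledged index is unknown, so be
			// conservative and (effectively) don't truncate yet.
			if f.Match < index {
				index, reason = f.Match, "probing follower"
			}
			continue
		}
		if f.Match < index {
			index, reason = f.Match, "followers"
		}
	}
	return index, reason
}
```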

@tbg tbg requested a review from a team November 17, 2018 00:04
@cockroach-teamcity (Member): This change is Reviewable

@petermattis petermattis (Collaborator) left a comment

:lgtm:

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raft_log_queue.go, line 160 at r2 (raw file):

		// (including only resetting when the follower resumes replicating).
		pr.PendingSnapshot = pendingSnapshotIndex
	}

I think it is worthwhile to unit test this code as well. Perhaps it should be pulled out into a separate routine to ease the testing (i.e. something that takes a Raft status, descriptor and lastUpdateTimes map and returns the updated status).
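
Such a routine could look roughly like the sketch below (simplified stand-in types and a hypothetical name; the actual helper in this PR is the `updateRaftProgressFromActivity` call site quoted later in this review at r4):

```go
package raftlogexample

import "time"

// progress mirrors the raft.Progress fields relevant here.
type progress struct {
	RecentActive bool
}

// replicaDesc stands in for the replica descriptors in the range descriptor.
type replicaDesc struct {
	ReplicaID uint64
}

// updateProgressFromActivity overrides RecentActive with our own activity
// check: a follower counts as recently active only if we heard from it within
// activityTimeout. Map values are copies, so the updated struct must be
// written back into the map (the very bug discussed further down).
func updateProgressFromActivity(
	prs map[uint64]progress,
	replicas []replicaDesc,
	lastUpdate map[uint64]time.Time,
	now time.Time,
	activityTimeout time.Duration,
) {
	for _, desc := range replicas {
		pr, ok := prs[desc.ReplicaID]
		if !ok {
			continue
		}
		t, hasUpdate := lastUpdate[desc.ReplicaID]
		pr.RecentActive = hasUpdate && now.Sub(t) <= activityTimeout
		prs[desc.ReplicaID] = pr // write the copy back
	}
}
```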


pkg/storage/raft_log_queue.go, line 179 at r2 (raw file):

	truncatableIndexChosenViaQuorumIndex     = "quorum"
	truncatableIndexChosenViaFollowers       = "followers"
	truncatableIndexChosenViaProbingFollower = "probing-follower"

s/probing-follower/probing follower/g for consistency with the other two-word reasons (e.g. pending snapshot, first index, etc.).

@tbg tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raft_log_queue.go, line 160 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it is worthwhile to unit test this code as well. Perhaps it should be pulled out into a separate routine to ease the testing (i.e. something that takes a Raft status, descriptor and lastUpdateTimes map and returns the updated status).

Thank you for making me write that unit test. I wasn't writing `pr` back into `prs` above, so in effect all that fancy code wasn't doing anything (`RecentActive` would've always been true).
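
For readers following along, a minimal standalone illustration of that gotcha (not the PR's actual code): Go stores map values by copy, so mutating the loop variable changes nothing unless the value is written back.

```go
package main

import "fmt"

type progress struct{ RecentActive bool }

func main() {
	prs := map[uint64]progress{1: {RecentActive: true}}

	// Buggy pattern: pr is a copy of the map value, so this mutation is lost.
	for _, pr := range prs {
		pr.RecentActive = false
	}
	fmt.Println(prs[1].RecentActive) // still true: the update was lost

	// Fixed pattern: write the modified copy back into the map.
	for id, pr := range prs {
		pr.RecentActive = false
		prs[id] = pr
	}
	fmt.Println(prs[1].RecentActive) // now false
}
```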


pkg/storage/raft_log_queue.go, line 179 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/probing-follower/probing follower/g for consistency with the other two-word reasons (e.g. pending snapshot, first index, etc.).

Done.

@petermattis petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raft_log_queue.go, line 160 at r2 (raw file):

Previously, tbg (Tobias Schottdorf) wrote…

Thank you for making me write that unit test. I wasn't writing `pr` back into `prs` above, so in effect all that fancy code wasn't doing anything (`RecentActive` would've always been true).

Glad to be of service.


pkg/storage/raft_log_queue.go, line 146 at r4 (raw file):

	// true since we don't use CheckQuorum) with our own activity check.
	r.mu.RLock()
	updateRaftProgressFromActivity(ctx, raftStatus.Progress, r.descRLocked().Replicas, r.mu.lastUpdateTimes, now)

Nit: wrap this line at 100 columns.

Commit message for the first commit (moving the `lastUpdateTimes` helpers onto `replica.mu`):

This is a more natural place to put them a) because there's no reason to
have them sit on Replica and b) because this will allow me to write
better code when I query this map in the log truncation code.

Release note: None
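
As a rough sketch (hypothetical names and shapes, not the PR's exact code) of what "putting the methods on `replica.mu`" means in practice, the map and its helpers live next to the other mutex-protected Raft state, so the truncation code can consult it under the same lock:

```go
package replicaexample

import (
	"sync"
	"time"
)

// lastUpdateTimesMap tracks, per replica ID, when we last heard from that
// follower.
type lastUpdateTimesMap map[uint64]time.Time

// update records activity from the given follower.
func (m lastUpdateTimesMap) update(replicaID uint64, now time.Time) {
	if m == nil {
		return
	}
	m[replicaID] = now
}

// isFollowerActive reports whether we heard from the follower recently enough.
func (m lastUpdateTimesMap) isFollowerActive(
	replicaID uint64, now time.Time, timeout time.Duration,
) bool {
	t, ok := m[replicaID]
	return ok && now.Sub(t) <= timeout
}

// replicaMu sketches the mutex-protected state: keeping the map here means
// the log truncation code reads it under the same lock as the rest of the
// Raft state it needs.
type replicaMu struct {
	sync.RWMutex
	lastUpdateTimes lastUpdateTimesMap
}
```
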
@tbg tbg (Member, Author) left a comment

PTAL. I had to make some more minor adjustments, including one that prompted an upstream PR (with a workaround for the time being).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raft_log_queue.go, line 146 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: wrap this line at 100 columns.

Done (tests are failing left and right now anyway), but generally, I'd prefer that we either put this into crlfmt or leave these alone in reviews. Do you know if we've tried and failed?

@petermattis petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raft_log_queue.go, line 146 at r4 (raw file):

Previously, tbg (Tobias Schottdorf) wrote…

Done (tests are failing left and right now anyway), but generally, I'd prefer that we either put this into crlfmt or leave these alone in reviews. Do you know if we've tried and failed?

I don't know if we've tried. The primary reason I care about the line length limit is that it makes reviewing in Reviewable easier.


pkg/storage/replica.go, line 456 at r6 (raw file):

		// TODO(tschottdorf): keeping a map on each replica seems to be
		// overdoing it. We should map the replicaID to a NodeID and then use
		// node liveness (or any sensible measure of the peer being around).

The motivation for tracking activity per-range is that a node could be active while a replica on that node is stuck for some reason. Not sure if this is sufficient motivation, but there it is. Would be good to expand this TODO with this additional detail so that we don't forget about it whenever the TODO is addressed.


pkg/storage/replica.go, line 5260 at r6 (raw file):

	fromReplica, fromErr := r.getReplicaDescriptorByIDRLocked(roachpb.ReplicaID(msg.From), r.mu.lastToReplica)
	toReplica, toErr := r.getReplicaDescriptorByIDRLocked(roachpb.ReplicaID(msg.To), r.mu.lastFromReplica)
	if isHB := msg.Type == raftpb.MsgHeartbeat; isHB {

The isHB variable is inhibiting readability. I'm guessing it is detritus left over from earlier work.


pkg/storage/replica.go, line 5267 at r6 (raw file):

		// them into the local Raft group. The leader won't hit that path, so we
		// update it whenever it sends a heartbeat. In effect, this makes sure
		// it always sees itself as alive.

I wonder if it would be better to special-case the leader when using lastUpdateTimes. I don't have the usage paged-in, so feel free to ignore this idea if it is misguided.

@tbg tbg (Member, Author) left a comment

TFTR! I'm going to whack this with a few testraces on my worker to make up for the fact that TC won't really testrace many potentially affected tests in CI.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 456 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The motivation for tracking activity per-range is that a node could be active while a replica on that node is stuck for some reason. Not sure if this is sufficient motivation, but there it is. Would be good to expand this TODO with this additional detail so that we don't forget about it whenever the TODO is addressed.

Right, but then we're already screwed because node liveness doesn't capture any of that either. Updating the comment regardless.


pkg/storage/replica.go, line 5260 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The isHB variable is inhibiting readability. I'm guessing it is detritus left over from earlier work.

Done.


pkg/storage/replica.go, line 5267 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if it would be better to special-case the leader when using lastUpdateTimes. I don't have the usage paged-in, so feel free to ignore this idea if it is misguided.

I considered that, but I think it's ultimately a bad idea to special-case the leader, because every caller would need to do the right thing with it. For example, the quota pool is completely oblivious to these changes and would omit the leader in its computations. That is fine, but was probably not intentional. There are various other places in the code which compute ad-hoc quorum indexes. I want them to avoid this same kind of pitfall as they get made smarter in the future.
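
To illustrate the pitfall (a hedged sketch, not the quota pool's or truncation queue's actual code): any caller that filters replicas through `lastUpdateTimes` would silently skip the leader if the leader never appeared in that map. Keeping the leader's entry fresh via its own heartbeats lets such callers treat every replica uniformly.

```go
package raftlogexample

import "time"

// activeMatchIndexes gathers the acknowledged log indexes of replicas we
// consider recently active, e.g. as input to an ad-hoc quorum computation.
// If the leader were special-cased out of lastUpdate, it would silently drop
// out of computations like this one even though it is trivially "active".
func activeMatchIndexes(
	matchIndex map[uint64]uint64, // per-replica acknowledged log index
	lastUpdate map[uint64]time.Time,
	now time.Time,
	timeout time.Duration,
) []uint64 {
	var idxs []uint64
	for id, idx := range matchIndex {
		t, ok := lastUpdate[id]
		if !ok || now.Sub(t) > timeout {
			continue // treated as inactive, including a missing leader entry
		}
		idxs = append(idxs, idx)
	}
	return idxs
}
```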

Commit message (identical to the PR description above).

@petermattis petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@tbg tbg (Member, Author) commented Nov 20, 2018

Wow, turns out running the logictests under testrace is a suicidal idea.

bors r=petermattis

craig bot pushed a commit that referenced this pull request Nov 20, 2018
32439: storage: don't truncate when probing active followers are present r=petermattis a=tbg

I need to clean up the first commit, putting the methods on `replica.mu` instead.

----

This is yet another root cause of Raft snapshots during import/restore. We
create a number of new ranges which may experience Raft leadership changes
early on and ingest SSTs at the same time. This leads to a situation in
which we have followers whose acknowledged log index isn't known, which in
turn leads to bad truncation decisions. Respect such followers (which are
usually only in this state for very brief amounts of time).

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig craig bot (Contributor) commented Nov 20, 2018

Build succeeded
