storage: stop modifying requests and returning responses from TxnWaitQueue #43383
Conversation
Force-pushed from d76a516 to ad0ed45.
I quite like this change. It simplifies the role of the txnwait queue nicely.
I ran YCSB A at a few concurrency levels and saw no difference in performance.
That's encouraging enough for me.
Reviewed 4 of 4 files at r1, 4 of 4 files at r2, 5 of 5 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/batcheval/cmd_push_txn.go, line 128 at r1 (raw file):
now := cArgs.EvalCtx.Clock().Now()
if now.Less(h.Timestamp) {
    // The batch's timestamp should have been used to udpate the clock.
udpate
pkg/storage/txnwait/queue.go, line 653 at r2 (raw file):
)
metrics.DeadlocksTotal.Inc(1)
// TODO(nvanbenschoten): As it, it would make sense to return
If for whatever reason you back off of the last commit: s/As it/As is/
This commit removes the use of a PushRequest's batch header timestamp to determine whether a pushee has expired or not. This is the first step in a refactor to remove cases where we mutate requests on the server before command evaluation. There is no migration needed for this change because its impact is local to a single node: the leaseholder evaluating the request.

Release note: None
Force-pushed from ad0ed45 to c7dd87c.
TFTR! I actually decided to peel the entire final commit off for now. I don't think it's too much of an issue for the concurrency
package to have the ability to field requests directly with responses where possible, as long as it's never mutating requests.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/batcheval/cmd_push_txn.go, line 128 at r1 (raw file):
Previously, ajwerner wrote…
udpate
Done.
pkg/storage/txnwait/queue.go, line 653 at r2 (raw file):
Previously, ajwerner wrote…
If for whatever reason you back off of the last commit: s/As it/As is/
Removed the entire commit.
Force-pushed from c7dd87c to c601328.
Reviewed 4 of 8 files at r4, 4 of 4 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/txnwait/queue_test.go, line 31 at r5 (raw file):
func TestShouldPushImmediately(t *testing.T) { // TODO WIP
did you want to do anything about this?
This commit removes the interaction between the TxnWaitQueue and the Replica-level retry loop when transaction deadlocks are detected. Instead of instructing a waiting PushTxnRequest to change into a force push request, the queue now launches the abort push directly. This is cleaner and avoids the need to mutate the input batch in Replica.maybeWaitForPushee.

Release note: None
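A minimal sketch of the shape of this change, using hypothetical names rather than the real CockroachDB types: on deadlock detection, the queue sends a fresh abort push itself instead of rewriting the waiter's parked request for the retry loop to re-evaluate.

```go
package main

import "fmt"

type PushType int

const (
	PushTimestamp PushType = iota
	PushAbort
)

// PushRequest is a stand-in for a PushTxnRequest.
type PushRequest struct {
	PusheeID string
	Type     PushType
}

// pushSender abstracts sending a push; in the real code this would go
// through the node's internal client interface.
type pushSender func(PushRequest) error

// onDeadlock constructs and sends a new abort push. Crucially, it leaves
// the original request parked in the queue untouched, rather than
// mutating it into a force push.
func onDeadlock(send pushSender, waiting *PushRequest) error {
	forcePush := PushRequest{PusheeID: waiting.PusheeID, Type: PushAbort}
	return send(forcePush)
}

func main() {
	waiting := &PushRequest{PusheeID: "txn-a", Type: PushTimestamp}
	_ = onDeadlock(func(r PushRequest) error {
		fmt.Println("sent abort push for", r.PusheeID)
		return nil
	}, waiting)
	// The queued request is unchanged: still a timestamp push.
	fmt.Println(waiting.Type == PushTimestamp)
}
```

Keeping the waiter's request immutable is what lets the queue sit behind a narrow interface without the replica needing to re-evaluate a rewritten batch.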
Force-pushed from c601328 to a311978.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/storage/txnwait/queue_test.go, line 31 at r5 (raw file):
Previously, ajwerner wrote…
did you want to do anything about this?
Done.
bors r+
Build failed
Huh, looks like a resurfacing of #21444, which we never tracked down but were able to correlate with slow TeamCity workers. We do see some serious slowdown in the logs of this test, which I'm not able to reproduce.

bors r+
42765: storage: write to AbortSpan less, clean up AbortSpan more r=nvanbenschoten a=nvanbenschoten

This PR introduces two improvements to our handling of the AbortSpan. It also introduces the first set of comprehensive testing around AbortSpan entry state transitions, which was sorely missing. This comes after a few different customer issues that at least at first glance appeared to be AbortSpan leaks. There's still more to do to resolve those, mostly by improving GC, but anything we can do here on the frontend to reduce the number of AbortSpan entries that need to be GCed in the first place helps.

### Clean up span on non-poisoning, aborting EndTransaction request

Fixes #29128.

Before this change, an EndTransaction request sent to roll back a transaction record would not remove any AbortSpan entries, even if its own Poison flag was set to false. This allowed AbortSpan entries to leak. This commit fixes that behavior by removing the AbortSpan entry in this case. Concerns were raised in #29128 about whether this is safe. It turns out that we already do this on every Range except the transaction record's Range during normal intent resolution, so this isn't introducing any new concerns.

### Only write AbortSpan entries if intents are removed

This reduces the frequency of AbortSpan entries that can be abandoned even without a transaction coordinator failure. Specifically, it protects against the case where intent resolution races with a transaction coordinator cleaning up its own transaction record and intents. This can happen for both aborted and committed transactions.

In the first case, a pusher might find a transaction's intent and then find its record to be aborted after that transaction had cleanly rolled back its own intents. Even though the transaction's coordinator had already cleaned up and potentially "unpoisoned" AbortSpans, the pusher would happily re-introduce AbortSpan records when it goes to resolve the intents that were already cleaned up. These AbortSpan entries would be fully abandoned and have to wait out the GC.

Similarly, in the second case, the transaction might have committed. Here, the pusher might hit an intent, and the txn coordinator might clean up and auto-GC its txn record before the pusher arrives at the txn record. Once the pusher gets there, it would mistake the txn for aborted (by design) and proceed to write an AbortSpan record where the intent it had once observed had been (not by design).

We can tell both of these cases apart by simply recognizing whether intent resolution actually succeeds. If intent resolution doesn't find an intent, then we might be in either case. That's fine, because we only ever need to poison the abort span if we actually remove an intent that could confuse a zombie transaction.

43383: storage: stop modifying requests and returning responses from TxnWaitQueue r=nvanbenschoten a=nvanbenschoten

This PR contains three pieces of cleanup that pave the way for pulling the TxnWaitQueue underneath the upcoming `pkg/storage/concurrency` package. It removes all cases where the queue mandates an update to the PushTxnRequest that entered it. It also removes all cases where the queue could field requests directly and return results for them. This restricts the TxnWaitQueue's interface and allows it to be pushed into the concurrency manager abstraction without bloating this new abstraction.

Co-authored-by: Nathan VanBenschoten <[email protected]>
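The "only write AbortSpan entries if intents are removed" rule above can be sketched in a few lines. This is a hypothetical simplification (`AbortSpan`, `resolveIntent`, and `maybePoison` are illustrative names, not the real API): an entry is written only when intent resolution actually removed an intent, so a transaction whose coordinator already cleaned up leaves nothing behind for GC.

```go
package main

import "fmt"

// AbortSpan is a toy model: a set of poisoned transaction IDs.
type AbortSpan struct{ entries map[string]bool }

func (a *AbortSpan) poison(txnID string) { a.entries[txnID] = true }

// resolveIntent removes the intent for txnID at key, if one exists,
// and reports whether it actually removed anything.
func resolveIntent(intents map[string]string, key, txnID string) bool {
	if intents[key] == txnID {
		delete(intents, key)
		return true
	}
	return false
}

// maybePoison writes an AbortSpan entry only when an intent was removed.
// If no intent was found, the coordinator may already have cleaned up,
// and writing an entry would just be abandoned work for GC.
func maybePoison(a *AbortSpan, intents map[string]string, key, txnID string) {
	if resolveIntent(intents, key, txnID) {
		a.poison(txnID)
	}
}

func main() {
	a := &AbortSpan{entries: map[string]bool{}}
	intents := map[string]string{"k1": "txn-a"}
	maybePoison(a, intents, "k1", "txn-a") // live intent removed: entry written
	maybePoison(a, intents, "k2", "txn-b") // nothing to remove: no entry
	fmt.Println(a.entries["txn-a"], a.entries["txn-b"])
}
```

This matches the reasoning above: the abort span only needs to confuse-proof a zombie transaction when a live intent was actually taken away from it.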
#41482 (comment) just hit the same issue, so something's up, but it's not due to this PR.
Build succeeded

And happy new year from bors! 🎉
This PR contains three pieces of cleanup that pave the way for pulling the TxnWaitQueue underneath the upcoming `pkg/storage/concurrency` package. It removes all cases where the queue mandates an update to the PushTxnRequest that entered it. It also removes all cases where the queue could field requests directly and return results for them. This restricts the TxnWaitQueue's interface and allows it to be pushed into the concurrency manager abstraction without bloating this new abstraction.
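The narrowed contract described above can be pictured as a before/after interface sketch. These are hypothetical Go interfaces, not the actual CockroachDB signatures: the key point is that the queue may block a waiter and report an error, but it no longer rewrites the request or fabricates a response on the caller's behalf.

```go
package main

import "fmt"

type PushTxnRequest struct{ PusheeID string }
type PushTxnResponse struct{ Aborted bool }

// Before (illustrative): the queue could hand back a response directly
// and was free to mutate *req, widening its contract.
type wideQueue interface {
	MaybeWaitForPush(req *PushTxnRequest) (*PushTxnResponse, error)
}

// After (illustrative): wait-only. Evaluating the request stays with the
// normal command path once the wait finishes.
type narrowQueue interface {
	MaybeWaitForPush(req PushTxnRequest) error
}

type queue struct{}

func (queue) MaybeWaitForPush(req PushTxnRequest) error {
	// ... block until the pushee is pushable, deadlocked, or finalized ...
	return nil
}

func main() {
	var q narrowQueue = queue{}
	fmt.Println(q.MaybeWaitForPush(PushTxnRequest{PusheeID: "txn-a"}) == nil)
}
```

A wait-only surface like this is what lets the queue slot under a concurrency-manager abstraction without that abstraction growing request-mutation or response-construction responsibilities.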