storage/concurrency: prep lock-table for 20.1 release #44976

Closed
29 of 35 tasks
nvanbenschoten opened this issue Feb 11, 2020 · 1 comment
Labels: A-kv-transactions Relating to MVCC and the transactional model. T-kv KV Team

nvanbenschoten commented Feb 11, 2020

Release 20.1 will contain a new approach to kv-level transaction locking. Instead of storing all locks inline in the MVCC keyspace and finding locks during request evaluation, locks will now be maintained in a separate lock-table. While replicated locks (intents) will not be pulled into a persistent lock-table keyspace during this release, the lock-table structure will track these locks, as well as unreplicated (best-effort) locks which enable SELECT FOR UPDATE (see #40205). Additionally, the lock-table structure will lead to significant improvements in transaction queuing. The lock-table's lock wait-queues will replace the current contentionQueue, leading to improved fairness between transactions and reduced latency now that the lock wait-queues react directly to intent resolution. For more on this change, see #41720.

The rest of this issue enumerates the remaining work items for the lock-table and storage/concurrency package that should get into this release. It does not include any work items that are specific to SELECT FOR UPDATE's use of the storage/concurrency package.
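The lock-table idea above can be made concrete with a minimal Go sketch: a map from keys to a lock holder plus a FIFO wait-queue, where releasing a lock hands it directly to the head of the queue (the "react directly to intent resolution" behavior). Everything here is invented for illustration; the real lockTable is a concurrent btree with far richer state.

```go
package main

import "fmt"

// txnID identifies a transaction; a hypothetical stand-in for
// CockroachDB's UUID-based transaction IDs.
type txnID string

// lockState tracks one key's lock: the current holder (if any) and a
// FIFO wait-queue of requests, loosely mirroring the per-key
// wait-queues that replace the old contentionQueue.
type lockState struct {
	holder  txnID
	waiters []txnID
}

// lockTable maps keys to per-key lock state. A plain map keeps the
// sketch short; it is not safe for concurrent use.
type lockTable map[string]*lockState

// tryAcquire grants the lock if it is free or already held by txn;
// otherwise it enqueues txn in the key's wait-queue and reports false.
func (lt lockTable) tryAcquire(key string, txn txnID) bool {
	ls, ok := lt[key]
	if !ok {
		lt[key] = &lockState{holder: txn}
		return true
	}
	if ls.holder == "" || ls.holder == txn {
		ls.holder = txn
		return true
	}
	ls.waiters = append(ls.waiters, txn)
	return false
}

// release hands the lock to the head of the wait-queue, reacting
// directly to lock release rather than polling for it.
func (lt lockTable) release(key string) (next txnID) {
	ls := lt[key]
	ls.holder = ""
	if len(ls.waiters) > 0 {
		next, ls.waiters = ls.waiters[0], ls.waiters[1:]
		ls.holder = next
	}
	return next
}

func main() {
	lt := lockTable{}
	fmt.Println(lt.tryAcquire("k", "txn1")) // txn1 takes the free lock
	fmt.Println(lt.tryAcquire("k", "txn2")) // txn2 queues behind txn1
	fmt.Println(lt.release("k"))            // txn2 inherits the lock
}
```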

Must Have

Nice To Have

  • use txn priorities when ordering requests, so that a higher-priority request never waits on a lower-priority reservation holder @sumeerbhola
  • benchmark comparison between lazy waiting (current) and eager waiting in contended workloads @sumeerbhola
  • re-introduce invariant that reservation holder is always the request that acquires locks
  • distinguish in lockTableGuardImpl between waiting on lock holder and waiting on queued requests storage/concurrency: implement lockTableWaiter #44885
  • add visibility into state of lockTable through metrics and/or debug pages: kv/concurrency: expose lock table counters, introduce timeseries metrics #67350
  • simplify state transitions in acquireLock by not removing from lock holder's lockTableGuardImpl.mu.locks
  • eliminate waiting details except when waiting on lock holder, and do related code simplification @sumeerbhola
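The priority-ordering item in the list above amounts to keeping each wait-queue sorted by transaction priority rather than strictly FIFO, so a high-priority request overtakes lower-priority waiters. A hedged sketch, with invented names and a plain slice for the queue:

```go
package main

import (
	"fmt"
	"sort"
)

// waiter is a queued request carrying its transaction's priority.
// The field names are illustrative, not CockroachDB's.
type waiter struct {
	txn      string
	priority int
}

// enqueue inserts w while keeping the queue sorted by descending
// priority, so a higher-priority request is ordered ahead of
// lower-priority waiters instead of queuing strictly FIFO.
func enqueue(q []waiter, w waiter) []waiter {
	i := sort.Search(len(q), func(i int) bool { return q[i].priority < w.priority })
	q = append(q, waiter{})
	copy(q[i+1:], q[i:])
	q[i] = w
	return q
}

func main() {
	var q []waiter
	q = enqueue(q, waiter{"low", 1})
	q = enqueue(q, waiter{"high", 10})
	q = enqueue(q, waiter{"mid", 5})
	for _, w := range q {
		fmt.Println(w.txn) // high, mid, low
	}
}
```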

Feel free to modify. Please check off when changes merge.

@nvanbenschoten nvanbenschoten added the A-kv-transactions Relating to MVCC and the transactional model. label Feb 11, 2020
@nvanbenschoten nvanbenschoten added this to the 20.1 milestone Feb 11, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 13, 2020
Informs cockroachdb#41720.
Informs cockroachdb#44976.

This PR implements the concurrency.Manager interface, which is the
core structure that ties together the new concurrency package.

The concurrency manager is a structure that sequences incoming requests
and provides isolation between requests that intend to perform
conflicting operations. During sequencing, conflicts are discovered and
any found are resolved through a combination of passive queuing and
active pushing. Once a request has been sequenced, it is free to
evaluate without concerns of conflicting with other in-flight requests
due to the isolation provided by the manager. This isolation is
guaranteed for the lifetime of the request but terminates once the
request completes.

The manager accomplishes this by piecing together the following components
in its request sequencing path:
- `latchManager`
- `lockTable`
- `lockTableWaiter`
- `txnWaitQueue`
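The sequencing path that composes these components can be sketched as a retry loop: acquire latches, scan for conflicting locks, and on conflict release latches and wait before retrying. Only the component names come from the list above; the method shapes and all behavior below are invented for illustration.

```go
package main

import "fmt"

// Tiny interfaces standing in for the components the PR lists.
type latchManager interface{ acquire(req string) (release func()) }
type lockTable interface {
	scanForConflict(req string) (conflict string, found bool)
}
type lockTableWaiter interface{ waitOn(conflict string) }

// manager sequences a request: a rough shape of the "passive queuing
// and active pushing" the description mentions.
type manager struct {
	lm  latchManager
	lt  lockTable
	lw  lockTableWaiter
	log []string
}

func (m *manager) sequence(req string) {
	for {
		release := m.lm.acquire(req)
		if conflict, found := m.lt.scanForConflict(req); found {
			release() // don't hold latches while waiting
			m.lw.waitOn(conflict)
			continue
		}
		m.log = append(m.log, "sequenced "+req)
		release()
		return
	}
}

// fakes implements all three components; each scan "resolves" one
// conflict, as if the holder's lock were released while we waited.
type fakes struct{ conflictsLeft int }

func (f *fakes) acquire(req string) func() { return func() {} }
func (f *fakes) scanForConflict(req string) (string, bool) {
	if f.conflictsLeft > 0 {
		f.conflictsLeft--
		return "txn1", true
	}
	return "", false
}
func (f *fakes) waitOn(conflict string) {}

func main() {
	f := &fakes{conflictsLeft: 1}
	m := &manager{lm: f, lt: f, lw: f}
	m.sequence("get(k)")
	fmt.Println(m.log[0]) // sequenced get(k)
}
```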

The largest part of this change is introducing the datadriven testing
framework to deterministically test the concurrency manager. This proved
difficult for two reasons:
1. the concurrency manager composes several components to perform its
   work (latchManager, lockTable, lockTableWaiter, txnWaitQueue). It was
   difficult to get consistent observability into each of these components
   in such a way that tests could be run against a set of concurrent requests
   interacting with them all.
2. the concurrency manager exposes a primarily blocking interface. Requests
   call `Sequence()` and wait for sequencing to complete. This may block in
   a number of different places - while waiting on latches, while waiting on
   locks, and while waiting on other transactions. The most important part
   of these tests is to assert _where_ a given request blocks based on the
   current state of the concurrency manager and then assert _how_ the request
   reacts to a state transition by another request.

To address the first problem, the testing harness uses the context-carried
tracing infrastructure to track the path of a request. We already had log
events scattered throughout these various components, so this did not require
digging testing hooks into each of them. Instead, the harness attaches a
trace recording span to each request and watches as events are added to the
span. It then uses these events as the output of the test.

To address the second problem, the testing harness introduces a monitor
object which manages a collection of "monitored" goroutines. The monitor
watches as these goroutines run and keeps track of their goroutine state as
is reported by a goroutine dump. During each step of the datadriven test, the
monitor allows all goroutines to proceed until they have either terminated or
stalled due to cross-goroutine synchronization dependencies. For instance, it
waits for all goroutines to stall while receiving from channels. We can be
sure that the goroutine dump provides a consistent snapshot of all goroutine
states and statuses because `runtime.Stack(all=true)` stops the world when
called. This means that when all monitored goroutines are simultaneously
stalled, we have a deadlock that can only be resolved by proceeding forward
with the test and releasing latches, resolving locks, or committing
transactions. This structure worked surprisingly well and has held up to
long periods of stressrace.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 14, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 18, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 18, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 18, 2020
craig bot pushed a commit that referenced this issue Feb 19, 2020
45062: storage/concurrency: implement concurrency Manager r=nvanbenschoten a=nvanbenschoten

Informs #41720.
Informs #44976.
Initially drawn out in #43775.

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 20, 2020
The Intent message type has been a frequent source of complexity and
confusion. cockroachdb#42152/cockroachdb#42582 found this out first-hand when they tried to
add an IgnoredSeqNums field to the message type. This led to a series
of questions that were not easily answered by simply reading about
the message type, like:
- why would an Intent ever have a non-PENDING status?
- why would an Intent ever span multiple keys?
- where are Intents created?
- does this Intent message contain the authoritative state of the txn?
- does this Intent message have a populated IgnoredSeqNums slice?

The root problem was that the message type was serving two roles, as was
documented in its comment. It both referenced on-disk write intents and
served as a way to talk about updates that should be made to these write
intents.

This commit addresses this issue by splitting Intent into two
(wire-compatible) messages with distinct purposes - Intent and LockUpdate.
Intent's new sole purpose is now to reference persistent on-disk write
intents on return paths of MVCC operations. They are only ever created
when on-disk intents are encountered. LockUpdate's sole purpose is to
represent updates that are intended for all locks held by the transaction
within its span. They are only ever created after their transaction's
record has been consulted, either after an EndTxn or a PushTxn request.
This split simplifies the use of each of these types and helps make
the questions above easier to answer or impossible to even ask (e.g.
Intent no longer has a Status field).

While here, the commit adds a `durability` field to LockUpdate. This is
used to inform the message's recipient about the durability of the
lock(s) it is being instructed to update. This will be important soon
when cockroachdb#44976 formally introduces unreplicated locks to the key-value
layer.
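A rough Go rendering of the split can make the two roles concrete. The actual messages are wire-compatible protobufs in `pkg/roachpb`; the struct shapes below are illustrative only.

```go
package main

import "fmt"

// Durability distinguishes replicated locks (intents) from the
// best-effort unreplicated locks. Values mirror the concepts in the
// commit; the Go representation is invented.
type Durability int

const (
	Replicated Durability = iota
	Unreplicated
)

// Intent references a persistent on-disk write intent discovered on
// the return path of an MVCC operation. Note that it carries no
// transaction status field.
type Intent struct {
	Key string
	Txn string
}

// LockUpdate describes an update to apply to all locks a transaction
// holds within a span, created only after the transaction's record
// has been consulted (after an EndTxn or PushTxn request).
type LockUpdate struct {
	Span       [2]string
	Txn        string
	Status     string // e.g. COMMITTED or ABORTED
	Durability Durability
}

func main() {
	in := Intent{Key: "a", Txn: "txn1"}
	up := LockUpdate{Span: [2]string{"a", "c"}, Txn: "txn1",
		Status: "COMMITTED", Durability: Replicated}
	fmt.Printf("found intent on %s; updating locks in [%s,%s) to %s\n",
		in.Key, up.Span[0], up.Span[1], up.Status)
}
```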

I would suggest that reviewers start with the proto changes in `pkg/roachpb`
and fan out from there. Everything else is essentially fallout from the
typing change. This is still a bit of a WIP as there is more renaming
necessary before this could be merged, but I want to get people's opinions
on this sooner rather than later because it's partly on the critical path
for cockroachdb#44976 and I'm feeling a little pressed for time with that change and
the impending stability period.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 20, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 21, 2020
craig bot pushed a commit that referenced this issue Feb 21, 2020
45235: roachpb: split Intent proto message into Intent and LockUpdate messages r=nvanbenschoten a=nvanbenschoten

The Intent message type has been a frequent source of complexity and confusion. @knz found this out first-hand in #42152/#42582 when he tried to add an `IgnoredSeqNums` field to the message type. This led to a series of questions that were not easily answered by simply reading about the message type, like:
- why would an Intent ever have a non-PENDING status?
- why would an Intent ever span multiple keys?
- where are Intents created?
- is this Intent message containing the authoritative state of the txn?
- does this Intent message have a populated IgnoredSeqNums slice?

The root problem was that the message type was serving two roles, as was documented in its comment. It both referenced on-disk write intents and served as a way to talk about updates that should be made to these write intents.

This commit addresses this issue by splitting Intent into two (wire-compatible) messages with distinct purposes - Intent and LockUpdate. Intent's sole purpose is now to reference persistent on-disk write intents on return paths of MVCC operations. They are only ever created when on-disk intents are encountered. LockUpdate's sole purpose is to represent updates that are intended for all locks held by the transaction within its span. They are only ever created after their transaction's record has been consulted, either after an EndTxn or a PushTxn request. This split simplifies the use of each of these types and helps make the questions above easier to answer or impossible to even ask (e.g. Intent no longer has a Status field).

While here, the commit adds a `durability` field to LockUpdate. This is used to inform the message's recipient about the durability of the lock(s) it is being instructed to update. This will be important soon when #44976 formally introduces unreplicated locks to the key-value layer.

I would suggest that reviewers start with the proto changes in `pkg/roachpb` and fan out from there. Everything else is essentially fallout from the typing change. This is still a bit of a WIP as there is more renaming necessary before this could be merged, but I want to get people's opinions on this sooner rather than later because it's partly on the critical path for #44976 and I'm feeling a little pressed for time with that change and the impending stability period.

Co-authored-by: Nathan VanBenschoten <[email protected]>
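To make the split concrete, here is a minimal Go sketch of the two message shapes. The field sets and names are illustrative assumptions for exposition only; the real protobuf definitions live in `pkg/roachpb` and differ in detail.

```go
package main

import "fmt"

// TxnMeta stands in for the transaction metadata both messages carry.
type TxnMeta struct{ ID string }

// Durability mirrors the new field on LockUpdate: replicated locks
// (write intents) survive through Raft; unreplicated locks are
// best-effort and in-memory only.
type Durability int

const (
	Replicated Durability = iota
	Unreplicated
)

// Intent references a single persistent on-disk write intent that was
// encountered during MVCC evaluation. It is implicitly PENDING, so it
// carries no Status field.
type Intent struct {
	Txn TxnMeta
	Key string
}

// LockUpdate describes an update to apply to all locks held by Txn in
// the span [Key, EndKey). It is built only after the transaction
// record has been consulted, so it can carry an authoritative Status.
type LockUpdate struct {
	Txn            TxnMeta
	Key, EndKey    string
	Status         string // e.g. "COMMITTED" or "ABORTED"
	IgnoredSeqNums []int32
	Durability     Durability
}

func main() {
	up := LockUpdate{
		Txn: TxnMeta{ID: "txn-1"}, Key: "a", EndKey: "z",
		Status: "COMMITTED", Durability: Replicated,
	}
	fmt.Println(up.Status, up.Durability == Replicated) // prints: COMMITTED true
}
```

The split means a reader can no longer ask "what Status does this Intent have?" - the type system answers it for them.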
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 27, 2020
Related to cockroachdb#41720.
Related to cockroachdb#44976.

This commit integrates the new concurrency package into the storage
package. Each Replica is given a concurrency manager, which replaces
its existing latch manager and txn wait queue. The change also uses
the concurrency manager to simplify the role of the intent resolver.
The intent resolver no longer directly handles WriteIntentErrors. As
a result, we are able to delete the contention queue entirely.

With this change, all requests are now sequenced through the concurrency
manager. When sequencing, latches are acquired and conflicting locks are
detected. If any locks are found, the requests wait in lock wait-queues
for the locks to be resolved. This is a major deviation from how things
currently work because today, even with the contention queue, requests
end up waiting for conflicting transactions to commit/abort in the
txnWaitQueue after at least one RPC. Now, requests wait directly next
to the intents/locks that they are waiting on and react directly to the
resolution of these intents/locks.
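That sequencing flow can be sketched roughly as follows. All names here (`request`, `lockTable`, `sequence`) are hypothetical stand-ins, not the real storage/concurrency interfaces.

```go
package main

import "fmt"

type request struct {
	txnID string
	key   string
}

type lockTable struct {
	// key -> ID of the txn holding the lock; a real lock table also
	// tracks wait-queues, lock strength, and durability per lock.
	held map[string]string
}

// sequence acquires latches, then scans for conflicting locks. On a
// conflict the request would enqueue in that lock's wait-queue and
// block until resolution; here we just report the conflict.
func sequence(lt *lockTable, req request) (conflict string, ok bool) {
	// (1) Latch acquisition would go here: short-lived mutual
	//     exclusion with overlapping in-flight requests.
	// (2) Lock-table scan:
	if holder, found := lt.held[req.key]; found && holder != req.txnID {
		return holder, false // wait in this lock's wait-queue
	}
	return "", true // fully sequenced; proceed to evaluation
}

func main() {
	lt := &lockTable{held: map[string]string{"k1": "txnA"}}
	fmt.Println(sequence(lt, request{txnID: "txnB", key: "k1"}))
	fmt.Println(sequence(lt, request{txnID: "txnB", key: "k2"}))
}
```

The key property is that a blocked request is parked next to the specific lock it conflicts with, so releasing that lock can wake it directly, without an extra RPC round-trip.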

Once requests are sequenced by the concurrency manager, they are
theoretically fully isolated from all conflicting requests. However,
this is not strictly true today because we have not yet pulled all
replicated locks into the concurrency manager's lock table. We will
do so in a future change. Until then, the concurrency manager maintains
a notion of "intent discovery", which is integrated into the Replica-level
concurrency retry loop.

Performance numbers will be published shortly. This will be followed
by performance numbers using the SELECT FOR UPDATE locking (cockroachdb#40205)
improvements that this change enables.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 28, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 2, 2020
craig bot pushed a commit that referenced this issue Mar 2, 2020
45269: importccl: Parallelize avro import r=miretskiy a=miretskiy

Parallelize the Avro importer to improve its throughput (a 2.8x improvement).

Touches #40374.
Fixes #45097.

Release notes (performance): Faster Avro import
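A change like this typically follows the standard Go fan-out shape: a producer feeds record batches to a pool of decoding workers. The sketch below shows only that generic pattern, not the actual importccl code (`parallelDecode` is a hypothetical name):

```go
package main

import (
	"fmt"
	"sync"
)

// parallelDecode fans batches out to `workers` goroutines, each running
// the CPU-heavy decode step, and sums their results under a mutex.
func parallelDecode(batches []string, workers int, decode func(string) int) int {
	in := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range in {
				n := decode(b) // CPU-heavy Avro decoding would happen here
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}
	for _, b := range batches {
		in <- b
	}
	close(in)
	wg.Wait()
	return total
}

func main() {
	n := parallelDecode([]string{"a", "bb", "ccc"}, 2, func(s string) int { return len(s) })
	fmt.Println(n) // prints: 6
}
```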

45482: storage: integrate Concurrency Manager into Replica request path r=nvanbenschoten a=nvanbenschoten

Related to #41720.
Related to #44976.


45484: sql: simplify connection state machine - stop tracking retry intent r=andreimatei a=andreimatei

Before this patch, the SQL connection state machine had an optimization:
if a transaction that hadn't used "SAVEPOINT cockroach_restart"
encountered a retriable error that we couldn't auto-retry, we'd release
the txn's locks eagerly and enter the Aborted state, as opposed to
transactions that had used "SAVEPOINT cockroach_restart", which went to
RestartWait.
This optimization was a significant complication for the state machine,
so this patch removes it. All transactions now go to RestartWait and
wait for a ROLLBACK to release their locks.

On the flip side, "RELEASE SAVEPOINT cockroach_restart" and
"ROLLBACK TO SAVEPOINT cockroach_restart" now work even for transactions
that haven't explicitly declared that savepoint, which is nice, although
I don't promise I'll keep it working.

Release note: None
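The simplified transition can be written as a tiny state function. The state names below are illustrative stand-ins, not the real conn_executor types:

```go
package main

import "fmt"

type txnState string

const (
	Open        txnState = "Open"
	RestartWait txnState = "RestartWait"
	Aborted     txnState = "Aborted"
)

// onRetriableError returns the state after a non-auto-retriable
// retriable error. Before the patch, usedRestartSavepoint picked
// between Aborted (locks released eagerly) and RestartWait; now it is
// ignored and every txn waits in RestartWait for a ROLLBACK to
// release its locks.
func onRetriableError(s txnState, usedRestartSavepoint bool) txnState {
	_ = usedRestartSavepoint // no longer consulted
	return RestartWait
}

func main() {
	fmt.Println(onRetriableError(Open, false)) // prints: RestartWait
	fmt.Println(onRetriableError(Open, true))  // prints: RestartWait
}
```

Dropping the `usedRestartSavepoint` branch is what removes the extra path through the state machine.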

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@nvanbenschoten
Member Author

Closing. This issue served us well, but it is now stale.
