Commit
[BACKPORT 2.16][#19407] YSQL: Single shard operations should pick read time only after conflict resolution

Summary:
Single shard UPDATEs and DELETEs pick a read time on docdb after conflict resolution. However, single shard INSERTs pick a read time on PgClientSession (same node as the query layer).

Unlike distributed transactions, conflict resolution of single shard operations in YB is optimized to not check regular db for conflicting data that has been committed. This is correct as long as we don't pick a read time until the operation is sure that no conflicts exist and also holds the in-memory locks (in other words, after conflict resolution is successful).

Since a single shard INSERT picks a read time in the query layer (i.e., well before conflict resolution), it can face a correctness issue due to the following steps in order:

In Fail-on-Conflict mode (incorrect behaviour with low probability):
-------------------------------------------------------------------
Assume two single shard INSERTs trying to insert the same row.

(1) Insert 1: pick a read time (rt1=5) on PgClientSession
(2) Insert 2: pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (only check intents db).
(4) Insert 2: wait for in-memory lock acquisition
(5) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1, write row in regular db with commit timestamp 10.
(6) Insert 2: acquire in-memory locks and do conflict checking (only check intents db).
(7) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2, write row in regular db with commit timestamp 12.

(Assume that clock skew is 2 time units, so no kReadRestart is hit)

To repro this manually, add a sleep for 1000ms at the start of ApplyInsert() and perform concurrent single shard INSERTs such that the above scenario is hit.
Use a sleep higher than max_clock_skew_usec: with a lower sleep, Insert 2 will face a kReadRestart, be retried by the query layer, and face a duplicate key error as expected.

In Wait-on-Conflict mode (easier to repro manually):
----------------------------------------------------
[Refer to the newly added test concurrent-inserts-duplicate-key-error]

Assume one single shard INSERT and one distributed transaction INSERT trying to insert the same row.

(1) Insert 1 (distributed): pick a read time (rt1=5) on PgClientSession
(2) Insert 2 (single shard): pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (check intents db & regular db).
(4) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1, write row in intents db.
(5) Insert 2: acquire in-memory locks, find conflicting intent, release in-memory locks and enter wait queue
(6) Insert 1: `commit;` of distributed txn writes data to regular db with commit timestamp 10.
(7) Insert 2: wake up from wait queue, acquire in-memory locks again, check for conflicting intents again, none will be found.
(8) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2, write row in regular db with commit timestamp 12.

The crux of the issue is: since the read time for a single shard INSERT is picked before conflict resolution, it will miss reading rows in regular db that are concurrently committed by another transaction with a timestamp higher than the read time.

Solution:
---------
(1) Add a check in docdb to error out if a read time exists before conflict resolution in the single shard operation path.
(2) For single shard UPDATEs and DELETEs, we pick a read time on docdb because EnsureReadTimeIsSet is false in pg_session.cc for these. For single shard INSERTs we currently set EnsureReadTimeIsSet to true, which is not necessary.
Changing it to false fixes the above issue because the read time will now be picked on docdb after conflict resolution is done.

Picking the read time on docdb has some extra advantages too:
(i) not having to wait for the safe time to catch up to a read time picked on another node's PgClientSession.
(ii) docdb can retry various errors (such as kReadRestart, kConflict) itself without going back to the query layer.

Apart from fixing a correctness issue, this change is similar to two earlier commits that ensure we pick read times on docdb in as many cases as possible: 8166695 and b223af9.

Jira: DB-8199

Original commit: fc21068 / D29133

Test Plan:
./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress (added concurrent-inserts-duplicate-key-error in this)

Reviewers: dmitry, bkolagani, rsami, sergei, tvesely

Reviewed By: dmitry

Subscribers: yql, rsami, bkolagani, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29856
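
Illustration (not part of the change itself): the timestamp reasoning in the two scenarios above can be replayed with a toy model. The sketch below is a simplified Python stand-in, not YugabyteDB code. `ToyRegularDb`, `DuplicateKeyError`, and both `run_*` functions are hypothetical names, locks and the intents db are not modelled, and the late read time of 11 is an assumed value that is merely later than the first commit timestamp (10).

```python
class DuplicateKeyError(Exception):
    """Raised when the duplicate check sees an existing row."""


class ToyRegularDb:
    """Hypothetical stand-in for regular db: key -> list of commit timestamps."""

    def __init__(self):
        self.versions = {}

    def exists_as_of(self, key, read_time):
        # A row is visible only if it was committed at or before read_time.
        return any(ts <= read_time for ts in self.versions.get(key, []))

    def insert(self, key, read_time, commit_time):
        # Mirrors the duplicate check described for ApplyInsert() in the text.
        if self.exists_as_of(key, read_time):
            raise DuplicateKeyError(key)
        self.versions.setdefault(key, []).append(commit_time)


def run_early_read_time():
    """Both read times are picked before conflict resolution (rt1=5, rt2=6)."""
    db = ToyRegularDb()
    rt1, rt2 = 5, 6
    db.insert("k", read_time=rt1, commit_time=10)
    # Insert 2 reads as of 6, so it cannot see the row committed at 10.
    db.insert("k", read_time=rt2, commit_time=12)
    return len(db.versions["k"])


def run_late_read_time():
    """Insert 2 picks its read time only after Insert 1 has committed."""
    db = ToyRegularDb()
    db.insert("k", read_time=5, commit_time=10)
    rt2 = 11  # assumed: picked after conflict resolution, hence after commit at 10
    try:
        db.insert("k", read_time=rt2, commit_time=12)
    except DuplicateKeyError:
        return "duplicate key error"
    return "inserted"


if __name__ == "__main__":
    print("early read time, versions of k:", run_early_read_time())
    print("late read time, second insert:", run_late_read_time())
```

With the early read time, the same key ends up with two committed versions. With the late read time, the second insert is rejected, which is the behaviour the fix aims for.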