[BACKPORT 2.16][#19407] YSQL: Single shard operations should pick read time only after conflict resolution

Summary:
Single shard UPDATEs and DELETEs pick a read time on docdb after
conflict resolution. However, single shard INSERTs pick a read
time on PgClientSession (same node as the query layer).

Unlike distributed transactions, conflict resolution of
single shard operations in YB is optimized to not check
regular db for conflicting data that has been committed. This
is correct as long as we don't pick a read time until
the operation is sure that no conflicts exist and also holds
the in-memory locks (in other words, after conflict resolution
is successful).

Since a single shard INSERT picks a read time in the query
layer (i.e., much earlier, before conflict resolution), it can
hit a correctness issue via the following sequence of steps:

In Fail-on-Conflict mode (incorrect behaviour with low probability):
-------------------------------------------------------------------

Assume two single shard INSERTs trying to insert the same row.

(1) Insert 1: pick a read time (rt1=5) on PgClientSession
(2) Insert 2: pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (only check intents db).
(4) Insert 2: wait for in-memory lock acquisition
(5) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1,
write row in regular db with commit timestamp 10.
(6) Insert 2: acquire in-memory locks and do conflict checking (only check intents db).
(7) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2,
write row in regular db with commit timestamp 12. (Assume that clock skew is 2 time
units, so no kReadRestart is hit)

To repro this manually, add a sleep for 1000ms at the start of ApplyInsert()
and perform concurrent single shard INSERTs such that the above scenario is hit.
Use a sleep higher than max_clock_skew_usec; with a lower sleep,
Insert 2 would hit a kReadRestart, be retried by the query
layer, and fail with the expected duplicate key error.

In Wait-on-Conflict mode (easier to repro manually):
----------------------------------------------------
[Refer to the newly added test concurrent-inserts-duplicate-key-error]

Assume one single shard INSERT and one distributed transaction INSERT trying to
insert the same row.

(1) Insert 1 (distributed): pick a read time (rt1=5) on PgClientSession
(2) Insert 2 (single shard): pick a read time (rt2=6) on PgClientSession
(3) Insert 1: acquire in-memory locks and do conflict checking (check intents db & regular db).
(4) Insert 1: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt1,
write row in intents db.
(5) Insert 2: acquire in-memory locks, find conflicting intent, release in-memory locks and
enter wait queue
(6) Insert 1: `commit;` of distributed txn writes data to regular db with commit timestamp 10.
(7) Insert 2: wake up from wait queue, acquire in-memory locks again, check for conflicting
intents again, none will be found.
(8) Insert 2: check for duplicate row in ApplyInsert(). Since no duplicates exist as of rt2,
write row in regular db with commit timestamp 12.

The crux of the issue is: since the read time for a single shard INSERT
is picked before conflict resolution, it will miss reading rows in regular db
that are concurrently committed by another transaction with a timestamp
higher than the read time.

Solution:
---------

(1) Add a check in docdb to error out if a read time exists before conflict
resolution in the single shard operation path.

(2) For single shard UPDATEs and DELETEs, we pick a read time on docdb
because EnsureReadTimeIsSet is false in pg_session.cc for these. For
single shard INSERTs we currently set EnsureReadTimeIsSet to true, which
is not necessary. Changing it to false fixes the above issue because the
read time will now be picked on docdb after conflict resolution is done.

Picking the read time on docdb has some extra advantages too:

(i) not having to wait for the safe time to catch up to a read time picked
on another node's PgClientSession.

(ii) docdb can retry various errors (such as kReadRestart, kConflict) itself
without going back to the query layer.

Apart from fixing a correctness issue, this change is similar to two earlier
commits that ensure we pick read times on docdb in as many cases as
possible: 8166695 and b223af9.
Jira: DB-8199

Original commit: fc21068 / D29133

Test Plan: ./yb_build.sh --java-test org.yb.pgsql.TestPgIsolationRegress (added concurrent-inserts-duplicate-key-error in this)

Reviewers: dmitry, bkolagani, rsami, sergei, tvesely

Reviewed By: dmitry

Subscribers: yql, rsami, bkolagani, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29856
pkj415 committed Nov 8, 2023
1 parent eb8d237 commit 0cdc87d
Showing 5 changed files with 96 additions and 2 deletions.
@@ -0,0 +1,49 @@
Parsed test spec with 3 sessions

starting permutation: s1_ins s2_ins s1_commit s1_select
step s1_ins: INSERT INTO test VALUES (2, 2);
step s2_ins: INSERT INTO test VALUES (2, 4); <waiting ...>
step s1_commit: COMMIT;
step s2_ins: <... completed>
error in steps s1_commit s2_ins: ERROR: duplicate key value violates unique constraint "test_pkey"
step s1_select: SELECT * FROM test;
k v

2 2

starting permutation: s1_ins s3_del s2_ins s1_commit s1_select
step s1_ins: INSERT INTO test VALUES (2, 2);
step s3_del: DELETE FROM test WHERE k=2; <waiting ...>
step s2_ins: INSERT INTO test VALUES (2, 4); <waiting ...>
step s1_commit: COMMIT;
step s3_del: <... completed>
step s2_ins: <... completed>
step s1_select: SELECT * FROM test;
k v

2 4

starting permutation: s1_ins s2_multi_insert s1_commit s1_select
step s1_ins: INSERT INTO test VALUES (2, 2);
step s2_multi_insert: INSERT INTO test VALUES (1, 1), (2, 4), (3, 3); <waiting ...>
step s1_commit: COMMIT;
step s2_multi_insert: <... completed>
error in steps s1_commit s2_multi_insert: ERROR: duplicate key value violates unique constraint "test_pkey"
step s1_select: SELECT * FROM test;
k v

2 2

starting permutation: s1_ins s3_del s2_multi_insert s1_commit s1_select
step s1_ins: INSERT INTO test VALUES (2, 2);
step s3_del: DELETE FROM test WHERE k=2; <waiting ...>
step s2_multi_insert: INSERT INTO test VALUES (1, 1), (2, 4), (3, 3); <waiting ...>
step s1_commit: COMMIT;
step s3_del: <... completed>
step s2_multi_insert: <... completed>
step s1_select: SELECT * FROM test;
k v

1 1
2 4
3 3
@@ -0,0 +1,27 @@
setup
{
CREATE TABLE test (k int PRIMARY KEY, v int);
}

teardown
{
DROP TABLE test;
}

session "s1"
setup { BEGIN; }
step "s1_ins" { INSERT INTO test VALUES (2, 2); }
step "s1_commit" { COMMIT; }
step "s1_select" { SELECT * FROM test; }

session "s2"
step "s2_ins" { INSERT INTO test VALUES (2, 4); }
step "s2_multi_insert" { INSERT INTO test VALUES (1, 1), (2, 4), (3, 3); }

session "s3"
step "s3_del" { DELETE FROM test WHERE k=2; }

permutation "s1_ins" "s2_ins" "s1_commit" "s1_select"
permutation "s1_ins" "s3_del" "s2_ins" "s1_commit" "s1_select"
permutation "s1_ins" "s2_multi_insert" "s1_commit" "s1_select"
permutation "s1_ins" "s3_del" "s2_multi_insert" "s1_commit" "s1_select"
3 changes: 2 additions & 1 deletion src/postgres/src/test/isolation/yb_wait_queues_schedule
@@ -1,3 +1,4 @@
test: yb-wait-queues-single-shard-waiter-unlocks
test: yb-wait-queues-single-shard-waiters-same-tablet
test: yb-wait-queues-weak-read-unlocks
test: concurrent-inserts-duplicate-key-error
15 changes: 15 additions & 0 deletions src/yb/tablet/write_query.cc
@@ -509,6 +509,21 @@ Status WriteQuery::DoExecute() {
auto now = tablet().clock()->Now();
auto conflict_management_policy = GetConflictManagementPolicy(
tablet().wait_queue(), write_batch);

{
// TODO(#19498): Enable the below check if possible. Right now we can't enable it because the
// read time for all YCQL operations is picked on the YCQL query layer, and this might be
// indicative of some correctness bugs similar to #19407 which was seen on YSQL.

// Read time should not be picked until conflict resolution is successful for the single shard
// operation path. This is because ResolveOperationConflicts() doesn't check regular db for
// conflicting data committed in regular db. If in future, we have to read data before
// conflict resolution, we should check conflicts in regular db too.

// RSTATUS_DCHECK(
// !read_time_, IllegalState,
// "Read time was picked before conflict resolution for a single shard operation.");
}
return docdb::ResolveOperationConflicts(
doc_ops_, conflict_management_policy, now, tablet().doc_db(),
partial_range_key_intents, transaction_participant,
4 changes: 3 additions & 1 deletion src/yb/yql/pggate/pg_dml_write.cc
@@ -121,7 +121,9 @@ Status PgDmlWrite::Exec(ForceNonBufferable force_non_bufferable) {

// Execute the statement. If the request has been sent, get the result and handle any rows
// returned.
-  if (VERIFY_RESULT(doc_op_->Execute(force_non_bufferable)) == RequestSent::kTrue) {
+  if (VERIFY_RESULT(doc_op_->Execute(ForceNonBufferable(
+          force_non_bufferable.get() ||
+          (transaction_setting_ == YB_SINGLE_SHARD_TRANSACTION)))) == RequestSent::kTrue) {
rowsets_.splice(rowsets_.end(), VERIFY_RESULT(doc_op_->GetResult()));

// Save the number of rows affected by the op.
