Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] YB does not release LWLocks on AbortTransaction #19509

Closed
1 task done
timothy-e opened this issue Oct 12, 2023 · 3 comments
Closed
1 task done

[YSQL] YB does not release LWLocks on AbortTransaction #19509

timothy-e opened this issue Oct 12, 2023 · 3 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@timothy-e
Copy link
Contributor

timothy-e commented Oct 12, 2023

Jira Link: DB-8305

Description

In PostgreSQL when ereport is called with level = ERROR that automatically invokes the AbortTransaction. Currently as part of AbortTransaction, lwlocks held by the Process are not released.
As the locks are not released, that prevents the subsequent acquisition of the lock and causes stuck PostgreSQL processes.

All the lwlocks should be released as part of AbortTransaction. This can happen in any path when lwlock is acquired and ereport is called with level = ERROR .

One Example Scenario which leads to this:

Deadlock when creating replication slots with duplicate names:

select pg_create_logical_replication_slot('test_slot', 'pgoutput');
select pg_create_logical_replication_slot('test_slot', 'pgoutput');
ERROR:  replication slot "test_slot" already exists --> This did not release the acquired locks.
select pg_create_logical_replication_slot('test_slot', 'pgoutput'); -- now deadlocked

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@timothy-e timothy-e added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Oct 12, 2023
@timothy-e timothy-e self-assigned this Oct 12, 2023
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue and removed status/awaiting-triage Issue awaiting triage labels Oct 12, 2023
@timothy-e
Copy link
Contributor Author

timothy-e commented Oct 12, 2023

There are a seperate type of locks, identified by LOCKTAG that we don't use in YB, so function calls have been disabled. Unfortunately, the LWLockReleaseAll function was accidentally disabled as well.

Fix:

diff --git a/src/postgres/src/backend/access/transam/xact.c b/src/postgres/src/backend/access/transam/xact.c
index 138dc4002f..3fe664da7d 100644
--- a/src/postgres/src/backend/access/transam/xact.c
+++ b/src/postgres/src/backend/access/transam/xact.c
@@ -2691,15 +2691,13 @@ AbortTransaction(void)
        AtAbort_Memory();
        AtAbort_ResourceOwner();
​
-       if (YBIsPgLockingEnabled()) {
-               /*
-               * Release any LW locks we might be holding as quickly as possible.
-               * (Regular locks, however, must be held till we finish aborting.)
-               * Releasing LW locks is critical since we might try to grab them again
-               * while cleaning up!
-               */
-               LWLockReleaseAll();
-       }
+       /*
+        * Release any LW locks we might be holding as quickly as possible.
+        * (Regular locks, however, must be held till we finish aborting.)
+        * Releasing LW locks is critical since we might try to grab them again
+        * while cleaning up!
+        */
+       LWLockReleaseAll();
​
        /* Clear wait information and command progress indicator */
        pgstat_report_wait_end();

@yugabyte-ci yugabyte-ci assigned dr0pdb and unassigned timothy-e Oct 17, 2023
dr0pdb added a commit that referenced this issue Oct 29, 2023
… of replication slots

Summary:
Introduce support for creating, viewing, and dropping replication slots in YSQL. This is part of the project to support Publication/Replication slot API in YSQL (#18724).

There are two interfaces for the support for create and drop:
1. Functions:
    - `pg_create_logical_replication_slot`
    -  `pg_drop_replication_slot`
2. Walsender commands:
    - CREATE_REPLICATION_SLOT
    - DROP_REPLICATION_SLOT

Both create and drop statements go to yb-master directly. Most of the PG code isn't applicable to YSQL yet and hence it is skipped.

For viewing replication slots, we have a view `pg_replication_slots` which is backed by the function `pg_get_replication_slots`. The schema of the view has been modified by adding an extra yb-specific column `yb_stream_id` which is a text.

Limitations:
1. Only `yboutput` plugin is supported. It'll only be relevant once we add support for consuming replication slots but we are enforcing it from this diff onwards

Apart from the above, this diff fixes two issues:
1. #19509 - Cleanup of held locks in case of an `ereport(elevel >= ERROR)`. This diff fixes that by making sure that we call `LWLockReleaseAll` in `src/postgres/src/backend/access/transam/xact.c` in case of an error. Thanks to Timothy Elgersma.
2. Skipping cache refresh in case of an error in the execution of a replication command. `src/postgres/src/backend/tcop/postgres.c`. This is ok because we only cache DMLs and none of the replication commands are DMLs. We need to do that because the check `yb_is_dml_command` tries to parse the query to check whether it is a DML or not but it doesn't support replication commands. So any `ereport(elevel >= ERROR)` in the execution of a replication command would lead to a syntax error.

TODOs for future:
1. This diff creates a CDC stream with CDCRecordType as `CHANGE`. We need to extend the `pg_create_logical_replication_slot` and `CREATE_REPLICATION_SLOT` syntax to take the CDCRecordType. It'll be done in a future diff
2. DROP_REPLICATION_SLOT commands allows waiting for a slot to become inactive before dropping it. It is unsupported currently and will be done in a future diff
3. Temporary replication slots are unsupported. Will be added in future once we also support consumption via Walsender

Upgrade\Rollback safety:
These changes rely on sys-catalog changes done in yb-master. As a result, all the commands are disabled during upgrade using an autoflag yb_enable_replication_commands (LocalPersisted) and will only be enabled once the user has committed to the new version.

The autoflag was introduced during the Publication syntax support and is being reused here since these are both part of the same project: https://phorge.dev.yugabyte.com/D28721
Jira: DB-8008, DB-8009, DB-8305

Test Plan:
New unit test

```
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot'
```

New Regress test
```
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressReplicationSlot'
```

I've also updated most of the CDCSDK tests to now use the ReplicationSlot commands to create a CDCSDK stream instead of an RPC. Remaining tests will be updated in future diffs

Reviewers: dsrinivasan, skumar, asrinivasan, aagrawal

Reviewed By: dsrinivasan

Subscribers: ycdcxcluster, bogdan, ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29194
@timothy-e
Copy link
Contributor Author

Closed by 7e22b28

@timothy-e timothy-e reopened this Nov 2, 2023
@timothy-e
Copy link
Contributor Author

Reopening to address on 2.14, 2.16, 2.18, 2.20

@timothy-e timothy-e assigned timothy-e and unassigned dr0pdb Nov 2, 2023
@timothy-e timothy-e changed the title [YSQL] Deadlock on duplicate replication slot names [YSQL] YB does not release LWLocks on AbortTransaction Nov 2, 2023
@yugabyte-ci yugabyte-ci assigned dr0pdb and unassigned timothy-e Nov 7, 2023
timothy-e pushed a commit that referenced this issue Nov 7, 2023
…tTransaction

Summary:
Original commit: 7e22b28 / D29194

This revision backports the fix for releasing LWLocks on AbortTransaction. It is imported from the replication slot support revision:
https://phorge.dev.yugabyte.com/D29194.

The issue was introduced in 3f536a8 / D5518
Jira: DB-8305

Test Plan: ```./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendErrorAfterLWLockAcquire```

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30022
dr0pdb added a commit that referenced this issue Nov 8, 2023
…tTransaction

Summary:
Original commit: 7e22b28 / D29194

This revision backports the fix for releasing LWLocks on AbortTransaction. It is imported from the replication slot support revision: https://phorge.dev.yugabyte.com/D29194.

The issue was introduced in 3f536a8 / D5518
Jira: DB-8305

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendErrorAfterLWLockAcquire
```

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29997
dr0pdb added a commit that referenced this issue Nov 10, 2023
…tTransaction

Summary:
Original commit: 7e22b28 / D29194

This revision backports the fix for releasing LWLocks on AbortTransaction. It is imported from the replication slot support revision: https://phorge.dev.yugabyte.com/D29194.

The issue was introduced in 3f536a8 / D5518
Jira: DB-8305

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendErrorAfterLWLockAcquire
```

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29995
dr0pdb added a commit that referenced this issue Nov 10, 2023
…tTransaction

Summary:
Original commit: 7e22b28 / D29194

This revision backports the fix for releasing LWLocks on AbortTransaction. It is imported from the replication slot support revision: https://phorge.dev.yugabyte.com/D29194.

The issue was introduced in 3f536a8 / D5518
Jira: DB-8305

Test Plan:
```
./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendErrorAfterLWLockAcquire
```

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29996
jharveysmith pushed a commit that referenced this issue May 24, 2024
…tTransaction

Summary:
Original commit: e6c21c81ebaff62465fb48c665846795d034d1b1 / D29194

This revision backports the fix for releasing LWLocks on AbortTransaction. It is imported from the replication slot support revision:
https://phorge.dev.yugabyte.com/D29194.

The issue was introduced in 3b3f38c / D5518
Jira: DB-8305

Test Plan: ```./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.TestLWPgBackendErrorAfterLWLockAcquire```

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D30022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants