[YSQL] Unexpected Catalog Snapshot Invalidation (18446744073709551615) During 'Wait on Conflict' G-Flag Toggle Stress Test – Possible int to uint Casting (-1) #24021

Open
1 task done
shishir2001-yb opened this issue Sep 19, 2024 · 0 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation QA QA filed bugs status/awaiting-triage Issue awaiting triage

shishir2001-yb commented Sep 19, 2024

Jira Link: DB-12909

Description

Version: 2.23.1.0-b41
Logs: Added in Jira

We encountered this issue again during a Wait on Conflict G-flag toggle on/off stress test.

2024-09-18 19:33:36,518 [Thread-3] ERROR SqlReadCommitted - Error occurred during Write INSERT! 
com.yugabyte.util.PSQLException: ERROR: The catalog snapshot used for this transaction has been invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA
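
When triaging workload logs, it can help to pull the two version numbers out of this message and flag values that only fit in an unsigned 64-bit integer. The following is a minimal sketch, not part of the test framework; the function names and the regular expression are assumptions based only on the message text shown above.

```python
import re

# Matches the "expected: X, got: Y" fragment of the error message above.
_MISMATCH = re.compile(r"expected: (\d+), got: (\d+)")

INT64_MAX = 2**63 - 1


def parse_mismatch(message):
    """Return (expected, got) from a catalog-version mismatch message, or None."""
    m = _MISMATCH.search(message)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def looks_wrapped(version):
    """True if the value is too large for a signed 64-bit integer."""
    return version > INT64_MAX


msg = ("ERROR: The catalog snapshot used for this transaction has been "
       "invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA")
expected, got = parse_mismatch(msg)
print(expected, got, looks_wrapped(expected), looks_wrapped(got))
```

For the message in this report, `expected` is flagged as wrapped and `got` is not.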

Steps to repro:

        1. Create a cluster with the required g-flags
        2. Create 2 databases (1 colocated and 1 non-colocated)
        3. Start the SqlBankWaitOnConflict workload on both databases with the RC and RR
            isolation levels
        4. Start the SQL_READ_COMMITTED workload on both databases
        5. Start a loop and run it for 4 hours
            a. Disable the wait_queues g-flag if it is enabled, or enable it if it is disabled
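
Step 5 can be sketched roughly as follows. This is only an illustration of the toggle loop: `get_flag` and `set_flag` are hypothetical callables standing in for whatever the stress framework uses to read and set the g-flag, and the 10-minute interval is an assumption (the interval is not stated in this report).

```python
import time


def toggle_loop(get_flag, set_flag, duration_s, interval_s,
                clock=time.monotonic, sleep=time.sleep):
    """Flip a boolean flag every interval_s seconds until duration_s elapses.

    get_flag/set_flag are hypothetical hooks supplied by the test harness.
    Returns the number of toggles performed.
    """
    deadline = clock() + duration_s
    toggles = 0
    while clock() < deadline:
        set_flag(not get_flag())  # disable if enabled, or vice versa
        toggles += 1
        sleep(interval_s)
    return toggles


# Dry run with a fake clock so the example finishes instantly.
state = {"enabled": True, "now": 0.0}
n = toggle_loop(
    get_flag=lambda: state["enabled"],
    set_flag=lambda value: state.update(enabled=value),
    duration_s=4 * 3600,   # 4 hours, as in step 5
    interval_s=600,        # assumed interval, not from the report
    clock=lambda: state["now"],
    sleep=lambda s: state.update(now=state["now"] + s),
)
print(n)
```

Injecting `clock` and `sleep` keeps the loop testable without waiting four hours.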

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shishir2001-yb shishir2001-yb added area/ysql Yugabyte SQL (YSQL) QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_automation Bugs identified via itest-system, LST, Stress automation or causing automation failures qa_stress Bugs identified via Stress automation labels Sep 19, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Sep 19, 2024
myang2021 added a commit that referenced this issue Sep 20, 2024
Summary:
The bug appeared in a recent integration test run and had the following symptom:

In ./Universe_logs/172.151.31.81/tserver/yb-tserver.ip-172-151-31-81.us-west-2.compute.internal.yugabyte.log.INFO.20240918-192635.1116559

```
W0918 19:33:33.059890 1123973 tablet_rpc.cc:497] Query error (yb/tserver/service_util.h:330): Failed Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 00000000000000000000000000000000 on tablet server { uuid: b7b95ea542c642998d053ebba298a46a private: [host: "172.151.31.81" port: 7100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" after 2 attempt(s): The catalog snapshot used for this transaction has been invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA (tablet server error 5)
```

Note that the last breaking catalog version is 18446744073709551615 (-1 when
interpreted as int64), which is unreasonably big. The version check is done by
the tserver. The expected last breaking catalog version comes from the map
`ysql_db_catalog_version_map_`, keyed by `db_oid`. The map gets its values from
the tserver-master heartbeat response, which carries the contents of the table
`pg_yb_catalog_version`. The new contents of `pg_yb_catalog_version` are merged
with the existing `ysql_db_catalog_version_map_`, and we only insert/update a
map entry when the new version is greater than the existing value.

I added a new gflag `--TEST_check_catalog_version_overflow`. When set to true,
it crashes the tserver if the new version read from the heartbeat response is
unreasonably big (i.e., becomes negative when cast to int64_t).
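
To illustrate why a single bogus value matters under this merge rule, here is a simplified Python model. It is not the tserver code: the function name, the tuple layout, and the choice to compare on the current version are assumptions. The `check_overflow` parameter loosely mirrors the new test gflag by raising instead of crashing the process.

```python
INT64_MAX = 2**63 - 1


def merge_versions(version_map, heartbeat_rows, check_overflow=False):
    """Merge (db_oid, current, last_breaking) rows into version_map in place.

    An entry is inserted or updated only when the incoming current version is
    greater than the stored one (a simplified reading of the rule above).
    With check_overflow=True, a value that would be negative as a signed
    64-bit integer raises instead of being merged.
    """
    for db_oid, current, last_breaking in heartbeat_rows:
        if check_overflow and (current > INT64_MAX or last_breaking > INT64_MAX):
            raise OverflowError(
                f"suspicious version {current} for db_oid {db_oid}")
        existing = version_map.get(db_oid)
        if existing is None or current > existing[0]:
            version_map[db_oid] = (current, last_breaking)
    return version_map


versions = {13257: (131, 131)}
bad = 2**64 - 1  # 18446744073709551615
merge_versions(versions, [(13257, bad, bad)])   # accepted: bad > 131
merge_versions(versions, [(13257, 132, 132)])   # ignored: 132 < bad
print(versions[13257] == (bad, bad))
```

In this model, once the maximum unsigned value is merged, no later legitimate version can replace it, which is consistent with the persistent mismatch seen in the log.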

Similar debugging logic is added to the master side as well. When the contents
of `pg_yb_catalog_version` are read by yb-master to prepare the heartbeat
response, if the version read from the table `pg_yb_catalog_version` is
unreasonably big, we crash the master process.

Also added `GUARDED_BY(lock_)` and `EXCLUDES(lock_)` to a few relevant functions.

This `--TEST_check_catalog_version_overflow` gflag is expected to be enabled in
the integration test that showed the bug. If the bug reproduces, we may get a
better clue about where the number 18446744073709551615 comes from.
Jira: DB-12909

Test Plan:
Manual test
(1) create a local cluster and start the cluster with the new test gflag set:

```
./bin/yb-ctl create --rf 1 --tserver_flags TEST_check_catalog_version_overflow=true --master_flags TEST_check_catalog_version_overflow=true
```

(2) run the following commands:
```
yugabyte=# select * from pg_yb_catalog_version;
 db_oid | current_version | last_breaking_version
--------+-----------------+-----------------------
      1 |               1 |                     1
  13254 |               1 |                     1
  13255 |               1 |                     1
  13257 |               1 |                     1
  13258 |               1 |                     1
(5 rows)
yugabyte=# SET yb_non_ddl_txn_for_sys_tables_allowed=1;
SET
yugabyte=# update pg_yb_catalog_version set current_version = -1, last_breaking_version = -1 where db_oid = 13257;
UPDATE 1
yugabyte=# \q
```
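
The `-1` written by the UPDATE above shows up as 18446744073709551615 once the same 8 bytes are read as an unsigned integer. A small Python sketch of that reinterpretation, assuming the columns are stored as signed 64-bit integers (the helper names are illustrative only):

```python
import struct


def int64_as_uint64(value):
    """Reinterpret the 8 bytes of a signed 64-bit integer as unsigned."""
    return struct.unpack("<Q", struct.pack("<q", value))[0]


def uint64_as_int64(value):
    """Inverse: reinterpret an unsigned 64-bit integer as signed."""
    return struct.unpack("<q", struct.pack("<Q", value))[0]


print(int64_as_uint64(-1))                    # 18446744073709551615
print(uint64_as_int64(18446744073709551615))  # -1
```

Non-negative values such as 131 are unchanged by either conversion, so only a negative stored value would produce this symptom in this model.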

Looked in the yb-master log directory and saw a FATAL:

```
F0919 20:44:08.961647 2712654 sys_catalog.cc:1063] Check failed: static_cast<int64_t>(current_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257
```

(3) Repeat the above test with the master side changed as:
```
+      if (FLAGS_TEST_check_catalog_version_overflow && false) {
```

so that we can see the tserver FATAL:

```
F0919 20:52:29.832093 2715720 tablet_server.cc:968] Check failed: static_cast<int64_t>(new_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257 db_catalog_version_data: db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13254 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13255 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13257 current_version: 18446744073709551615 last_breaking_version: 18446744073709551615 } db_catalog_versions { db_oid: 13258 current_version: 1 last_breaking_version: 1 }
```
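
If the flag fires in the integration test, the `db_catalog_version_data` dump in the FATAL line identifies the offending database. A hedged sketch of extracting it with a regular expression follows; the pattern is derived only from the log line above, and the function name is an assumption.

```python
import re

# One "db_catalog_versions { ... }" entry as printed in the FATAL line above.
_ENTRY = re.compile(
    r"db_catalog_versions \{ db_oid: (\d+) current_version: (\d+) "
    r"last_breaking_version: (\d+) \}"
)


def suspicious_entries(log_line):
    """Return (db_oid, current, last_breaking) tuples with a version >= 2**63."""
    rows = [tuple(int(g) for g in m.groups())
            for m in _ENTRY.finditer(log_line)]
    return [row for row in rows if row[1] >= 2**63 or row[2] >= 2**63]


# Shortened copy of the dump from the tserver FATAL above.
line = ("db_catalog_version_data: "
        "db_catalog_versions { db_oid: 1 current_version: 1 "
        "last_breaking_version: 1 } "
        "db_catalog_versions { db_oid: 13257 "
        "current_version: 18446744073709551615 "
        "last_breaking_version: 18446744073709551615 }")
print(suspicious_entries(line))
```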

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38240
myang2021 added a commit that referenced this issue Sep 21, 2024
…flow

Original commit: bb93ebe / D38240

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38282
myang2021 added a commit that referenced this issue Sep 21, 2024
…flow

Original commit: bb93ebe / D38240

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38284
foucher pushed a commit that referenced this issue Sep 24, 2024
Summary:
 5d3e83e [PLAT-15199] Change TP API URLs according to latest refactoring
 a50a730 [doc][yba] YBDB compatibility (#23984)
 0c84dbe [#24029] Update the callhome diagnostics  not to send gflags details.
 b53ed3a [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
 f0eab8f [PLAT-15278]: Fix DB Scoped XCluster replication restart
 344bc76 Revert "[PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule"
 3628ba7 [PLAT-14459] Swagger fix
 bb93ebe [#24021] YSQL: Add --TEST_check_catalog_version_overflow
 9ab7806 [#23927] docdb: Add gflag for minimum thread stack size
 Excluded: 8c8adc0 [#18822] YSQL: Gate update optimizations behind preview flag
 5e86515 [#23768] YSQL: Fix table rewrite DDL before slot creation
 123d496 [PLAT-14682] Universe task should only unlock itself and make unlock aware of the lock config
 de9d4ad [doc][yba] CIS hardened OS support (#23789)
 e131b20 [#23998] DocDB: Update usearch and other header-only third-party dependencies
 1665662 Automatic commit by thirdparty_tool: update usearch to commit 240fe9c298100f9e37a2d7377b1595be6ba1f412.
 3adbdae Automatic commit by thirdparty_tool: update fp16 to commit 98b0a46bce017382a6351a19577ec43a715b6835.
 9a819f7 Automatic commit by thirdparty_tool: update hnswlib to commit 2142dc6f4dd08e64ab727a7bbd93be7f732e80b0.
 2dc58f4 Automatic commit by thirdparty_tool: update simsimd to tag v5.1.0.
 9a03432 [doc][ybm] Azure private link host (#24086)
 039c9a2 [#17378] YSQL: Testing for histogram_bounds in pg_stats
 09f7a0f [#24085] DocDB: Refactor HNSW wrappers
 555af7d [#24000] DocDB: Shutting down shared exchange could cause TServer to hang
 5743a03 [PLAT-15317]Alert emails are not in the correct format.
 8642555 [PLAT-15379][Fix PLAT-12510] Option to use UTC when dealing with cron exp. in backup schedule
 253ab07 [PLAT-15400][PLAT-15401][PLAT-13051] - Connection pooling ui issues and other ui issues
 57576ae [#16487] YSQL: Fix flakey TestPostgresPid test
 bc8ae45 Update ports for CIS hardened (#24098)
 6fa33e6 [#18152, #18729] Docdb: Fix test TestPgIndexSelectiveUpdate
 cc6d2d1 [docs] added and updated cves (#24046)
 Excluded: ed153dc [#24055] YSQL: fix pg_hint_plan regression with executing prepared statement

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, jenkins-bot

Differential Revision: https://phorge.dev.yugabyte.com/D38322