[#24021] YSQL: Add --TEST_check_catalog_version_overflow
Summary:
The bug appeared in a recent integration test run with the following symptom. In
./Universe_logs/172.151.31.81/tserver/yb-tserver.ip-172-151-31-81.us-west-2.compute.internal.yugabyte.log.INFO.20240918-192635.1116559:
```
W0918 19:33:33.059890 1123973 tablet_rpc.cc:497] Query error (yb/tserver/service_util.h:330): Failed Read(tablet: 00000000000000000000000000000000, num_ops: 1, num_attempts: 2, txn: 00000000-0000-0000-0000-000000000000, subtxn: [none]) to tablet 00000000000000000000000000000000 on tablet server { uuid: b7b95ea542c642998d053ebba298a46a private: [host: "172.151.31.81" port: 7100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2a" after 2 attempt(s): The catalog snapshot used for this transaction has been invalidated: expected: 18446744073709551615, got: 131: MISMATCHED_SCHEMA (tablet server error 5)
```
Note that the expected last breaking catalog version is 18446744073709551615 (-1 when interpreted as int64), which is unreasonably big.

The version check is done by the tserver. The expected last breaking catalog version comes from the map `ysql_db_catalog_version_map_`, using `db_oid` as the key. The map gets its values from the tserver-master heartbeat response, which carries the contents of the table `pg_yb_catalog_version`. The new contents of `pg_yb_catalog_version` are merged with the existing `ysql_db_catalog_version_map_`: an entry is inserted or updated only when the new version is greater than the existing value.

I added a new gflag `--TEST_check_catalog_version_overflow` that, when set to true, crashes the tserver if the new version read from the heartbeat response is unreasonably big (i.e., becomes negative when cast to int64_t).

Similar debugging logic is added on the master side as well. When yb-master reads the contents of `pg_yb_catalog_version` to prepare the heartbeat response, it crashes the master process if a version read from the table is unreasonably big.
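For illustration, the merge rule and the overflow check described above can be sketched roughly as follows. This is a simplified standalone sketch, not the code in `tablet_server.cc`: the names `DbCatalogVersion`, `CatalogVersion`, `IsVersionOverflow`, and `MergeCatalogVersions` are hypothetical, an exception stands in for the process crash, the `check_overflow` parameter stands in for the gflag, and the sketch assumes that the "greater than" comparison is made on `current_version` and that both version fields are checked.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one pg_yb_catalog_version row as carried in the
// heartbeat response.
struct DbCatalogVersion {
  uint32_t db_oid;
  uint64_t current_version;
  uint64_t last_breaking_version;
};

// Hypothetical stand-in for a value stored in ysql_db_catalog_version_map_.
struct CatalogVersion {
  uint64_t current_version;
  uint64_t last_breaking_version;
};

// A version counts as "unreasonably big" when it becomes negative after a
// cast to int64_t, i.e. when its top bit is set. (The wrap-around result of
// this cast is guaranteed since C++20 and is what mainstream compilers do
// for earlier standards.)
inline bool IsVersionOverflow(uint64_t version) {
  return static_cast<int64_t>(version) < 0;
}

// Merges heartbeat rows into the map. An entry is inserted if absent and
// updated only when the incoming current_version is greater (assumption).
// When check_overflow is true, an overflowed version throws; the real flag
// crashes the process instead.
void MergeCatalogVersions(
    const std::vector<DbCatalogVersion>& heartbeat_rows,
    bool check_overflow,
    std::unordered_map<uint32_t, CatalogVersion>* versions) {
  assert(versions != nullptr);
  for (const auto& row : heartbeat_rows) {
    if (check_overflow && (IsVersionOverflow(row.current_version) ||
                           IsVersionOverflow(row.last_breaking_version))) {
      throw std::runtime_error("catalog version overflow, db_oid: " +
                               std::to_string(row.db_oid));
    }
    auto it = versions->find(row.db_oid);
    if (it == versions->end()) {
      versions->emplace(row.db_oid, CatalogVersion{row.current_version,
                                                   row.last_breaking_version});
    } else if (row.current_version > it->second.current_version) {
      it->second =
          CatalogVersion{row.current_version, row.last_breaking_version};
    }
  }
}
```

Under this sketch, a value of 18446744073709551615 that enters the map without the check is never replaced, because no later version can be greater. That is one reason to fail fast at the point where the value is first read.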
Also added `GUARDED_BY(lock_)` and `EXCLUDES(lock_)` annotations to a few relevant functions.

The `--TEST_check_catalog_version_overflow` gflag is expected to be enabled in the integration test that showed the bug. If the bug reproduces, we may get a better clue about where the number 18446744073709551615 comes from.

Jira: DB-12909

Test Plan:
Manual test.

(1) Create a local cluster and start it with the new test gflag set:
```
./bin/yb-ctl create --rf 1 --tserver_flags TEST_check_catalog_version_overflow=true --master_flags TEST_check_catalog_version_overflow=true
```
(2) Run the following commands:
```
yugabyte=# select * from pg_yb_catalog_version;
 db_oid | current_version | last_breaking_version
--------+-----------------+-----------------------
      1 |               1 |                     1
  13254 |               1 |                     1
  13255 |               1 |                     1
  13257 |               1 |                     1
  13258 |               1 |                     1
(5 rows)

yugabyte=# SET yb_non_ddl_txn_for_sys_tables_allowed=1;
SET
yugabyte=# update pg_yb_catalog_version set current_version = -1, last_breaking_version = -1 where db_oid = 13257;
UPDATE 1
yugabyte=# \q
```
Looking in the yb-master log directory, I saw a FATAL:
```
F0919 20:44:08.961647 2712654 sys_catalog.cc:1063] Check failed: static_cast<int64_t>(current_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257
```
(3) Repeat the above test with the master-side check disabled by changing it to:
```
+ if (FLAGS_TEST_check_catalog_version_overflow && false) {
```
so that the tserver FATAL shows up instead:
```
F0919 20:52:29.832093 2715720 tablet_server.cc:968] Check failed: static_cast<int64_t>(new_version) >= 0 (-1 vs. 0) 18446744073709551615 db_oid: 13257 db_catalog_version_data: db_catalog_versions { db_oid: 1 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13254 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13255 current_version: 1 last_breaking_version: 1 } db_catalog_versions { db_oid: 13257 current_version: 18446744073709551615 last_breaking_version: 18446744073709551615 } db_catalog_versions { db_oid: 13258 current_version: 1 last_breaking_version: 1 }
```

Reviewers: fizaa
Reviewed By: fizaa
Subscribers: ybase, yql
Differential Revision: https://phorge.dev.yugabyte.com/D38240