[DocDB][DB Clone][YSQL] Colocated DB Cloning fails to the time before table creation #21625

Closed
1 task done
Arjun-yb opened this issue Mar 21, 2024 · 0 comments
Labels
2.23_blocker area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue


Arjun-yb commented Mar 21, 2024

Jira Link: DB-10520

Description

Version: 2.23.0.0-b36
Steps:

  1. Create a colocated database (demo1) and create a snapshot schedule
  2. Create a table and load some data
  3. Record the current time (t1)
  4. Create one more table
  5. Clone demo1 to demo2 as of time t1

Observation:
The DB clone command succeeds and returns a namespace ID and sequence number, but the is_clone_done command returns false and no cloned DB is created.

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@Arjun-yb Arjun-yb added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Mar 21, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Mar 21, 2024
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Mar 26, 2024
@yugabyte-ci yugabyte-ci closed this as not planned Won't fix, can't repro, duplicate, stale Aug 15, 2024
yamen-haddad added a commit that referenced this issue Sep 17, 2024
…lone

Summary:
As part of the clone workflow, we repartition all the tables of the target database that was created by executing the dump script. This means removing the old tablets and creating new ones during the import snapshot phase. However, we saw cases where the old tablets are cached in the meta-cache of the tserver that executed the schema creation script. Other tservers can also hold these stale meta-cache entries. For example, as part of executing `CREATE INDEX`, we send `BACKFILL INDEX` queries to the tservers that host the leaders of the base table's tablets, which populates their caches with the old tablets. The stale meta-cache entries are later used to execute queries that arrive at those tservers. Because the stale tablets are deleted in the import snapshot phase, this leads to the following error:
```
d3=# select count(*) from t2 where age<18;
ERROR:  LookupByIdRpc(tablet: 89b4445772d2415aa1702a77031b7d74, num_attempts: 2) failed: Tablet deleted: Not serving tablet deleted upon request at 2024-08-01 15:39:31 UTC
```
Note that this issue occurs only for the first query executed on a tserver with a stale meta-cache. Retrying the same query works, because by then the meta-cache has invalidated the stale entry. We saw this issue only in colocated databases that have an index, because executing the `CREATE INDEX` command requests the TableLocations of the parent colocated tablet.
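
The fail-once-then-succeed behavior described above can be illustrated with a toy model. This is only a sketch in Python, not YugabyteDB code: the `Cluster` and `TServer` classes, their methods, and the tablet names are all hypothetical, and the real meta-cache is considerably more involved.

```python
class TabletDeleted(Exception):
    """Raised when a query is routed to a tablet that no longer exists."""


class Cluster:
    """Hypothetical source of truth for table -> tablet placement."""

    def __init__(self):
        self.table_to_tablet = {}
        self.live_tablets = set()

    def create_table(self, table, tablet):
        self.table_to_tablet[table] = tablet
        self.live_tablets.add(tablet)

    def repartition(self, table, new_tablet):
        # Models the import snapshot phase: drop the old tablet, create a new one.
        self.live_tablets.discard(self.table_to_tablet[table])
        self.create_table(table, new_tablet)


class TServer:
    """Hypothetical tserver with a per-table meta-cache."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.meta_cache = {}

    def lookup(self, table):
        # Populate the cache on a miss; otherwise trust the cached entry.
        if table not in self.meta_cache:
            self.meta_cache[table] = self.cluster.table_to_tablet[table]
        return self.meta_cache[table]

    def query(self, table):
        tablet = self.lookup(table)
        if tablet not in self.cluster.live_tablets:
            # The failed lookup invalidates the stale entry, so a retry succeeds.
            del self.meta_cache[table]
            raise TabletDeleted(tablet)
        return f"rows from {tablet}"


cluster = Cluster()
cluster.create_table("t2", "old-tablet")
ts = TServer(cluster)
ts.lookup("t2")                      # cache populated, e.g. during index backfill
cluster.repartition("t2", "new-tablet")

try:
    first_attempt = ts.query("t2")
except TabletDeleted:
    first_attempt = "error"
second_attempt = ts.query("t2")      # stale entry is gone, so this is re-resolved
```

In this model the first query fails and the retry succeeds, which matches the observation that only the first query on a tserver with a stale entry is affected.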

The diff fixes the problem by introducing a new tserver RPC, `ClearMetaCacheEntriesForNamespace`, which clears all the meta-cache entries (tables and tablets) related to the clone database. This RPC is sent to all tservers as part of the clone workflow. More specifically, the meta-cache is cleared at the final step of the clone, i.e. after successfully restoring the snapshot on the clone database but before enabling user connections to it. User connections to the clone database are enabled only after the stale meta-cache entries of all tservers have been successfully cleared.
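
The ordering described above (restore, then clear on every tserver, then enable connections) can be sketched as follows. This is an illustrative Python model rather than the actual implementation: `FakeTServer`, `finish_clone`, the callbacks, and the namespace labels (including the `"system"` label attached to tablet `0000000000`) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeTServer:
    """Hypothetical tserver; meta_cache maps tablet id -> owning namespace id."""
    name: str
    meta_cache: Dict[str, str] = field(default_factory=dict)

    def clear_meta_cache_entries_for_namespace(self, namespace_id: str) -> bool:
        # Drop only the entries that belong to the given namespace.
        self.meta_cache = {
            tablet: ns for tablet, ns in self.meta_cache.items() if ns != namespace_id
        }
        return True


def finish_clone(
    tservers: List[FakeTServer],
    namespace_id: str,
    restore_snapshot: Callable[[str], bool],
    enable_connections: Callable[[str], None],
) -> bool:
    """Final clone steps: restore, clear caches everywhere, then allow connections."""
    if not restore_snapshot(namespace_id):
        return False
    for ts in tservers:
        if not ts.clear_meta_cache_entries_for_namespace(namespace_id):
            return False  # never enable connections while stale entries may remain
    enable_connections(namespace_id)
    return True


events: List[str] = []


def fake_restore(namespace_id: str) -> bool:
    events.append("restore")
    return True


def fake_enable(namespace_id: str) -> None:
    events.append("enable")


tservers = [
    FakeTServer("ts1", {"stale-tablet": "clone-db", "0000000000": "system"}),
    FakeTServer("ts2", {"other-tablet": "other-db"}),
]
ok = finish_clone(tservers, "clone-db", fake_restore, fake_enable)
```

After the call, only the entry owned by `clone-db` has been dropped; entries for other namespaces are left in place, and connections are enabled last.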

**Upgrade/Rollback safety**
The diff adds a new RPC `ClearMetacache` that is currently used only in the instant database cloning workflow. The clone feature is protected by the preview flag `enable_db_clone`.

Jira: DB-10520, DB-10522

Test Plan:
./yb_build.sh fastdebug --cxx-test integration-tests_minicluster-snapshot-test --gtest_filter Colocation/PgCloneTestWithColocatedDBParam.CloneAfterDropIndex/1

Also tested manually that ClearMetacache clears only the entries that belong to one specific database, using the endpoint `:9000/api/v1/meta-cache`, which shows the set of tablets in the meta-cache. I checked that the tablet `0000000000` is not cleared after executing the RPC, as intended.
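
A manual check like the one above could be scripted roughly as follows. This is a hedged sketch: the helper names are hypothetical, the response format of the endpoint is not described in this message, so the code deliberately treats the body as opaque text and only looks for tablet IDs as substrings (which can over-match short IDs).

```python
from typing import List
from urllib.request import urlopen


def fetch_meta_cache(host: str, port: int = 9000, timeout: float = 5.0) -> str:
    """Return the raw body of the tserver meta-cache endpoint (format not assumed)."""
    with urlopen(f"http://{host}:{port}/api/v1/meta-cache", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def tablets_still_cached(body: str, tablet_ids: List[str]) -> List[str]:
    """Return the tablet IDs that still appear somewhere in the endpoint output."""
    return [tablet_id for tablet_id in tablet_ids if tablet_id in body]


# Offline example with a made-up body; a real check would call fetch_meta_cache().
sample_body = "tablets: 0000000000, aaaa1111"
remaining = tablets_still_cached(
    sample_body, ["89b4445772d2415aa1702a77031b7d74", "0000000000"]
)
```

Running `tablets_still_cached(fetch_meta_cache(host), ids)` before and after the RPC on each tserver would show which of the listed tablets were dropped from the cache.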

Reviewers: asrivastava, mlillibridge

Reviewed By: asrivastava

Subscribers: yguan, ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D37353
jasonyb pushed a commit that referenced this issue Sep 19, 2024
Summary:
 84f3fab [PLAT-15322] Make sure build files have fresh last_modified date to make sure Play Framework assets caching works as expected
 c7af74d [PLAT-15288] Use set_dbs endpoint when editing table selection for db scoped DR configs
 8faeca6 [PLAT-15300] Update task progress poller logic
 e4f5943 [doc][ybm] VictoriaMetrics (#23819)
 70aa7d7 [doc][ybm] Tablet peer alert (#23942)
 2f70696 [doc] Smart driver clarification (#23933)
 1525ced [docs] fix for a yb version not rendering (#23944)
 e9f3ec2 [#23843] YSQL: Fix flaky test testSchemaMismatchRetry in TestPgBatch
 89b69cf [#23943]: YSQL: Fix Bitmap Scan crash in fastdebug GCC11
 27446e2 [DOC-470] Include SSL Connectivity within the source database tabs. (#23878)
 def0fac [#23879] docdb: Improve rpc metrics test.
 87a936a [PLAT-15353] Consistency checks testing hooks
 0c41023 [#23956] YSQL: Fix org.yb.pgsql.TestYsqlMetrics#testMetricRows
 388e045 [#21625,#21627] Docdb: Clear stale meta-cache entries at the end of clone
 Excluded: 5523770 [#23547] YSQL: fix pg_hint_plan crash with pg_hint_plan.enable_hint_table enabled
 5951e18 [#23881] docdb: Update the hint to advisory_locks
 10b5009 Minorfixes (#23986)
 3d33b3e [PLAt-15133][PLAT-15332] Fix the preflight check for disk mount
 9d8366b [PLAT-15345] Set  755 permissions on node-agent service file
 d298d44 [#22135] YSQL: Avoid read restart errors with ANALYZE
 240e8f0 [PLAT-15355] Fix Node Addition Precheck logic to work correctly in case provider_id is missing
 2a40433 [PLAT-15262]Add more checks for non-namespace scope supported universes
 f957dda [Docs] Minor fixes to docs pages around transactions (#23926)
 e4a8548 [PLAT-15326] Proper error handling in DDL atomicity check
 10a629e [PLAT-13998][PLAT-15215]Support Image bundle creation and updation in provider requests
 5f95ff9 [#23905] DocDB: Persistence for Master side Table/Object locks
 d4103e8 [#23513]  YSQL: Fix broken org.yb.pgsql.TestYsqlMetrics#testExplainMaxMemory unit test

Merge:
- yb_pg_dbms_alert_session_A.out:
  - "advisory locks are not yet implemented": YB master
    5951e18 changes the hint message
    for test queries added by YB pg15.  Update the hint message.
- yb_pg_dbms_alert_session_B.out: (same)
- yb_pg_dbms_alert_session_C.out: (same)
- yb_pg_dbms_pipe_session_A.out: (same)
- yb_pg_dbms_pipe_session_B.out: (same)

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher, qhu

Reviewed By: qhu

Subscribers: qhu

Differential Revision: https://phorge.dev.yugabyte.com/D38163