Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: OnlineDDL migrations fail when keyspace has thousands of tables #17059

Open
mattlord opened this issue Oct 23, 2024 · 2 comments · May be fixed by #17066
Open

Bug Report: OnlineDDL migrations fail when keyspace has thousands of tables #17059

mattlord opened this issue Oct 23, 2024 · 2 comments · May be fixed by #17066
Assignees
Labels
Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Type: Bug

Comments

@mattlord
Copy link
Contributor

Overview of the Issue

On keyspaces that have thousands of tables, OnlineDDL migrations are unable to complete.

Reproduction Steps

./101_initial_cluster.sh

for i in {1..3000}; do
  mysql commerce -e "create table test${i} (id int not null primary key, c1 varchar(100))"
done

vtctldclient ApplySchema --ddl-strategy "vitess" --sql "ALTER TABLE test1 ADD age int" commerce

sleep 60

mysql -e "show vitess_migrations\G"

vtctldclient GetWorkflows commerce --include-logs=false --compact

mysql commerce/0@primary -e "select * from _vt.vreplication_log\G"

With the final output being this, which shows that the VReplication workflow itself seems to be fine, but the OnlineDDL migration is stuck:

edb729f8_915b_11ef_85c1_56bfd605e62f
*************************** 1. row ***************************
                             id: 1
                 migration_uuid: edb729f8_915b_11ef_85c1_56bfd605e62f
                       keyspace: commerce
                          shard: 0
                   mysql_schema: vt_commerce
                    mysql_table: test1
            migration_statement: alter table test1 add column age int
                       strategy: vitess
                        options:
                added_timestamp: 2024-10-23 12:29:08
            requested_timestamp: 2024-10-23 12:29:09
                ready_timestamp: NULL
              started_timestamp: 2024-10-23 12:29:10
             liveness_timestamp: 2024-10-23 12:29:43
            completed_timestamp: NULL
              cleanup_timestamp: NULL
               migration_status: running
                       log_path:
                      artifacts: _vt_vrp_edb729f8915b11ef85c156bfd605e62f_20241023122909_,
                        retries: 0
                         tablet: zone1-0000000101
                 tablet_failure: 0
                       progress: 100
              migration_context: vtctl:edb47546-915b-11ef-85c1-56bfd605e62f
                     ddl_action: alter
                        message: rpc error: code = DeadlineExceeded desc = context deadline exceeded
              message_timestamp: 2024-10-23 12:29:53.066772
                    eta_seconds: 0
                    rows_copied: 0
                     table_rows: 0
              added_unique_keys: 0
            removed_unique_keys: 0
                       log_file:
       retain_artifacts_seconds: 86400
            postpone_completion: 0
       removed_unique_key_names:
dropped_no_default_column_names:
          expanded_column_names:
               revertible_notes:
               allow_concurrent: 0
                  reverted_uuid:
                        is_view: 0
              ready_to_complete: 1
      vitess_liveness_indicator: 1729700982
            user_throttle_ratio: 0
                   special_plan:
       last_throttled_timestamp: NULL
            component_throttled:
               reason_throttled:
            cancelled_timestamp: NULL
                postpone_launch: 0
                          stage: waiting for post-sentry pos: MySQL56/aa88c6a0-915b-11ef-b535-fa5bd92f5f51:1-6073
               cutover_attempts: 1
         is_immediate_operation: 0
             reviewed_timestamp: 2024-10-23 12:29:10
    ready_to_complete_timestamp: 2024-10-23 12:29:43
      removed_foreign_key_names:
 last_cutover_attempt_timestamp: 2024-10-23 12:29:43
                  force_cutover: 0


{
  "workflows": [
    {
      "name": "edb729f8_915b_11ef_85c1_56bfd605e62f",
      "source": {
        "keyspace": "commerce",
        "shards": [
          "0"
        ]
      },
      "target": {
        "keyspace": "commerce",
        "shards": [
          "0"
        ]
      },
      "max_v_replication_lag": "1",
      "shard_streams": {
        "0/zone1-0000000101": {
          "streams": [
            {
              "id": "1",
              "shard": "0",
              "tablet": {
                "cell": "zone1",
                "uid": 101
              },
              "binlog_source": {
                "keyspace": "commerce",
                "shard": "0",
                "filter": {
                  "rules": [
                    {
                      "match": "_vt_vrp_edb729f8915b11ef85c156bfd605e62f_20241023122909_",
                      "filter": "select `id` as `id`, `c1` as `c1` from `test1`",
                      "source_unique_key_columns": "id",
                      "target_unique_key_columns": "id",
                      "source_unique_key_target_columns": "id",
                      "force_unique_key": "PRIMARY"
                    }
                  ]
                }
              },
              "position": "aa88c6a0-915b-11ef-b535-fa5bd92f5f51:1-6098",
              "state": "Running",
              "db_name": "vt_commerce",
              "transaction_timestamp": {
                "seconds": "1729701007"
              },
              "time_updated": {
                "seconds": "1729701008"
              },
              "tags": [
                ""
              ],
              "throttler_status": {
                "time_throttled": {}
              },
              "tablet_types": [
                "REPLICA",
                "PRIMARY"
              ],
              "tablet_selection_preference": "INORDER",
              "cells": [
                ""
              ]
            }
          ],
          "is_primary_serving": true
        }
      },
      "workflow_type": "OnlineDDL",
      "workflow_sub_type": "None",
      "max_v_replication_transaction_lag": "1",
      "options": {}
    }
  ]
}


*************************** 1. row ***************************
        id: 1
  vrepl_id: 1
      type: Stream Created
     state: Stopped
created_at: 2024-10-23 12:29:23
updated_at: 2024-10-23 12:29:23
   message: {"cell":"","component_throttled":"","db_name":"vt_commerce","defer_secondary_keys":"0","id":"1","max_replication_lag":"9223372036854775807","max_tps":"9223372036854775807","options":"{}","pos":"","reason_throttled":"","rows_copied":"0","source":"keyspace:\"commerce\" shard:\"0\" filter:{rules:{match:\"_vt_vrp_edb729f8915b11ef85c156bfd605e62f_20241023122909_\" filter:\"select `id` as `id`, `c1` as `c1` from `test1`\" source_unique_key_columns:\"id\" target_unique_key_columns:\"id\" source_unique_key_target_columns:\"id\" force_unique_key:\"PRIMARY\"}}","state":"Stopped","tablet_types":"in_order:REPLICA,PRIMARY","tags":"","time_heartbeat":"0","time_throttled":"0","time_updated":"1729700963","transaction_timestamp":"0","workflow":"edb729f8_915b_11ef_85c1_56bfd605e62f","workflow_sub_type":"0","workflow_type":"5"}
     count: 1
*************************** 2. row ***************************
        id: 2
  vrepl_id: 1
      type: State Changed
     state: Running
created_at: 2024-10-23 12:29:23
updated_at: 2024-10-23 12:29:23
   message:
     count: 2
*************************** 3. row ***************************
        id: 3
  vrepl_id: 1
      type: Started Copy Phase
     state: Copying
created_at: 2024-10-23 12:29:23
updated_at: 2024-10-23 12:29:23
   message: Copy phase started for 1 table(s)
     count: 1
*************************** 4. row ***************************
        id: 4
  vrepl_id: 1
      type: Ended Copy Phase
     state: Copying
created_at: 2024-10-23 12:29:37
updated_at: 2024-10-23 12:29:37
   message: Copy phase completed at gtid aa88c6a0-915b-11ef-b535-fa5bd92f5f51:1-6052
     count: 1
*************************** 5. row ***************************
        id: 5
  vrepl_id: 1
      type: State Changed
     state: Running
created_at: 2024-10-23 12:29:37
updated_at: 2024-10-23 12:29:37
   message:
     count: 1

Binary Version

❯ vtgate --version
vtgate version Version: 22.0.0-SNAPSHOT (Git revision b0b79813f21f8ecbf409f558ad6f8864332637cf branch 'main') built on Wed Oct 23 12:00:26 EDT 2024 by [email protected] using go1.23.2 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

No response

@mattlord mattlord added Type: Bug Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels Oct 23, 2024
@mattlord mattlord self-assigned this Oct 23, 2024
@mattlord
Copy link
Contributor Author

The problem seems to be related to the sentry table that is created. We create that table and then wait for its event to be replicated here:

parsed := sqlparser.BuildParsedQuery(sqlCreateSentryTable, sentryTableName)
if _, err := e.execQuery(ctx, parsed.Query); err != nil {
return err
}
e.updateMigrationStage(ctx, onlineDDL.UUID, "sentry table created: %s", sentryTableName)
postSentryPos, err := e.primaryPosition(ctx)
if err != nil {
return err
}
e.updateMigrationStage(ctx, onlineDDL.UUID, "waiting for post-sentry pos: %v", replication.EncodePosition(postSentryPos))
if err := waitForPos(s, postSentryPos); err != nil {
return err
}
e.updateMigrationStage(ctx, onlineDDL.UUID, "post-sentry pos reached")

And we hit the cutover migration's 10 second timeout before it's replicated.

@mattlord
Copy link
Contributor Author

mattlord commented Oct 24, 2024

This test case passes when specifying the longer cutover timeout:

git checkout main && make build

cd examples/local
alias vtctldclient='command vtctldclient --server=localhost:15999'

./101_initial_cluster.sh

for i in {1..3000}; do
  mysql commerce -e "create table test${i} (id int not null primary key, c1 varchar(100))"
done

vtctldclient ApplySchema --ddl-strategy "vitess --cut-over-threshold=30s" --sql "ALTER TABLE test1 ADD age int" commerce

sleep 60

mysql -e "show vitess_migrations\G"

vtctldclient GetWorkflows commerce --include-logs=false --compact

mysql commerce/0@primary -e "select * from _vt.vreplication_log\G"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants