This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Online Restore Project #87

Open
5 of 8 tasks
overvenus opened this issue Dec 6, 2019 · 3 comments

@overvenus
Member

overvenus commented Dec 6, 2019

BR support online restore

Overview

Currently, BR only supports offline restore, but some users need online restore.
The goal of this project is to support online restore without significantly
affecting online queries while the restore runs.

Problem Statement

BR switches TiKV to import mode during restore. In this mode, TiKV can
quickly import a large amount of data through RocksDB IngestSST, but the mode
seriously degrades online queries.

During restore, BR splits off a large number of small regions and scatters
them randomly across the TiKV nodes. This affects PD scheduling decisions,
which has an unpredictable impact on online queries.

After restore is complete, the entire cluster runs a full Compaction,
which consumes a large amount of IO and CPU.

Until the restore completes, the restored data is inconsistent, but nothing
prevents TiDB from operating on it.

Proposed Solution

Online restore scenarios can be roughly divided into two categories:

  1. No new TiKV nodes join the cluster, and restore is performed on the existing nodes.
  2. New TiKV nodes join the cluster, and restore is performed on the new nodes.

For case 1, we can:

  • Skip setting import mode and limit the impact of import on online queries
    through flow control. Specific strategies:
    • Flow control Upload/Download.
    • Flow control Ingest SST.
  • After restore is complete, skip Compaction. Besides optimizing the
    RocksDB data structure, Compaction also writes SST table properties,
    so to skip Compaction, TiKV needs to:
    • Write the TiKV table properties into SST files during Upload/Download.

For case 2, we can:

  • Let scatter region support scattering regions onto specified TiKV nodes.
  • Let PD better handle scheduling in online restore scenarios.
  • Follow the offline restore process on TiKV.
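
Scattering onto specified nodes for case 2 could look roughly like the sketch below. It is not PD's scheduler: `assignRegions`, the round-robin policy, and the region and store IDs are assumptions chosen for the example. The point is only that newly split regions are placed on a caller-supplied set of stores instead of on every store in the cluster.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// assignRegions is a hypothetical helper that spreads region IDs
// round-robin over the allowed store IDs and returns region -> store.
func assignRegions(regionIDs, storeIDs []uint64) (map[uint64]uint64, error) {
	if len(storeIDs) == 0 {
		return nil, errors.New("no target stores specified")
	}
	// Copy and sort so the plan does not depend on the caller's ordering.
	stores := append([]uint64(nil), storeIDs...)
	sort.Slice(stores, func(i, j int) bool { return stores[i] < stores[j] })

	plan := make(map[uint64]uint64, len(regionIDs))
	for i, region := range regionIDs {
		plan[region] = stores[i%len(stores)]
	}
	return plan, nil
}

func main() {
	regions := []uint64{101, 102, 103, 104, 105}
	newStores := []uint64{9, 8} // assumed IDs of the newly joined TiKV nodes
	plan, err := assignRegions(regions, newStores)
	if err != nil {
		fmt.Println("scatter failed:", err)
		return
	}
	for _, region := range regions {
		fmt.Printf("region %d -> store %d\n", region, plan[region])
	}
}
```

Round-robin ignores store capacity and existing load; it is used here only because it makes the restriction to specified stores easy to see.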

In both cases, TiDB needs to support making the restored database/table
invisible to the user.
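
One way to picture the invisibility requirement is a per-table state that user-facing listings filter on. The sketch below is not TiDB's schema code: `TableState`, `StateRestoring`, `visibleTables`, and `finishRestore` are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// TableState is a hypothetical visibility state for a table.
type TableState int

const (
	StatePublic    TableState = iota // visible to users
	StateRestoring                   // being restored, hidden from users
)

// Table is a simplified catalog entry.
type Table struct {
	Name  string
	State TableState
}

// visibleTables returns the names a user-facing listing would show,
// skipping every table that is still being restored.
func visibleTables(all []Table) []string {
	names := make([]string, 0, len(all))
	for _, t := range all {
		if t.State == StatePublic {
			names = append(names, t.Name)
		}
	}
	return names
}

// finishRestore makes a table visible once its data is consistent.
func finishRestore(t *Table) {
	t.State = StatePublic
}

func main() {
	catalog := []Table{
		{Name: "orders", State: StatePublic},
		{Name: "orders_backup", State: StateRestoring},
	}
	fmt.Println("during restore:", visibleTables(catalog))
	finishRestore(&catalog[1])
	fmt.Println("after restore: ", visibleTables(catalog))
}
```

In this sketch a table starts out in the hidden state and only becomes public when `finishRestore` is called.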

Success Criteria

  • Achieve online restore.
  • No service unavailability during online restore.
  • No write stall occurs during online restore.
  • The user cannot operate on the restored database/table during online restore.

Difficulty

  • Medium

Score

  • 4000

TODO list

  • TiKV skips setting import mode during restore. (score: 200)
  • TiKV can flow control Upload/Download SST during restore. (score: 200)
  • TiKV can flow control ingest SST during restore. (score: 200)
  • Upload/Download handles the table properties that an SST needs before it is written to TiKV. (score: 300)
  • PD supports scattering regions onto specified TiKV nodes. (score: 800)
  • PD scheduling should not be affected by the restore scenario. (score: 800)
  • TiDB supports hiding the specified database/table. (score: 1000)
  • Add a command in BR to roll back settings (PD and TiKV settings). (score: 500)
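
The rollback item in the list above can be sketched as an undo log: record each setting's previous value before changing it, then replay the log in reverse. This is an assumption about how such a command might work, not BR's design. `Settings`, `Rollbacker`, and the setting keys are hypothetical.

```go
package main

import "fmt"

// Settings is a hypothetical flat view of PD/TiKV settings.
type Settings map[string]string

// change remembers what a key held before it was overwritten.
type change struct {
	key     string
	old     string
	existed bool
}

// Rollbacker applies setting changes and can undo all of them.
type Rollbacker struct {
	cfg  Settings
	undo []change
}

func NewRollbacker(cfg Settings) *Rollbacker {
	return &Rollbacker{cfg: cfg}
}

// Set records the previous value of key, then overwrites it.
func (r *Rollbacker) Set(key, value string) {
	old, ok := r.cfg[key]
	r.undo = append(r.undo, change{key: key, old: old, existed: ok})
	r.cfg[key] = value
}

// Rollback restores every changed key, newest change first, so a key
// that was set several times ends up with its original value.
func (r *Rollbacker) Rollback() {
	for i := len(r.undo) - 1; i >= 0; i-- {
		c := r.undo[i]
		if c.existed {
			r.cfg[c.key] = c.old
		} else {
			delete(r.cfg, c.key)
		}
	}
	r.undo = nil
}

func main() {
	// The keys below are made up for the example.
	cfg := Settings{"example.pd.limit": "4"}
	rb := NewRollbacker(cfg)
	rb.Set("example.pd.limit", "40")
	rb.Set("example.tikv.restore-mode", "on")
	fmt.Println("during restore:", cfg)
	rb.Rollback()
	fmt.Println("after rollback:", cfg)
}
```

A real command would also have to persist the undo log, since BR may exit or crash between changing the settings and rolling them back.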

Mentor(s)

Recommended skills

  • Go language
  • Basic understanding of TiKV and PD

Time

GanttStart: YYYY-MM-DD
GanttDue: YYYY-MM-DD

@wjhuang2016
Member

wjhuang2016 commented Dec 16, 2019

@overvenus It seems that we don't need to prevent show tables and show columns?
Is lock table xx write enough to "prevent TiDB from operating on the data"?

@zimulala

zimulala commented Dec 17, 2019

@wjhuang2016
We cannot guarantee that no other operation happens between "create table" and "lock table". Besides, when the session is closed, the table locks held by that session are released.

@IANTHEREAL
Collaborator

Are all related development tasks completed?
