[Feature] AutoScaling For TidbCluster #1651

Open · 3 of 4 tasks
Yisaer opened this issue Feb 7, 2020 · 10 comments
Labels: area/auto-scaling (related to auto-scaling) · priority:P1 · status/help-wanted (Extra attention is needed) · status/WIP (Issue/PR is being worked on)

Comments


Yisaer commented Feb 7, 2020

Description

Describe the feature you'd like:
TiDB Operator is going to support an auto-scaling feature for TiKV and TiDB in TidbCluster.

Auto-scaling would help Operator users automatically scale TiKV and TiDB in a TidbCluster out and in, based on the metrics, values, or resources that the users provide. This issue is to discuss and track the whole process of designing and implementing auto-scaling.

First, I think there are some prerequisites for auto-scaling TidbClusters:

  • The auto-scaling algorithm must be customizable. (must have) A TidbCluster, being a distributed database, is sensitive to scaling, so we should have the ability to control the whole auto-scaling process.
  • The metrics configuration should be extendable. (must have) Currently, auto-scaling needs the TidbCluster metrics to decide the recommended numbers of TiKV and TiDB instances. In the future, platform information (Kubernetes) or external global metrics/values will also be necessary.
  • The interval duration of auto-scaling should be controllable at the cluster level. (nice to have) The Operator can manage several clusters in one Kubernetes cluster, so we should provide cluster-level control; the interval duration is also important for avoiding performance jitter.

Auto-Scaling Design
To support this feature and meet the prerequisites, auto-scaling is designed around one new API (TidbClusterAutoScaler) and one new controller (the AutoScaler Controller).

TidbClusterAutoScaler is similar to HPA. Operator users can use it to auto-scale a TidbCluster in/out according to their own demands, configured in the TidbClusterAutoScaler spec.

The AutoScaler Controller watches each TidbClusterAutoScaler and reconciles it to adjust the replicas in the TidbCluster.
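
A minimal sketch of that relationship, using hypothetical simplified types rather than the actual tidb-operator API: on each sync, the controller clamps a metrics-driven recommendation into the user-configured bounds and writes it back to the TidbCluster.

```go
package main

import "fmt"

// TidbClusterAutoScaler mirrors the HPA idea: user-configured replica
// bounds for one component of the target cluster.
type TidbClusterAutoScaler struct {
	Cluster     string
	MinReplicas int32
	MaxReplicas int32
}

// TidbCluster carries the replicas the Operator actually deploys.
type TidbCluster struct {
	Name         string
	TiKVReplicas int32
}

// reconcile is the controller's job on each sync: take a recommendation
// (computed elsewhere from metrics), clamp it into the user-configured
// range, and write it back to the TidbCluster.
func reconcile(tac TidbClusterAutoScaler, tc *TidbCluster, recommended int32) {
	if recommended < tac.MinReplicas {
		recommended = tac.MinReplicas
	}
	if recommended > tac.MaxReplicas {
		recommended = tac.MaxReplicas
	}
	tc.TiKVReplicas = recommended
}

func main() {
	tc := TidbCluster{Name: "basic", TiKVReplicas: 3}
	tac := TidbClusterAutoScaler{Cluster: "basic", MinReplicas: 3, MaxReplicas: 8}
	reconcile(tac, &tc, 10)      // recommendation above the max bound
	fmt.Println(tc.TiKVReplicas) // prints 8: clamped to MaxReplicas
}
```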

Category

Auto-Scaling

TODO List

Workload Estimation (P0 features): 45

Time

GanttStart: 2020-07-13
GanttDue: 2020-09-30

Documentation

Project

Yisaer added the status/PTAL (PR needs to be reviewed), area/auto-scaling (related to auto-scaling), and status/WIP (Issue/PR is being worked on) labels and removed the status/PTAL label on Feb 7, 2020
Yisaer pinned this issue on Feb 7, 2020

Yisaer commented Feb 7, 2020

We welcome everyone to help realize the auto-scaling feature in the Operator through discussion, suggestions, code review, pull requests, etc.


Yisaer commented Feb 17, 2020

Currently, auto-scaling is an alpha feature.
It only provides the abilities below:

  1. basic auto-scaling availability
  2. a basic guarantee against jitter during auto-scaling

To achieve production readiness, we need to:

  1. control the scaling step for auto-scaling (needs discussion)
  2. record the consecutive-count timestamp in auto-scaling (needs discussion)
  3. fetch store info from pdapi
  4. skip the consecutive-count control if the auto-scaling results are the same between two runs (needs discussion)
  5. add a timeout for the Prometheus query (see the sketch after this list)
  6. make filterTidbInstance compatible with tidb failover
  7. noise reduction
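
For item 5, a minimal sketch of bounding a Prometheus query with a context timeout, using the standard prometheus/client_golang API; the address and query string below are placeholders, not the operator's actual query:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Without a deadline, a slow Prometheus would stall the whole
	// auto-scaling sync loop; the context bounds the query to 5s.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	query := `avg(rate(process_cpu_seconds_total{cluster="basic"}[1m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		fmt.Println("query failed (or timed out):", err)
		return
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("result:", result)
}
```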


Yisaer commented Feb 19, 2020

Here is how the auto-scaling algorithm works based on average CPU load:

#1722
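
In case it helps to see the arithmetic, here is a sketch assuming the algorithm follows the standard HPA-style proportional formula (the actual implementation in #1722 may differ): the recommendation is ceil(currentReplicas × averageUtilization / targetUtilization).

```go
package main

import (
	"fmt"
	"math"
)

// recommendReplicas returns ceil(current * avg / target): when average
// CPU utilization sits above the target, scale out proportionally;
// below it, scale in.
func recommendReplicas(current int32, cpuPerInstance []float64, targetCPU float64) int32 {
	if len(cpuPerInstance) == 0 {
		return current // no samples: keep the current replica count
	}
	var sum float64
	for _, u := range cpuPerInstance {
		sum += u
	}
	avg := sum / float64(len(cpuPerInstance))
	return int32(math.Ceil(float64(current) * avg / targetCPU))
}

func main() {
	// Three TiKV instances with average utilization 1.2 against a
	// target of 0.8 -> ceil(3 * 1.2 / 0.8) = 5.
	fmt.Println(recommendReplicas(3, []float64{1.3, 1.1, 1.2}, 0.8))
}
```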


Yisaer commented Feb 20, 2020

After #1731 is merged, we will have the alpha-stage ability to auto-scale based on CPU load.
Currently, there are still plenty of jobs to do:

  1. Sync the TidbClusterAutoScaler status
  2. Add proper events and logs
  3. Design noise reduction for the auto-scaling process
  4. Support a specific PD label for the scaled-out TiKV instances
  5. Unit tests and e2e tests for auto-scaling
  6. Support TidbMonitor


Yisaer commented Feb 24, 2020

There are several good first issues about auto-scaling; we welcome newcomers to join the contribution by taking on these tasks.

Ref:
#1751
#1752
#1753


Yisaer commented Feb 28, 2020

Syncing the replicas between the online configuration and the local configuration is a long-standing problem once an autoscaler (or HPA) is in use; this issue requests a new feature to solve that problem.
#1818


Yisaer commented Feb 28, 2020

To improve the user experience, we should enhance the information shown when executing

kubectl get tidbclusterautoscaler

Ref:
#1820
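
One common way to do this, sketched here under the assumption that the CRD type is generated with kubebuilder markers (the column names and JSON paths are illustrative; see #1820 for the actual change), is to declare printer columns on the CRD type:

```go
// Package v1alpha1 stands in for wherever the CRD types live.
package v1alpha1

// TidbClusterAutoScaler is the CRD type; the printcolumn markers tell
// kubebuilder's code generation which fields `kubectl get` should show.
//
// +kubebuilder:printcolumn:name="MinReplicas",type="integer",JSONPath=".spec.tikv.minReplicas"
// +kubebuilder:printcolumn:name="MaxReplicas",type="integer",JSONPath=".spec.tikv.maxReplicas"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type TidbClusterAutoScaler struct {
	Spec TidbClusterAutoScalerSpec `json:"spec"`
}

// TidbClusterAutoScalerSpec is a stand-in for the real spec layout.
type TidbClusterAutoScalerSpec struct {
	TiKV struct {
		MinReplicas int32 `json:"minReplicas"`
		MaxReplicas int32 `json:"maxReplicas"`
	} `json:"tikv"`
}
```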

DanielZhangQD added this to the v1.1.1 milestone on Mar 31, 2020

Yisaer commented Apr 16, 2020

We have now released auto-scaling as an alpha feature in Operator version 1.1, based on CPU load. Next, we will focus on the following 3 things:

  1. support more kinds of metrics in auto-scaling
  2. noise reduction for auto-scaling
  3. for tikv and tidb auto-scaling, try to use a heterogeneous design instead of the mutating webhook

The e2e tests should also be completed.


Yisaer commented Apr 26, 2020

We are happy to announce that auto-scaling is going to gain an external strategy ability: an HTTP interface is exposed so that community users can plug in their own auto-scaling strategy (such as an AI-based predictive strategy) to drive TidbCluster auto-scaling.

For more detail, see: #2279
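
A sketch of what such an external strategy service could look like; the endpoint path and JSON shapes below are hypothetical assumptions, not the interface defined in #2279:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// strategyRequest is a hypothetical payload the operator would POST.
type strategyRequest struct {
	Cluster         string  `json:"cluster"`
	Component       string  `json:"component"` // e.g. "tikv" or "tidb"
	CurrentReplicas int32   `json:"currentReplicas"`
	AvgCPU          float64 `json:"avgCpu"`
}

// strategyResponse carries the strategy's recommendation back.
type strategyResponse struct {
	RecommendedReplicas int32 `json:"recommendedReplicas"`
}

func main() {
	http.HandleFunc("/autoscaling/strategy", func(w http.ResponseWriter, r *http.Request) {
		var req strategyRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Any custom logic can live here, e.g. an AI-based prediction;
		// this stub simply keeps the current replica count.
		resp := strategyResponse{RecommendedReplicas: req.CurrentReplicas}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```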


DanielZhangQD commented May 6, 2020

cofyc added the status/help-wanted (Extra attention is needed) label on Jun 8, 2020