[Feature] AutoScaling For TidbCluster #1651

Open · 3 of 4 tasks
Yisaer opened this issue Feb 7, 2020 · 10 comments
Labels: area/auto-scaling (related to auto-scaling) · priority:P1 · status/help-wanted (Extra attention is needed) · status/WIP (Issue/PR is being worked on)

Comments


Yisaer commented Feb 7, 2020

Description

Describe the feature you'd like:
TiDB Operator is going to support an auto-scaling feature for TiKV and TiDB in TidbCluster.

Auto-scaling would help Operator users automatically scale TiKV and TiDB in a TidbCluster out and in, based on the metrics, values, or resources that the users provide. This issue is to discuss and track the whole process of designing and implementing auto-scaling.

First, I think there are some prerequisites for auto-scaling TidbClusters:

  • The auto-scaling algorithm must be customizable. (must have) A TidbCluster, being a distributed database, is sensitive to scaling, so we should have the ability to control the whole auto-scaling process.
  • The metrics configuration should be extendable. (must have) Currently, auto-scaling needs the TidbCluster metrics to decide the recommended numbers of TiKV and TiDB instances. In the future, platform information (Kubernetes) or external global metrics/values will also be necessary.
  • The interval duration of auto-scaling should be controllable at the cluster level. (nice to have) The Operator can manage several clusters in one Kubernetes cluster, so we should provide cluster-level control; the interval duration is also important for avoiding performance jitter.

Auto-Scaling Design
To support this feature and meet the prerequisites, auto-scaling is designed around one new API (TidbClusterAutoScaler) and one new controller (the AutoScaler Controller).

TidbClusterAutoScaler is similar to HPA. Operator users can use it to auto-scale a TidbCluster in/out according to their own demands, configured in the TidbClusterAutoScaler spec.

The AutoScaler Controller watches each TidbClusterAutoScaler and reconciles it to adjust the replicas in the TidbCluster.
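
A minimal sketch of that relationship, using hypothetical simplified types rather than the actual tidb-operator API: on each sync, the controller clamps a metrics-driven recommendation into the user-configured bounds and writes it back to the TidbCluster.

```go
package main

import "fmt"

// TidbClusterAutoScaler mirrors the HPA idea: user-configured replica
// bounds for one component of the target cluster.
type TidbClusterAutoScaler struct {
	Cluster     string
	MinReplicas int32
	MaxReplicas int32
}

// TidbCluster carries the replicas the Operator actually deploys.
type TidbCluster struct {
	Name         string
	TiKVReplicas int32
}

// reconcile is the controller's job on each sync: take a recommendation
// (computed elsewhere from metrics), clamp it into the user-configured
// range, and write it back to the TidbCluster.
func reconcile(tac TidbClusterAutoScaler, tc *TidbCluster, recommended int32) {
	if recommended < tac.MinReplicas {
		recommended = tac.MinReplicas
	}
	if recommended > tac.MaxReplicas {
		recommended = tac.MaxReplicas
	}
	tc.TiKVReplicas = recommended
}

func main() {
	tc := TidbCluster{Name: "basic", TiKVReplicas: 3}
	tac := TidbClusterAutoScaler{Cluster: "basic", MinReplicas: 3, MaxReplicas: 8}
	reconcile(tac, &tc, 10)      // recommendation above the max bound
	fmt.Println(tc.TiKVReplicas) // prints 8: clamped to MaxReplicas
}
```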

Category

Auto-Scaling

TODO List

Workload Estimation (P0 features): 45

Time

GanttStart: 2020-07-13
GanttDue: 2020-09-30

Documentation

Project

Yisaer added the status/PTAL (PR needs to be reviewed), area/auto-scaling (related to auto-scaling), and status/WIP (Issue/PR is being worked on) labels and removed the status/PTAL label on Feb 7, 2020
Yisaer pinned this issue on Feb 7, 2020

Yisaer commented Feb 7, 2020

We welcome everyone to help realize the auto-scaling feature in the Operator through discussion, suggestions, code review, pull requests, etc.


Yisaer commented Feb 17, 2020

Currently, auto-scaling is an alpha feature.
It only provides the abilities below:

  1. basic auto-scaling availability
  2. a basic guarantee against jitter during auto-scaling

To achieve production readiness, we need to:

  1. control the scaling step for auto-scaling (needs discussion)
  2. record the consecutive-count timestamp in auto-scaling (needs discussion)
  3. fetch store info from pdapi
  4. skip the consecutive-count control if the auto-scaling results are the same between two runs (needs discussion)
  5. add a timeout for the Prometheus query (see the sketch after this list)
  6. make filterTidbInstance compatible with tidb failover
  7. noise reduction
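
For item 5, a minimal sketch of bounding a Prometheus query with a context timeout, using the standard prometheus/client_golang API; the address and query string below are placeholders, not the operator's actual query:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Without a deadline, a slow Prometheus would stall the whole
	// auto-scaling sync loop; the context bounds the query to 5s.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	query := `avg(rate(process_cpu_seconds_total{cluster="basic"}[1m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		fmt.Println("query failed (or timed out):", err)
		return
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("result:", result)
}
```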


Yisaer commented Feb 19, 2020

Here is how the auto-scaling algorithm works based on average CPU load:

#1722
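
In case it helps to see the arithmetic, here is a sketch assuming the algorithm follows the standard HPA-style proportional formula (the actual implementation in #1722 may differ): the recommendation is ceil(currentReplicas × averageUtilization / targetUtilization).

```go
package main

import (
	"fmt"
	"math"
)

// recommendReplicas returns ceil(current * avg / target): when average
// CPU utilization sits above the target, scale out proportionally;
// below it, scale in.
func recommendReplicas(current int32, cpuPerInstance []float64, targetCPU float64) int32 {
	if len(cpuPerInstance) == 0 {
		return current // no samples: keep the current replica count
	}
	var sum float64
	for _, u := range cpuPerInstance {
		sum += u
	}
	avg := sum / float64(len(cpuPerInstance))
	return int32(math.Ceil(float64(current) * avg / targetCPU))
}

func main() {
	// Three TiKV instances with average utilization 1.2 against a
	// target of 0.8 -> ceil(3 * 1.2 / 0.8) = 5.
	fmt.Println(recommendReplicas(3, []float64{1.3, 1.1, 1.2}, 0.8))
}
```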


Yisaer commented Feb 20, 2020

After #1731 is merged, we will have the alpha-stage ability to auto-scale based on CPU load.
Currently, there are still plenty of jobs to do:

  1. Sync the TidbClusterAutoScaler status
  2. Add proper events and logs
  3. Design noise reduction for the auto-scaling process
  4. Support a specific PD label for the scaled-out TiKV instances
  5. Unit tests and e2e tests for auto-scaling
  6. Support TidbMonitor


Yisaer commented Feb 24, 2020

There are several good first issues about auto-scaling; we welcome newcomers to join the contribution by taking on these tasks.

Ref:
#1751
#1752
#1753


Yisaer commented Feb 28, 2020

Syncing the replicas between the online configuration and the local configuration is a long-standing problem once an autoscaler (or HPA) is in use; this issue requests a new feature to solve that problem.
#1818


Yisaer commented Feb 28, 2020

To improve the user experience, we should enhance the information shown when executing

kubectl get tidbclusterautoscaler

Ref:
#1820
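
One common way to do this, sketched here under the assumption that the CRD type is generated with kubebuilder markers (the column names and JSON paths are illustrative; see #1820 for the actual change), is to declare printer columns on the CRD type:

```go
// Package v1alpha1 stands in for wherever the CRD types live.
package v1alpha1

// TidbClusterAutoScaler is the CRD type; the printcolumn markers tell
// kubebuilder's code generation which fields `kubectl get` should show.
//
// +kubebuilder:printcolumn:name="MinReplicas",type="integer",JSONPath=".spec.tikv.minReplicas"
// +kubebuilder:printcolumn:name="MaxReplicas",type="integer",JSONPath=".spec.tikv.maxReplicas"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
type TidbClusterAutoScaler struct {
	Spec TidbClusterAutoScalerSpec `json:"spec"`
}

// TidbClusterAutoScalerSpec is a stand-in for the real spec layout.
type TidbClusterAutoScalerSpec struct {
	TiKV struct {
		MinReplicas int32 `json:"minReplicas"`
		MaxReplicas int32 `json:"maxReplicas"`
	} `json:"tikv"`
}
```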

DanielZhangQD added this to the v1.1.1 milestone on Mar 31, 2020

Yisaer commented Apr 16, 2020

We have now released auto-scaling as an alpha feature in Operator version 1.1, based on CPU load. Next, we will focus on the following 3 things:

  1. support more kinds of metrics in auto-scaling
  2. noise reduction for auto-scaling
  3. for tikv and tidb auto-scaling, try to use a heterogeneous design instead of the mutating webhook

The e2e tests should also be completed.


Yisaer commented Apr 26, 2020

We are happy to announce that auto-scaling is going to gain an external strategy ability: an HTTP interface is exposed so that community users can plug in their own auto-scaling strategy (such as an AI-based predictive strategy) to drive TidbCluster auto-scaling.

For more detail, see: #2279
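
A sketch of what such an external strategy service could look like; the endpoint path and JSON shapes below are hypothetical assumptions, not the interface defined in #2279:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// strategyRequest is a hypothetical payload the operator would POST.
type strategyRequest struct {
	Cluster         string  `json:"cluster"`
	Component       string  `json:"component"` // e.g. "tikv" or "tidb"
	CurrentReplicas int32   `json:"currentReplicas"`
	AvgCPU          float64 `json:"avgCpu"`
}

// strategyResponse carries the strategy's recommendation back.
type strategyResponse struct {
	RecommendedReplicas int32 `json:"recommendedReplicas"`
}

func main() {
	http.HandleFunc("/autoscaling/strategy", func(w http.ResponseWriter, r *http.Request) {
		var req strategyRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Any custom logic can live here, e.g. an AI-based prediction;
		// this stub simply keeps the current replica count.
		resp := strategyResponse{RecommendedReplicas: req.CurrentReplicas}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```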


DanielZhangQD commented May 6, 2020

cofyc added the status/help-wanted (Extra attention is needed) label on Jun 8, 2020