Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UCP] Noise reduction support when cluster auto scaling #2307

Merged
merged 1 commit into from
May 27, 2020

Conversation

vincent178
Copy link
Contributor

@vincent178 vincent178 commented Apr 26, 2020

What problem does this PR solve?

UCP #2241

What is changed and how does it work?

Add Phrase for TidbClusterAutoScaler, it has three optional values: Normal, ReadyToScaleOut, ReadyToScaleIn.
Add user defined autoscale threshold.

use tidb as an example, tikv should be the same.

  1. check phase
    after autoscaler calculate the target replica, if target replica equals to current replica, set auto scaler phase to normal and return, otherwise check if the phase equal to scaleOut or scaleIn, if not, update the phase and record timestamp. go to 2.

  2. check timestamp threshold
    check the record timestamp remains longer than threshold which user defined. If not, return, otherwise go to 3.

  3. do autoscale
    set phase to normal and do normal autoscale.

Design for e2e tests:
After mock response from Prometheus, without noise reduction, it will start auto scaling after maximum 30s, but with noise reduction, we expect at least in 300s (ReadyToScaleThresholdSeconds),
cluster should remain the replica number. And after ReadyToScaleThresholdSeconds time, the cluster should start normal auto scaling and auto scale phase should be back to normal.

Check List

Tests

  • Unit test
  • E2E test

Code changes

  • Has Go code change

Side effects

n/a

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Does this PR introduce a user-facing change?:

User now can specify `ReadyToScaleThresholdSeconds` in `TidbClusterAutoScaler`, until cluster keep in ready to auto scale status for `ReadyToScaleThresholdSeconds` seconds, then cluster start auto scaling.

@sre-bot
Copy link
Contributor

sre-bot commented Apr 26, 2020

Thanks for your contribution. If your PR get merged, you will be rewarded 2000 points.

@CLAassistant
Copy link

CLAassistant commented Apr 26, 2020

CLA assistant check
All committers have signed the CLA.

@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch 10 times, most recently from 52cc533 to de74eb8 Compare April 27, 2020 08:40
Copy link
Contributor

@Yisaer Yisaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code seems generally good to me. Please address the comment.

And there is one more question for me about the time-window noise-reduction. It seems the tidb autoscaler should have the same result Scalein / ScaleOut in a certain time then it would auto-scale. If the Operator is down during the time-window, would the noise reduction fail?

For example:
0:01s Tidb auto-scaler have ScaleOut result, the status become ScaleOut
0:02s Operator Down
4:01s Operator Up
4:02s Tidb auto-scaler have ScaleOut result again, the auto-scaler admit to scale-out tidb.

In this case, I think the noise reduction fail

pkg/apis/pingcap/v1alpha1/tidbclusterautoscaler_types.go Outdated Show resolved Hide resolved
pkg/label/label.go Outdated Show resolved Hide resolved
pkg/autoscaler/autoscaler/util.go Outdated Show resolved Hide resolved
pkg/autoscaler/autoscaler/tidb_autoscaler.go Outdated Show resolved Hide resolved
pkg/autoscaler/autoscaler/tidb_autoscaler.go Outdated Show resolved Hide resolved
@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch 16 times, most recently from 5140412 to 90e23cf Compare April 29, 2020 21:38
@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch from 305220a to dce8456 Compare May 25, 2020 01:04
@cofyc
Copy link
Contributor

cofyc commented May 25, 2020

/run-e2e-tests

@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch 4 times, most recently from 5b69b83 to 579ad1b Compare May 25, 2020 21:48
@vincent178
Copy link
Contributor Author

@cofyc @Yisaer @DanielZhangQD PTAL

@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch from 579ad1b to 49a70b3 Compare May 26, 2020 01:11
@Yisaer
Copy link
Contributor

Yisaer commented May 26, 2020

/run-e2e-in-kind

Copy link
Contributor

@Yisaer Yisaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost LGTM

Comment on lines 90 to 92
// If not set, the default ReadyToScaleThresholdSeconds will be set to 300.
// +optional
ReadyToScaleThresholdSeconds *int32 `json:"readyToScaleThresholdSeconds,omitempty"`
Copy link
Contributor

@Yisaer Yisaer May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

300 is about 10 times re-sync duration, I think it might be too long for a default value. What about 30 as it only require 1 re-sync duration.

Comment on lines 222 to 224
if tac.Spec.TiKV.ReadyToScaleThresholdSeconds == nil {
tac.Spec.TiKV.ReadyToScaleThresholdSeconds = pointer.Int32Ptr(300)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

300 is too large. ditto.

AnnTiKVReadyToScaleTimestamp = "tikv.tidb.pingcap.com/ready-to-scale-timestamp"

// AnnLastSyncingTimestamp records last sync timestamp
AnnLastSyncingTimestamp = "auto-scaling.tidb.pingcap.com/last-syncing-timestamp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think auto-scaling here is unnecessary, this annotation could be common used for each API Object.

@Yisaer Yisaer requested a review from DanielZhangQD May 26, 2020 04:59
@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch 2 times, most recently from e3e2a7a to f620f6e Compare May 26, 2020 22:06
* add AnnLastSyncingTimestamp timestamp
* add ReadyToScaleThresholdSeconds in AutoScalerSpec
* add AnnTiKVReadyToScaleTimestamp timestamp labels to record AutoScalerPhase
* add Normal, ReadyToScaleOut and ReadyToScaleIn three AutoScalerPhase
* add checkStsReadyAutoScalingTimestamp to check AutoScalerPhase
timestamp, only for TiKV
* add checkStsLastSyncTimestamp to check maximum thresholdSec allowed
before reset phase to Normal, only for TiKV
* add checkStsAutoScaling combine checkStsLastSyncTimestamp,
checkStsReadyAutoScalingTimestamp and checkStsAutoScalingInterval
* add unit tests
* add integration e2e tests
* update doc
@vincent178 vincent178 force-pushed the auto-scaling-noise-reduction branch from f620f6e to 2747929 Compare May 27, 2020 02:47
@Yisaer
Copy link
Contributor

Yisaer commented May 27, 2020

/run-e2e-in-kind

@cofyc
Copy link
Contributor

cofyc commented May 27, 2020

/merge

@sre-bot
Copy link
Contributor

sre-bot commented May 27, 2020

Your auto merge job has been accepted, waiting for:

  • 2504

@Yisaer Yisaer merged commit d7af535 into pingcap:master May 27, 2020
@sre-bot
Copy link
Contributor

sre-bot commented May 27, 2020

Team qidelongdongqiang complete task #2241 and get 2500 score, current score 3979

@Yisaer
Copy link
Contributor

Yisaer commented May 27, 2020

/run-cherry-picker

sre-bot pushed a commit to sre-bot/tidb-operator that referenced this pull request May 27, 2020
@sre-bot
Copy link
Contributor

sre-bot commented May 27, 2020

cherry pick to release-1.1 in PR #2568

@vincent178 vincent178 deleted the auto-scaling-noise-reduction branch May 27, 2020 08:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants