fix: rollback to stable with dynamicStableScale could overwhelm stable pods #3077

Merged

Conversation

@jessesuen (Member) commented Oct 4, 2023

Resolves #3020

Changes made:

1. Safer scale-down decisions for "old" ReplicaSets

When scaling down "old" ReplicaSets, we now additionally check whether they are still referenced by any Services before allowing them to be scaled down. This is an overall safety improvement: we only scale down older ReplicaSets once it no longer matters (nothing is pointing to them).
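As a rough illustration (not the actual controller code — the helper name and signature are assumptions), the reference check boils down to comparing each Service's selector against the ReplicaSet's pod template hash:

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isReplicaSetReferencedByServices is a hypothetical helper: it reports
// whether any of the given Services still selects the ReplicaSet's pods
// via the rollouts-pod-template-hash label. If one does, the ReplicaSet
// must not be scaled down yet.
func isReplicaSetReferencedByServices(rs *appsv1.ReplicaSet, svcs []*corev1.Service) bool {
	rsHash := rs.Labels["rollouts-pod-template-hash"]
	if rsHash == "" {
		return false
	}
	for _, svc := range svcs {
		if svc.Spec.Selector["rollouts-pod-template-hash"] == rsHash {
			return true // a Service still points at this ReplicaSet
		}
	}
	return false
}
```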

2. Delay the canary service selector switch when rolling back to stable with dynamicStableScale and the stable ReplicaSet is not yet fully available

Before this PR, when the user re-applied the stable pod spec, the rollout controller would immediately switch the canary service selector back to the stable RS. This meant all production traffic immediately hit the stable RS, even if it was undersized due to dynamicStableScale. With this PR, if dynamicStableScale is used and the stable RS is not fully available, we delay the service switch until the stable RS becomes fully available. Because we no longer touch the canary service in this scenario, the controller can continue to balance traffic between stable and the previous desired RS, so traffic can be safely shifted away from the latter (improvement 3 below).
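A minimal sketch of that condition, under assumed names (the real controller consults the rollout spec and ReplicaSet status, but the shape of the check is roughly):

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// shouldDelayCanaryServiceSwitch is an illustrative (assumed) helper: when
// dynamicStableScale is enabled and the stable ReplicaSet is not yet fully
// available, the canary service selector switch is postponed.
func shouldDelayCanaryServiceSwitch(dynamicStableScale bool, stableRS *appsv1.ReplicaSet) bool {
	if !dynamicStableScale || stableRS == nil || stableRS.Spec.Replicas == nil {
		return false
	}
	// "Fully available" here means available replicas have reached spec replicas.
	return stableRS.Status.AvailableReplicas < *stableRS.Spec.Replicas
}
```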

3. Gradually shift traffic back to the stable RS to avoid overwhelming stable pods

A user can re-apply the stable manifest in the middle of an update. When this happens, stableRS == desiredRS. Before this PR, we would immediately shift 100% of the traffic back to the stableRS (by setting the weight to 0), even though it may have been undersized due to dynamicStableScale. With this PR, we now follow logic similar to an abort with dynamicStableScale, and only increase traffic back to stable in accordance with stable's available replica count. A key difference between an abort and a rollback to stable:

  • With Abort: traffic is balanced between desired vs. stable
  • With Stable Rollback: traffic is balanced between previous desired vs. stable

Because of this difference, the traffic routing logic has a special case so that, in the event of a rollback to stable, traffic is split between the previous desired RS and stable (see the sketch below).
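A back-of-the-envelope sketch of the weight calculation (function name and exact rounding are assumptions; the real logic lives in the controller's traffic routing reconciliation): the previous desired RS only keeps the share of traffic that stable cannot yet absorb.

```go
package sketch

// previousDesiredWeight is an illustrative calculation: traffic left on the
// previous desired ReplicaSet shrinks as the stable ReplicaSet gains
// available replicas.
func previousDesiredWeight(stableAvailable, stableDesired int32) int32 {
	if stableDesired <= 0 {
		return 0
	}
	// Share of traffic the stable RS can safely absorb right now.
	stableWeight := stableAvailable * 100 / stableDesired
	if stableWeight > 100 {
		stableWeight = 100
	}
	return 100 - stableWeight
}
```

For example, with 10 desired stable replicas, 2/8/10 available stable replicas would yield previous-desired weights of 80/20/0 — consistent with the 90 -> 80 -> 20 -> 0 event trail shown in the comment below.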

I will be adding an e2e test for this.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch from 319d0a8 to 4c4bfd7 Compare October 4, 2023 21:44
@jessesuen (Member, Author) commented Oct 4, 2023

Here is the new order of events after this PR:

  1. First, the rollout is updated and reaches a pause step where 90% of traffic goes to the canary (this is the same as before):
8s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '90'
8m33s       Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 0 to 90
8m33s       Normal    RolloutStepCompleted     rollout/istio-canary                 Rollout step 1/2 completed (setWeight: 90)
8m33s       Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled down ReplicaSet istio-canary-f7d4dcd68 (revision 15) from 10 to 1
8m33s       Normal    RolloutPaused            rollout/istio-canary                 Rollout is paused (CanaryPauseStep)
  2. Next, the rollout is updated back to the stable pod spec. This is detected as a "rollback to stable". Looking at the Kubernetes events, we can see that we now do things in the correct order:
  • scale up stable
  • shift traffic away from the previous desired RS (80 -> 20 -> 0)
  • switch the canary service selector back to the stable RS
  • scale down the old RS (previous desired)
8s          Normal    RolloutUpdated           rollout/istio-canary                 Rollout updated to revision 17
8s          Normal    SkipSteps                rollout/istio-canary                 Rollback to stable ReplicaSets
8s          Normal    RolloutResumed           rollout/istio-canary                 Rollout is resumed
8s          Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled up ReplicaSet istio-canary-f7d4dcd68 (revision 17) from 1 to 10
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '80'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 90 to 80
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '20'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 80 to 20
0s          Normal    SwitchService            rollout/istio-canary                 Switched selector for service 'istio-canary-canary' from '7b8bcb8869' to 'f7d4dcd68'
0s          Normal    Updated VirtualService   rollout/istio-canary                 VirtualService `istio-canary` set to desiredWeight '0'
0s          Normal    TrafficWeightUpdated     rollout/istio-canary                 Traffic weight updated from 20 to 0
0s          Normal    ScalingReplicaSet        rollout/istio-canary                 Scaled down ReplicaSet istio-canary-7b8bcb8869 (revision 16) from 9 to 0
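
For reference, a minimal sketch of the "rollback to stable" detection mentioned in step 2 (an assumed form; the actual controller compares pod template hashes rather than names):

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// isRollbackToStable is an illustrative check: a mid-update re-apply of the
// stable pod spec makes the newly computed desired ReplicaSet coincide with
// the stable ReplicaSet.
func isRollbackToStable(stableRS, desiredRS *appsv1.ReplicaSet) bool {
	return stableRS != nil && desiredRS != nil && stableRS.Name == desiredRS.Name
}
```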

@jessesuen jessesuen changed the title fix: rollback to stable with dynamicStableScale could go under maxUnavailable fix: rollback to stable with dynamicStableScale could overwhelm stable pods Oct 4, 2023
@codecov bot commented Oct 4, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (047f2a2) 81.82% compared to head (cc1a941) 81.85%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3077      +/-   ##
==========================================
+ Coverage   81.82%   81.85%   +0.02%     
==========================================
  Files         134      134              
  Lines       20507    20556      +49     
==========================================
+ Hits        16780    16826      +46     
- Misses       2864     2866       +2     
- Partials      863      864       +1     
Files Coverage Δ
rollout/canary.go 76.11% <100.00%> (+1.17%) ⬆️
rollout/service.go 79.13% <100.00%> (+0.94%) ⬆️
utils/replicaset/replicaset.go 87.63% <ø> (-0.82%) ⬇️
rollout/replicaset.go 73.82% <97.36%> (+4.10%) ⬆️
rollout/bluegreen.go 68.56% <0.00%> (+0.25%) ⬆️
utils/replicaset/canary.go 94.02% <57.14%> (+0.05%) ⬆️
rollout/trafficrouting.go 72.85% <73.91%> (-1.08%) ⬇️


@github-actions bot (Contributor) commented Oct 4, 2023

Go Published Test Results

2 083 tests: 2 083 ✔️ passed, 0 💤 skipped, 0 failed — 118 suites, 1 file, 2m 48s ⏱️

Results for commit cc1a941.

♻️ This comment has been updated with latest results.

@github-actions bot (Contributor) commented Oct 4, 2023

E2E Tests Published Test Results

0 tests: 0 ✔️ passed, 0 💤 skipped, 0 failed — 0 suites, 0 files, 0s ⏱️

Results for commit cc1a941.

♻️ This comment has been updated with latest results.

@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch 2 times, most recently from cadd42b to 722ebb4 Compare October 5, 2023 02:42
@jessesuen jessesuen marked this pull request as ready for review October 5, 2023 16:00
@jessesuen (Member, Author) commented Oct 5, 2023

This is ready for review (remaining work is to add unit + e2e tests)

EDIT: unit tests and e2e tests have been added

Review thread on rollout/canary.go (outdated, resolved)
@jessesuen jessesuen force-pushed the fix/rollback-stable-unavailability branch from 13099ef to 86d83ed Compare October 17, 2023 03:25
@zachaller zachaller enabled auto-merge (squash) October 23, 2023 11:43
@zachaller zachaller force-pushed the fix/rollback-stable-unavailability branch from 86d83ed to cc1a941 Compare October 23, 2023 11:44
@sonarcloud bot commented Oct 23, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 4 (rating E)
Code Smells: 1 (rating A)
No coverage information
Duplication: 12.1%

@zachaller zachaller merged commit 7f25955 into argoproj:master Oct 23, 2023
24 checks passed
zachaller pushed a commit that referenced this pull request Oct 25, 2023
fix: rollback to stable with dynamicStableScale could overwhelm stable pods (#3077)

* fix: rollback to stable with dynamicStableScale could go under maxUnavailable

Signed-off-by: Jesse Suen <[email protected]>

* test: add unit tests

Signed-off-by: Jesse Suen <[email protected]>

* test: add e2e tests

Signed-off-by: Jesse Suen <[email protected]>

* refactor: move isReplicaSetReferenced to replicaset.go

Signed-off-by: Jesse Suen <[email protected]>

---------

Signed-off-by: Jesse Suen <[email protected]>
zachaller pushed a commit that referenced this pull request Oct 25, 2023
fix: rollback to stable with dynamicStableScale could overwhelm stable pods (#3077) — same commit series as above, additionally Signed-off-by: zachaller <[email protected]>
@jessesuen jessesuen deleted the fix/rollback-stable-unavailability branch October 25, 2023 17:02
balasoiuroxana pushed a commit to balasoiuroxana/argo-rollouts that referenced this pull request Oct 27, 2023
fix: rollback to stable with dynamicStableScale could overwhelm stable pods (argoproj#3077) — same commit series as above, additionally Signed-off-by: balasoiu <[email protected]>
@zachaller added the cherry-pick-completed label (Used once we have cherry picked the PR to all requested releases) and removed the cherry-pick/release-1.4 label Oct 30, 2023
@Gabryel8818 commented
Hello, I'm seeing the same problem in 1.6.6.
[screenshot]

Labels
cherry-pick/release-1.5, cherry-pick/release-1.6, cherry-pick-completed (Used once we have cherry picked the PR to all requested releases)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fast-tracked rollback to stable with dynamicStableScale went under maxUnavailable
3 participants