fix: rollback to stable with dynamicStableScale could overwhelm stable pods #3077
Here is the new order of events after this PR:
Codecov Report
Additional details and impacted files
@@ Coverage Diff @@
## master #3077 +/- ##
==========================================
+ Coverage 81.82% 81.85% +0.02%
==========================================
Files 134 134
Lines 20507 20556 +49
==========================================
+ Hits 16780 16826 +46
- Misses 2864 2866 +2
- Partials 863 864 +1
This is ready for review. (EDIT: unit tests and e2e tests have been added.)
…vailable Signed-off-by: Jesse Suen <[email protected]>
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
…e pods (#3077) * fix: rollback to stable with dynamicStableScale could go under maxUnavailable Signed-off-by: Jesse Suen <[email protected]> * test: add unit tests Signed-off-by: Jesse Suen <[email protected]> * test: add e2e tests Signed-off-by: Jesse Suen <[email protected]> * refactor: move isReplicaSetReferenced to replicaset.go Signed-off-by: Jesse Suen <[email protected]> --------- Signed-off-by: Jesse Suen <[email protected]>
…e pods (#3077) * fix: rollback to stable with dynamicStableScale could go under maxUnavailable Signed-off-by: Jesse Suen <[email protected]> * test: add unit tests Signed-off-by: Jesse Suen <[email protected]> * test: add e2e tests Signed-off-by: Jesse Suen <[email protected]> * refactor: move isReplicaSetReferenced to replicaset.go Signed-off-by: Jesse Suen <[email protected]> --------- Signed-off-by: Jesse Suen <[email protected]> Signed-off-by: zachaller <[email protected]>
…e pods (argoproj#3077) * fix: rollback to stable with dynamicStableScale could go under maxUnavailable Signed-off-by: Jesse Suen <[email protected]> * test: add unit tests Signed-off-by: Jesse Suen <[email protected]> * test: add e2e tests Signed-off-by: Jesse Suen <[email protected]> * refactor: move isReplicaSetReferenced to replicaset.go Signed-off-by: Jesse Suen <[email protected]> --------- Signed-off-by: Jesse Suen <[email protected]> Signed-off-by: balasoiu <[email protected]>
Resolves #3020
Changes made:
1. Safer scaledown decisioning of "old" ReplicaSets
When scaling down "old" ReplicaSets, we now additionally check if they are still referenced by services before allowing them to be scaled down. This is an overall safety improvement to make sure we scale down older ReplicaSets only after it doesn't matter anymore (nothing is pointing to them).
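The check described above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual `isReplicaSetReferenced` implementation: the `ReplicaSet` and `Service` types are simplified, hypothetical stand-ins for the Kubernetes objects, and the sketch assumes a service pins itself to one ReplicaSet through a `rollouts-pod-template-hash` selector entry.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the Kubernetes objects involved.
type ReplicaSet struct {
	Name            string
	PodTemplateHash string // assumed to mirror the pod template hash label
	Replicas        int
}

type Service struct {
	Name     string
	Selector map[string]string
}

// hashLabel is the selector key assumed to tie a Service to one ReplicaSet.
const hashLabel = "rollouts-pod-template-hash"

// referencedBy reports whether any service still selects the given ReplicaSet.
func referencedBy(rs ReplicaSet, services []Service) bool {
	for _, svc := range services {
		if hash, ok := svc.Selector[hashLabel]; ok && hash == rs.PodTemplateHash {
			return true
		}
	}
	return false
}

// scaleDownCandidates returns the old ReplicaSets that still have replicas
// and that no service points to, so scaling them down cannot drop traffic.
func scaleDownCandidates(oldRSs []ReplicaSet, services []Service) []ReplicaSet {
	var safe []ReplicaSet
	for _, rs := range oldRSs {
		if rs.Replicas > 0 && !referencedBy(rs, services) {
			safe = append(safe, rs)
		}
	}
	return safe
}

func main() {
	services := []Service{
		{Name: "canary", Selector: map[string]string{hashLabel: "abc123"}},
		{Name: "stable", Selector: map[string]string{hashLabel: "def456"}},
	}
	oldRSs := []ReplicaSet{
		{Name: "rollout-abc123", PodTemplateHash: "abc123", Replicas: 3},
		{Name: "rollout-0a1b2c", PodTemplateHash: "0a1b2c", Replicas: 2},
	}
	for _, rs := range scaleDownCandidates(oldRSs, services) {
		fmt.Println("safe to scale down:", rs.Name)
	}
}
```

In this sketch, `rollout-abc123` is skipped because the canary service still selects it, while `rollout-0a1b2c` is eligible because nothing points to it.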
2. Delay canary service selector switch, if we are rolling back to stable with dynamicStableScale and it is not fully available
Before this PR, when the user re-applied the stable pod spec, the rollout controller would immediately change the selector of the canary service back to the stable RS. This meant that all production traffic immediately hit the stable RS, even if it was undersized due to dynamicStableScale. This PR detects whether dynamicStableScale is used and, if the stable RS is not fully available, delays the service switch until it is. Because we no longer touch the canary service in this scenario, the controller can continue to balance traffic between the stable RS and the previous desired RS, so that traffic can be safely shifted away from the previous desired RS (improvement 3 below).
3. Gradually shift traffic back to stable RS to avoid overwhelming stable pods
A user can re-apply the stable manifest in the middle of an update. When this happens, stableRS == desiredRS. Before this PR, we would then immediately shift 100% of the traffic back to stableRS (by setting weight to 0), even though it may have been undersized due to dynamicStableScale. With this PR, we now follow logic similar to an abort with dynamicStableScale, and only increase traffic back to stable in accordance with stable's available replica count. A key difference between abort vs. stable rollback is that:
Because of this difference, the trafficrouting logic has special case logic for traffic splitting to happen between previous desired vs. stable in the event of a rollback to stable.
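Changes 2 and 3 can be illustrated together with a small sketch. It is not the controller's code: the function names, parameters, and the integer rounding are assumptions made for illustration. The idea is that the selector switch is delayed while the stable RS is undersized, and the stable weight is capped by the fraction of desired replicas that stable currently has available.

```go
package main

import "fmt"

// shouldDelaySelectorSwitch reports whether the canary service selector
// should stay on the previous desired RS for now. Hypothetical helper: it
// delays only when dynamicStableScale is on and stable is not fully available.
func shouldDelaySelectorSwitch(dynamicStableScale bool, stableAvailable, desiredReplicas int) bool {
	return dynamicStableScale && stableAvailable < desiredReplicas
}

// stableWeight returns the percentage of traffic the stable RS is assumed
// able to absorb, proportional to its available replicas. Integer division
// rounds down so stable is never assigned more than its share.
func stableWeight(stableAvailable, desiredReplicas int) int {
	if desiredReplicas <= 0 {
		return 100 // assumption: with nothing desired there is nothing to protect
	}
	w := stableAvailable * 100 / desiredReplicas
	if w > 100 {
		w = 100
	}
	if w < 0 {
		w = 0
	}
	return w
}

func main() {
	const desired = 10
	// Simulate stable scaling back up during a rollback to stable.
	for _, available := range []int{2, 5, 8, 10} {
		stable := stableWeight(available, desired)
		fmt.Printf("stable available=%d: stable=%d%% previous desired=%d%% delay selector switch=%v\n",
			available, stable, 100-stable, shouldDelaySelectorSwitch(true, available, desired))
	}
}
```

Rounding down is a deliberate choice in this sketch: when the ratio is not a whole percentage, the remainder stays with the previous desired RS, which is still fully scaled.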
I will be adding an e2e test for this.