
A scaling event during a post-promotion analysis can leave the rollout in an irreconcilable state, as the stableRS is not updated. #3663

Closed
benminter-treatwell opened this issue Jun 24, 2024 · 1 comment · Fixed by #3664
Labels
bug Something isn't working

Comments

Contributor

benminter-treatwell commented Jun 24, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of Argo Rollouts.

Describe the bug
A scaling event that occurs during a post-promotion analysis template run causes only the active ReplicaSet to be scaled, not the stable ReplicaSet. This in turn leaves the rollout in an irreconcilable state, as it waits for the stable ReplicaSet to reach the replica count the rollout requires.

To Reproduce

  1. Deploy a new version of a blue-green rollout with a post-promotion analysis template
  2. Wait for the rollout to automatically promote the new ReplicaSet and run its post-promotion analysis template
  3. While the analysis template runs, the new ReplicaSet is considered active but not stable
  4. Trigger a scaling event during this period (while the new ReplicaSet is considered active but not stable)
  5. The active (new) ReplicaSet is scaled to whatever replica count you request, but the stable ReplicaSet is untouched
  6. Wait for the analysis template to finish
  7. Observe that, because the stable (old) ReplicaSet's replica count does not match the count desired by the rollout, the rollout is stuck waiting for minimum availability
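For reference, the steps above assume a Rollout shaped roughly like the following. This is a minimal sketch; the names, services, image, and analysis template are hypothetical and not taken from the report:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2           # step 1: deploy a new version
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: true   # step 2: promotes automatically
      postPromotionAnalysis:       # steps 2-3: runs after promotion,
        templates:                 # while the old RS is still stable
        - templateName: smoke-test
```

The scaling event in step 4 can then be triggered, for example, by editing `spec.replicas` (or via the Rollout's scale subresource, e.g. `kubectl scale rollout/my-app --replicas=5`) while the analysis run is in progress.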

Expected behavior
During post-promotion analysis, a scaling event should change the replica count of both the active and the stable ReplicaSet. This ensures that if a genuine traffic spike and an error occur concurrently, we roll back to a stable ReplicaSet with enough replicas, and that the rollout does not get stuck waiting on an irreconcilable state.
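The expected behavior can be sketched as follows. This is a minimal, self-contained illustration using hypothetical types and a hypothetical `scaleForEvent` helper; it is not the actual Argo Rollouts reconciliation code:

```go
package main

import "fmt"

// ReplicaSet is a simplified, illustrative stand-in for the Kubernetes
// ReplicaSet object; only the fields relevant to this sketch are modeled.
type ReplicaSet struct {
	Name     string
	Replicas int32
}

// scaleForEvent sketches the expected handling of a scaling event while
// post-promotion analysis is running: the new ReplicaSet is active but the
// old one is still stable, so a scaling event should resize both of them,
// not just the active one.
func scaleForEvent(active, stable *ReplicaSet, desired int32) {
	active.Replicas = desired
	if stable != nil && stable.Name != active.Name {
		// Keeping the stable ReplicaSet in sync means an abort during a
		// concurrent traffic spike still rolls back to enough replicas, and
		// the rollout never waits forever for stable availability that a
		// replicas-only sync will not reconcile.
		stable.Replicas = desired
	}
}

func main() {
	active := &ReplicaSet{Name: "my-app-new", Replicas: 3}
	stable := &ReplicaSet{Name: "my-app-6b6649b84d", Replicas: 3}
	scaleForEvent(active, stable, 5) // the scaling event from the repro steps
	fmt.Println(active.Replicas, stable.Replicas)
}
```

With the buggy behavior, only `active.Replicas` would change, leaving the stable ReplicaSet at its old count and the rollout stuck waiting for minimum availability.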

Screenshots

Version
v1.7.0

Logs
An infinite loop of:

time="2024-06-21T10:59:40Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Reconciling stable ReplicaSet 'my-app-6b6649b84d'" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="No status changes. Skipping patch" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Queueing up Rollout for a progress check now" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Reconciliation completed" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app time_ms=29.734882
time="2024-06-21T10:59:40Z" level=info msg="Started syncing rollout" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=my-app

Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@19neloyk

Hey! I've noticed this issue as well; waiting for the PR to be approved and merged.
