
A scaling event during a post-promotion analysis can leave the rollout in an irreconcilable state, as the stableRS is not updated. #3663

Closed
benminter-treatwell opened this issue Jun 24, 2024 · 1 comment · Fixed by #3664
Labels
bug Something isn't working

Comments

Contributor

benminter-treatwell commented Jun 24, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of Argo Rollouts.

Describe the bug
A scaling event that occurs during a post-promotion analysis template run causes only the active ReplicaSet to be scaled, not the stable ReplicaSet. This in turn leaves the rollout in an irreconcilable state, as it waits for the stable ReplicaSet to reach the replica count the rollout requires.

To Reproduce

  1. Deploy a new version of a blue-green rollout with a post-promotion analysis template
  2. Wait for the rollout to automatically promote the new ReplicaSet and run its post-promotion analysis template
  3. While the analysis template runs, the new ReplicaSet is considered active but not stable
  4. Trigger a scaling event during this period (while the new ReplicaSet is considered active but not stable)
  5. The active (new) ReplicaSet is scaled to whatever replica count you request, but the stable ReplicaSet is untouched
  6. Wait for the analysis template to finish
  7. Observe that, because the stable (old) ReplicaSet's replica count does not match the count desired by the rollout, the rollout is stuck waiting for minimum availability
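For reference, the steps above assume a Rollout shaped roughly like the following. This is a minimal sketch; the names, services, image, and analysis template are hypothetical and not taken from the report:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2           # step 1: deploy a new version
  strategy:
    blueGreen:
      activeService: my-app-active
      previewService: my-app-preview
      autoPromotionEnabled: true   # step 2: promotes automatically
      postPromotionAnalysis:       # steps 2-3: runs after promotion,
        templates:                 # while the old RS is still stable
        - templateName: smoke-test
```

The scaling event in step 4 can then be triggered, for example, by editing `spec.replicas` (or via the Rollout's scale subresource, e.g. `kubectl scale rollout/my-app --replicas=5`) while the analysis run is in progress.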

Expected behavior
During post-promotion analysis, a scaling event should change the replica count of both the active and the stable ReplicaSet. This ensures that if a genuine traffic spike and an error occur concurrently, we roll back to a stable ReplicaSet with enough replicas, and that the rollout does not get stuck waiting on an irreconcilable state.
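The expected behavior can be sketched as follows. This is a minimal, self-contained illustration using hypothetical types and a hypothetical `scaleForEvent` helper; it is not the actual Argo Rollouts reconciliation code:

```go
package main

import "fmt"

// ReplicaSet is a simplified, illustrative stand-in for the Kubernetes
// ReplicaSet object; only the fields relevant to this sketch are modeled.
type ReplicaSet struct {
	Name     string
	Replicas int32
}

// scaleForEvent sketches the expected handling of a scaling event while
// post-promotion analysis is running: the new ReplicaSet is active but the
// old one is still stable, so a scaling event should resize both of them,
// not just the active one.
func scaleForEvent(active, stable *ReplicaSet, desired int32) {
	active.Replicas = desired
	if stable != nil && stable.Name != active.Name {
		// Keeping the stable ReplicaSet in sync means an abort during a
		// concurrent traffic spike still rolls back to enough replicas, and
		// the rollout never waits forever for stable availability that a
		// replicas-only sync will not reconcile.
		stable.Replicas = desired
	}
}

func main() {
	active := &ReplicaSet{Name: "my-app-new", Replicas: 3}
	stable := &ReplicaSet{Name: "my-app-6b6649b84d", Replicas: 3}
	scaleForEvent(active, stable, 5) // the scaling event from the repro steps
	fmt.Println(active.Replicas, stable.Replicas)
}
```

With the buggy behavior, only `active.Replicas` would change, leaving the stable ReplicaSet at its old count and the rollout stuck waiting for minimum availability.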

Screenshots

Version
v1.7.0

Logs
An infinite loop of:

time="2024-06-21T10:59:40Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Reconciling stable ReplicaSet 'my-app-6b6649b84d'" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="No status changes. Skipping patch" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Queueing up Rollout for a progress check now" namespace=staging rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Reconciliation completed" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app time_ms=29.734882
time="2024-06-21T10:59:40Z" level=info msg="Started syncing rollout" generation=1324 namespace=staging resourceVersion=159634477 rollout=my-app
time="2024-06-21T10:59:40Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=my-app

Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@19neloyk

Hey! I've noticed this issue as well; waiting for the PR to be approved and merged.
