Describe the bug
Hey folks, we hit what we believe is a pretty nasty/unlucky edge case in which a rollout completed and scaled down the old ReplicaSet in a blue-green style deployment while still sending 100% of traffic to the old ReplicaSet's target group. This led to an outage for us that unfortunately lasted a few minutes. To set some context, we use a relatively simple strategy; a sketch is included below.
On the traffic side we use ALBs managed by the AWS Load Balancer Controller (running on EKS), and we have target group weight verification enabled. During this rollout, both the AWS Load Balancer Controller and Argo Rollouts were throttled by the AWS Elastic Load Balancing APIs, leading to slow convergence and errors. That is a problem on its own, but not one we would expect to leave the rollout in a bad state. Unfortunately, however, there is an apparent gap.
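Roughly, the strategy looks like the sketch below. This is an illustrative reconstruction rather than our exact manifest (names, replica counts, image, and intermediate steps are made up); the relevant parts are the ALB traffic routing, the final setWeight: 100 step, and the 30s scale-down wait:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-app                      # illustrative name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example.registry/app:latest
  strategy:
    canary:
      canaryService: example-app-canary  # each service fronts its own ALB target group
      stableService: example-app-stable
      scaleDownDelaySeconds: 30          # the 30s scale-down wait mentioned below
      trafficRouting:
        alb:
          ingress: example-app           # Ingress managed by the AWS Load Balancer Controller
          servicePort: 80
      steps:
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 100
```

Target group weight verification itself is not part of the Rollout spec; it is enabled controller-side (the --alb-verify-weight flag on the Argo Rollouts controller, if I have the flag name right).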
In this instance, our rollout proceeded fairly normally up to the setWeight: 100 step. This step was successfully applied to the ingress, but failed to verify due to `operation error Elastic Load Balancing v2: DescribeLoadBalancers, exceeded maximum number of attempts, 3`. Let's call this reconciliation (1). At this time the actual ALB correctly has the weight at 100% to the canary target group and 0% to stable, but verification fails because of the throttling. We now enter reconciliation (2).
Traffic reconciliation (2) then made the decision to set the desired weight of the canary back to 0 and the weight of the stable back to 100. We determined this occurred from entering this branch in reconciliation (rollout/trafficrouting.go line 213 at 723f7a9): our stable ReplicaSet actually had some pod churn at this time for unrelated reasons, so Rollouts made the call to set the traffic weights back because of this check: https://github.com/argoproj/argo-rollouts/blob/master/utils/replicaset/canary.go#L41. That is usually a fine thing to do; however, the slowness of the AWS LB Controller in processing this update leads to a problem. This update (canary 0 / stable 100) is now queued up in the AWS LB Controller, which is very slow and backed up due to throttling. We fail to verify this one as well, because the actual ALB is still in the state from reconciliation (1) (canary 100 / stable 0).
We now start traffic reconciliation (3), where things go wrong. At this point the ALB is still in the state of the world from (1) (canary 100 / stable 0), while the AWS LB Controller is still slowly, with backoff, attempting to apply the update from (2) (canary 0 / stable 100). Reconciliation now decides the ReplicaSets are stable and respects the setWeight: 100 again, so our desired state is canary 100 / stable 0. We apply this to the ingress; the AWS LB Controller does not sync this update yet, as it is still processing (2). We then verify the state of the world by checking the ALB, which is still at (canary 100 / stable 0) from (1). Verification passes and the rollout is considered complete from a traffic-shifting perspective. The AWS LB Controller finally converges update (2) and re-reads the ingress object to start converging (3); at this point the ALB is at 100% stable, 0% canary. Argo completes the rollout and begins scaling down the previous stable ReplicaSet, which is still receiving 100% of traffic, because it is now considered old. After 30s, our configured scale-down wait, the old ReplicaSet is scaled down and we are now sending 100% of traffic to an empty target group. After 5 minutes the AWS LB Controller finally manages to converge, setting traffic to 100% canary (now stable) and 0% to the previous stable, and our outage ends.
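For concreteness, the object the two controllers are converging on in the sequence above is the weighted forward action that Argo Rollouts writes onto the Ingress and that the AWS Load Balancer Controller then pushes to the ALB listener rule (and that weight verification reads back from the ALB). A rough sketch with illustrative names; the exact annotation key and action name depend on the root/stable service:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app                      # illustrative; matches trafficRouting.alb.ingress
  annotations:
    # Argo Rollouts rewrites this forward action on every setWeight step;
    # the AWS Load Balancer Controller then converges it onto the ALB listener rule.
    alb.ingress.kubernetes.io/actions.example-app-stable: |
      {
        "Type": "forward",
        "ForwardConfig": {
          "TargetGroups": [
            { "ServiceName": "example-app-canary", "ServicePort": "80", "Weight": 100 },
            { "ServiceName": "example-app-stable", "ServicePort": "80", "Weight": 0 }
          ]
        }
      }
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app-stable # must match the action name in the annotation above
                port:
                  name: use-annotation   # tells the ALB controller to use the annotation's action
```

The race in (1)-(3) is essentially Rollouts verifying weights against whatever the ALB happens to hold at that moment, while the annotation updates queue up behind throttling in the LB controller.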
To Reproduce
It would take a fairly complicated set of distributed state machines to reproduce this, but the configuration and components in use are detailed above.
Expected behavior
I think the fundamental issue here is probably that we verify a weight from a previous state of the world (state (1)) when we should be verifying state (3). In other words, the verification in step (3) should probably fail (or at least not pass yet), because the weights observed on the ALB were not actually produced by this update; the AWS LB Controller had not yet processed it.
Screenshots
Version
1.5.1
Logs
# Paste the logs from the rollout controller
# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts
# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>
Many logs; I'm sharing the full dump here, with # NB added at the relevant reconciliation points: https://gist.github.com/dlmather/45b620e2a388aed4a9342c5b3aff4510
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.