After abort, rollout does not update Istio DestinationRule #1838
Hi @mksha, could you confirm whether this happens on v0.3.0? Have you tried v1.1.1 or the latest argo-rollouts image?
Yes, we are using 1.1.1.
@mksha, could you paste here the output from
@huikang, let me know if you need more details.
Hi @mksha, thanks for the detailed info. It would be even better if you could provide the output of
Also, how about the status of the virtual service
I couldn't reproduce the error with the latest master branch: after aborting an update, the virtualservice is set with 100% weight to the stable subset.
The destination has the correct pod labels:
And the stable pod is still running
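The healthy post-abort state described here would look roughly like this. This is a hypothetical sketch, not output from the actual cluster; the resource names and host are placeholders:

```yaml
# Hypothetical VirtualService after a successful abort:
# 100% of traffic weighted to the stable subset.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollout-vsvc        # placeholder name
spec:
  hosts:
  - rollout-svc             # placeholder service host
  http:
  - route:
    - destination:
        host: rollout-svc
        subset: stable
      weight: 100
    - destination:
        host: rollout-svc
        subset: canary
      weight: 0
```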
@mksha, in your case, I noticed the following in the rollout status.
The canary pod is supposed to have a different hash value. I remember a bug related to this was fixed after v1.1.1. Could you try using the latest version?
@huikang, there you go.
After aborting the rollout:
@mksha, could you try the latest image? I think the issue has been fixed there.
Hi @huikang, after the abort both subsets in the destinationrule have the same rollouts-pod-template-hash: f4dcdc666, because the canary pod is no longer there to serve canary traffic. If the destinationrule does not point both the stable and canary subsets to stable, canary requests will fail.
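That desired state would look something like the sketch below. The host name is a placeholder; the hash value is taken from the comment above, but the rest is an assumption about the shape of the resource, not actual cluster output:

```yaml
# Hypothetical DestinationRule after abort: both subsets
# select the stable ReplicaSet's pods.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: rollout-destrule    # placeholder name
spec:
  host: rollout-svc         # placeholder service host
  subsets:
  - name: stable
    labels:
      rollouts-pod-template-hash: f4dcdc666   # stable ReplicaSet hash
  - name: canary
    labels:
      rollouts-pod-template-hash: f4dcdc666   # same as stable after abort
```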
@huikang, do you mean trying an image built from the master branch?
From the rollout that was aborted using image v1.1.1:
But the destinationrule is still not updated after the canary pod is deleted.
@huikang, after the abort both subsets should point to the stable hash,
or the pod hash label should be removed completely after the abort.
I just figured out that in the example,
But in our case after abort, it's
That is what is being copied to the destinationrule, and that is why it's causing issues.
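A sketch of the broken state being described, assuming a hypothetical canary hash value (`5b7c8dd9f` is made up for illustration; only `f4dcdc666` appears in the thread):

```yaml
# Broken: after abort, the canary subset still selects the
# deleted canary ReplicaSet's pods (of which there are none).
subsets:
- name: stable
  labels:
    rollouts-pod-template-hash: f4dcdc666   # stable RS, pods running
- name: canary
  labels:
    rollouts-pod-template-hash: 5b7c8dd9f   # hypothetical canary RS hash, 0 pods
```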
Tested with the latest version and got the following after aborting a canary update.
The destinationrule matches the status:
@huikang, which means it's wrong, because in that case the canary subset will point to a canary replicaset that has no pods running, meaning canary requests will fail. Ideally, when we do a canary deployment in production, some amount of traffic goes to the canary version of the app while we perform analysis. If the analysis fails and the rollout is aborted, the canary subset will point to a canary replicaset with no pods, so all clients pinned to the canary version will see issues. We need the same behavior as at the end of a full promotion, where both the canary and stable subsets point to the stable replicaset.
Although the canary in the rollout status points to the replicaset with 0 pods, the virtualservice has 0 weight to the canary rs, so no traffic will be sent to that replicaset.
I agree that after the update is aborted, the status should reset so that canary and stable both point to the stable replicaset.
Even if the virtual service has weight 0 for the canary, there is a case where, with session affinity enabled, traffic will still go to the canary replicaset, because the virtual service has a rule defined for our canary session-affinity cookie. Apart from that, let me know if I can help fix this bug. I already found the place where we update the hash, but I'm not able to find why it differs from a full promotion. argo-rollouts/rollout/trafficrouting.go Lines 130 to 179 in 5f0f8b4
If you take a look at the above code, in all cases it calls updateHash, but we need to find the place where it sets canaryHash to stableHash in the case of a full promotion.
I think the root cause is argo-rollouts/rollout/trafficrouting.go Lines 126 to 128 in 1867742
Even after the update is aborted, argo-rollouts/rollout/replicaset.go Lines 144 to 147 in 0309eb6
I might be missing some cases. @jessesuen, could you provide any guidance?
I don't think this would work because
Fundamentally, this is the issue. We currently assume that if we set a weight of 0, then no traffic will hit the RS. But you are explaining that this isn't always the case when used in conjunction with session affinity. Can you provide the example VirtualService/DestinationRule with session affinity that causes this, so I can better understand? When this combination of session affinity and canary weight happens, how does Istio decide which takes priority?
Basically, in the Virtualservice we have a rule based on the canary session-affinity cookie. In this case, if the destinationrule canary subset points to a pod that no longer exists after the rollout abort, then all canary requests will fail.
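A minimal sketch of the kind of cookie-based rule being described. This is an assumption about the setup, not the actual manifest from the thread: the cookie name, regex, and host are placeholders. The key point is that a header match takes precedence over the weighted route, so weight 0 does not protect the canary subset:

```yaml
# Hypothetical VirtualService with a session-affinity rule.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: rollout-vsvc        # placeholder name
spec:
  hosts:
  - rollout-svc             # placeholder service host
  http:
  - match:                  # matched requests skip the weighted route below
    - headers:
        cookie:
          regex: ".*canary-affinity=true.*"   # placeholder cookie rule
    route:
    - destination:
        host: rollout-svc
        subset: canary      # fails if this subset selects no pods
  - route:
    - destination:
        host: rollout-svc
        subset: stable
      weight: 100
    - destination:
        host: rollout-svc
        subset: canary
      weight: 0
```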
So as a solution, we can keep the RS status as it is, but update the DestinationRule to remove the pod-template-hash labels completely from both the canary and stable subsets when the rollout is aborted. When we start the rollout again or retry, we set the destinationrule back to having the pod-template-hash labels.
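Under that proposal, after an abort the subsets would carry only their base selector labels. A sketch, where `app: rollout-app` is a placeholder for whatever common labels the service's pods share:

```yaml
# Hypothetical post-abort DestinationRule under the proposal:
# hash labels stripped, so both subsets select the service's pods.
subsets:
- name: stable
  labels:
    app: rollout-app        # rollouts-pod-template-hash removed
- name: canary
  labels:
    app: rollout-app        # both subsets now resolve to running pods
```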
@jessesuen @huikang, any feedback on the above?
I think we need to have that logic outside of the if c.shouldDelayScaleDownOnAbort() {} block, because in the case of a rollout abort it will not go inside that if block at all.
@huikang @jessesuen, any thoughts?
@jgwest, any thoughts on this?
This issue is stale because it has been open 60 days with no activity. |
Any news on this issue?
Not yet. I would be open to PRs. I might have time to get to it for 1.6, but no guarantee.
This issue is stale because it has been open 60 days with no activity. |
We have the same issue, with a more serious situation. Let us know if you need more debugging information.
Summary
What happened/what you expected to happen?
When we abort the rollout, it should update the Istio destinationrule so that its subsets have the correct rollouts-pod-template-hash labels.
But currently, aborting the rollout does not update the destinationrule, causing the canary endpoint to fail.
Diagnostics
What version of Argo Rollouts are you running?
0.3.0
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.