Argo rollouts will scale down stable when canary is missing pods #2050
Comments
Can you post the manifest files that you are using?
I've attached the Rollout manifest. I can attach others if they'd be useful. We're using Linkerd as the service mesh in the cluster.
@perenesenko can you provide your findings here?
@MarkSRobinson Could you also provide the rollout info:
From the logs, I see that the canary pods should have reached 100% ready, as shown in the first line:
We shift traffic only once the number of ready canary pods has reached the desired number, so it should have been at 100%.
I have to figure out why 100% of the pods were ready on the previous step but not on the step that switches the labels.
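To illustrate the gate being described here, a minimal Go sketch (assumed names, not the actual argo-rollouts code) of "only shift traffic once the ready canary count reaches the desired count" might look like this:

```go
// Hedged sketch of the readiness gate described above; names are illustrative.
package main

import "fmt"

// canaryReadyForTrafficShift reports whether the canary ReplicaSet has enough
// ready pods to receive the next traffic weight.
func canaryReadyForTrafficShift(readyCanaryPods, desiredCanaryPods int32) bool {
	return readyCanaryPods >= desiredCanaryPods
}

func main() {
	// Example: 30 of 110 canary pods lost readiness after a node/network
	// failure, so the traffic shift should not proceed.
	fmt.Println(canaryReadyForTrafficShift(80, 110)) // false
}
```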
Here's the full rollout object. What I can figure happened is that we had a networking problem causing 30 pods to fall out of readiness (either the node itself died or networking on the node died; we're not certain). From the logs, it sounds like this could be prevented by either forcing the networking switch even though pods are missing, or cancelling the scale down until the networking is switched successfully.
I think the key issue is that the rollout continues reconciliation when the selector swap is delayed. That is, https://sourcegraph.com/github.com/argoproj/argo-rollouts/-/blob/rollout/service.go?L284 returns. If there were an error within the service "ensureSVCTargets" function, then the controller would skip this cycle (as is expected) and not continue to update the weights on the rollout (resulting in no traffic loss). I have added a pull request that I think will fix this issue, but let me know if I am incorrect here.
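As a rough illustration of the control flow described in this comment, here is a hedged Go sketch (function and error names are hypothetical, not the real controller code) of skipping the weight update for the cycle when the selector swap has not happened:

```go
// Hedged sketch: if swapping the Service selector to the new ReplicaSet fails,
// return the error so the reconcile loop exits before traffic weights are
// updated; the work is retried on the next cycle.
package main

import (
	"errors"
	"fmt"
)

var errSelectorNotSwapped = errors.New("service selector does not yet point at the canary ReplicaSet")

// reconcileTrafficRouting is illustrative; the real controller structures this differently.
func reconcileTrafficRouting(selectorSwapped bool) error {
	if !selectorSwapped {
		// Returning here skips the weight update for this cycle instead of
		// shifting traffic toward pods that are not ready to receive it.
		return errSelectorNotSwapped
	}
	// ... update desired traffic weights here ...
	return nil
}

func main() {
	if err := reconcileTrafficRouting(false); err != nil {
		fmt.Println("skipping weight update this cycle:", err)
	}
}
```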
One of our apps with over 170 pods suffered an outage due to this bug: #2235. Reading this issue, it seems this bug always had the potential to hit since using
Sometimes disabling
This is also interesting: #1820
For what it is worth, I think #2187 is ready to go -- I had originally scoped it out to all configurations of Argo Rollouts, but some of the end-to-end tests actually rely on swapping service selectors while the canary ReplicaSet is still not ready, so I dialed it back to just
I have applied the patch of the change to the
Please test out this new version before you put it into a production environment: I was not able to construct a consistent test for this because the failure mode relies on obstructing pods from becoming ready at a specific point in time. That being said, I am confident that it is ready to be tested and cautiously rolled out.
@jandersen-plaid Thanks for updating that. I will also take a look at what you have done to "dial" it back a bit. I have been slowly working on another fix for this that just changes the calculation to account for availability instead of introducing an error state; if it pans out, I think it would make more sense to go that route. However, if it does not pan out, I think what you did will also end up making sense. I will try to get the calculation changes figured out soon; I have just been busy with lots of other things, but I should have some time to really dedicate to finishing it.
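As a hedged sketch of that alternative (the exact formula and names are assumptions, not the actual calculation in the PR), deriving the stable scale from the available canary replicas rather than the desired canary count could look like:

```go
// Hedged sketch: cap how far the stable ReplicaSet may scale down by the
// number of canary replicas that are actually available to take traffic.
package main

import "fmt"

// allowedStableReplicas keeps enough stable replicas so that
// stable + available canary >= total desired replicas.
func allowedStableReplicas(stableReplicas, canaryAvailable, totalDesired int32) int32 {
	needed := totalDesired - canaryAvailable
	if needed < 0 {
		needed = 0
	}
	if needed > stableReplicas {
		needed = stableReplicas
	}
	return needed
}

func main() {
	// With the numbers from this report: 110 desired, only 80 canary pods
	// available, so 30 stable replicas are kept instead of scaling to zero.
	fmt.Println(allowedStableReplicas(110, 80, 110)) // 30
}
```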
Great! I took a look at the PR and your approach overall seems more correct to me (adding error states generally leads to difficulty in discerning when the state is updated 😢), so I look forward to the final result!
For what it is worth, I had the exact same end-to-end tests fail for me as well (
Adding the condition that
@jandersen-plaid Here is the PR.
@MarkSRobinson Are you able to reproduce this reliably enough that you could help test it if I were to get you a build?
@zachaller The bug doesn't happen reliably, so I can test this out, but it might take a while to get feedback. My concern is that this PR is built on the 1.4 branch, and I'm not entirely sure I want to test all of those changes in production. Let me see if I can backport this to the 1.3 release branch.
OK, fix backported: #2449
@MarkSRobinson Do you plan on building a Docker image with that patch based on 1.3, or would you like me to? Also note that I refactored the PR a bit to simplify it.
@zachaller I built it and pushed it to our internal repo. We're testing it out on the testing cluster right now.
@MarkSRobinson Any status on this? This has now caused TWO production outages for our organization and, as far as we are concerned, is a massive bug that needs immediate fixing.
(Just so you know, @MarkSRobinson, @jandersen-plaid, and I aren't affiliated with the Argo projects; we're users like yourself. We're right there with you: this issue has caused outages for us as well, and we're eager to help get it fixed as soon as possible.)
@raxod502-plaid @jandersen-plaid @MarkSRobinson Apologies: I was doing a lot of avatar clicking to see who was officially on the project and mistook Mark for one of the maintainers.
Summary
Argo scaled down the stable ReplicaSet even though the new ReplicaSet was not fully available. In this case, it scaled from 110 pods down to 0 but didn't update the routing until the new ReplicaSet was 100% ready.
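A minimal Go sketch (illustrative only, not project code) of why this produces an outage: the pods actually serving traffic are whichever ReplicaSet the routing still points at, so scaling stable to zero before the routing switch leaves zero serving pods:

```go
// Hedged illustration of the failure mode in the summary.
package main

import "fmt"

// servingPods returns how many pods are behind the selector that traffic is
// currently routed to.
func servingPods(routedToCanary bool, stableReplicas, canaryAvailable int32) int32 {
	if routedToCanary {
		return canaryAvailable
	}
	return stableReplicas
}

func main() {
	// Routing still points at stable, but stable was scaled from 110 to 0.
	fmt.Println(servingPods(false, 0, 80)) // 0 pods serving traffic -> outage
}
```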
Diagnostics
What version of Argo Rollouts are you running?
1.2.0
I've tried to be brief with these logs, but I can get the full logs if you would like.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.