fix(trafficrouting): Do not block the switch of service selectors for single pod failures #2441
Conversation
Force-pushed from 65cd406 to 6c21625.
Codecov Report: Base 81.58% // Head 81.59% // increases project coverage by +0.01%.

@@            Coverage Diff             @@
##           master    #2441      +/-   ##
==========================================
+ Coverage   81.58%   81.59%   +0.01%
==========================================
  Files         124      124
  Lines       18959    18973      +14
==========================================
+ Hits        15467    15481      +14
  Misses       2702     2702
  Partials      790      790
if checkRsAvailability && !replicasetutil.IsReplicaSetPartiallyAvailable(rs) {
	logCtx.Infof("delaying service switch from %s to %s: ReplicaSet has zero availability", currSelector, desiredSelector)
	return nil
}
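For context, the gate above only asks whether the ReplicaSet has any availability at all. Below is a minimal sketch of what such a partial-availability check could look like; the lowercase helper name and its body are assumptions for illustration, not a copy of the repository's replicasetutil.IsReplicaSetPartiallyAvailable.

package example

import appsv1 "k8s.io/api/apps/v1"

// isReplicaSetPartiallyAvailable is a hedged sketch: it treats a ReplicaSet as
// partially available as soon as at least one replica is reported available.
// The project's actual helper may differ in detail.
func isReplicaSetPartiallyAvailable(rs *appsv1.ReplicaSet) bool {
	if rs == nil {
		return false
	}
	return rs.Status.AvailableReplicas > 0
}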
When the service switch is delayed, should this return an error instead, so that it does not move on to the next step, reconcileTrafficRouting?
Yes, that was actually one of the first things that was tried, and introducing that error state does fix the issue. However, there are dependencies within rollouts that we still have to continue reconciling to keep progressing. There is quite a bit of background in #2187 as well as in #1777, where the initial return nil was introduced. It was introduced to prevent the adoption of a service from causing downtime by making sure the service was fully available. The change in this PR keeps that behavior for the adoption case but switches to requiring only some availability for the cases where we have already adopted the service.
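To make that distinction concrete, here is a minimal sketch of the gating just described, assuming a hypothetical shouldDelayServiceSwitch helper; it illustrates requiring full availability only when adopting a Service, and only partial availability once the Service has already been adopted. It is not the PR's exact code.

package example

import appsv1 "k8s.io/api/apps/v1"

// shouldDelayServiceSwitch is an illustrative sketch of the behavior described
// above, not the implementation in this PR. adoptingService is true when the
// controller is pointing the Service at a managed ReplicaSet for the first time.
func shouldDelayServiceSwitch(adoptingService bool, rs *appsv1.ReplicaSet) bool {
	if rs == nil || rs.Spec.Replicas == nil {
		return true
	}
	if adoptingService {
		// Adoption case: keep the original, conservative behavior and wait
		// until every desired replica is available, so adopting a Service
		// cannot cause downtime.
		return rs.Status.AvailableReplicas < *rs.Spec.Replicas
	}
	// Already-adopted case: a single failing pod should not block the selector
	// switch; any availability at all is enough to proceed.
	return rs.Status.AvailableReplicas == 0
}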
Kudos, SonarCloud Quality Gate passed! 0 bugs, no coverage information.
LGTM!
Thank you!
@leoluz what else remains to get this merged? Any chance of it being treated as a more urgent bug that warrants another point release, rather than having to wait all the way until 1.4? This has caused my organization two major production outages, and I'd like to not make it a third :)
Great work!
@jessesuen are you a more appropriate party to ask how quickly this can be fast-tracked? This is a really severe bug, and to be quite honest I'd like to see an emergency 1.3.x release process for it, but I don't know how to go about asking for that here.
@jstewart612 I spoke with @zachaller about it and we can do a 1.3 patch release with this fix once we are ready. If you can't wait for the official patch release, as an alternative you could do an "in-house" release for your organization.
@leoluz: I appreciate the response. That's the aforementioned monkey-patching that I'd really like to avoid doing... but I will if I must. Any idea of the timetable for getting that done?
@jstewart612 We are planning to follow the timetable here as closely as possible for new releases such as 1.4. As for the 1.3.2 patch release, as soon as we hear back from the people at Plaid that the fix does indeed seem to solve the issue, we can do a 1.3.2 release.
@zachaller I'm not them, much to my chagrin, but I can confirm this resolves the issue in my local dev environment testing.
@raxod502-plaid @jandersen-plaid are you the aforementioned Plaid people?
@MarkSRobinson was testing this. Deployments take some time and the issue was never possible to reproduce on demand, so confirmation may take some time.
@raxod502-plaid @MarkSRobinson considering the official argoproj maintainers seem to be waiting on your word, could you speed that up?
If you would like to test this change, you are welcome to. Usually this situation arises when there is a lack of nodes or there are scheduling issues, so you will just have to simulate nodes failing to come up or pods failing to schedule and hope that you hit this bug.

In terms of Plaid "speeding things up": if you believe this change will solve the issue, then you are welcome to fork the project and apply this PR as a patch, or make whatever changes you would like to solve this particular issue (considering this is an open source project). If you want guarantees about timely fixes, then you would have to talk to the Argo team, who may or may not be able to offer you a support contract. For Plaid specifically, we are working at our own pace given internal priorities and resourcing, and offer no guarantees about when we will be finished testing the issue.
@zachaller based on this #2441 (comment), it seems a bad idea to wait for QA from resource-strapped reporters who make no guarantees about their ability to follow up. What data do you require of me to validate the fix, beyond my affirmation that it has done so in my environment?
And @jandersen-plaid, don't worry: I know what monkey-patching is, and my org is already doing it as a result of having to wait here. I would just like to stop doing it because of all the caveats of running in that state. As you rightly note, though, you are neither the one who decided to wait on your word nor presently possessed of the resources to do this in a timely fashion (empathy on that latter point!).
@jstewart612 I would be very curious how you tested this locally, because part of the reason I am waiting on their response is that I was never able to test this locally. This code deals with a very specific window of time at the end of a rollout that is quite hard, if not impossible, to replicate locally, so I am curious how you tested it, because it could allow me to create an e2e test for this issue. It would also give me more confidence in knowing how you reproduced and tested it. I do not plan on waiting for their response for this change to make it into the 1.4 release, because I am still pretty confident that the fixed behavior is correct. This issue should also really mostly affect
@jandersen-plaid Sorry to hear about the layoffs; I hope it does not affect you and others who have helped the rollouts community.
@zachaller Any Rollout with dynamicStableScale true. Change the image tag to trigger a rollout. It happens to us every time: the Rollout completes entirely, but the rollouts-pod-template-hash on the stableService remains the old value indefinitely until we manually change it, and I don't see anything in the logs indicating a failure or an attempt to change it.
@zachaller I have also tested dynamicStableScale false. Now I don't have downtime, but it takes over 35 minutes to scale down 25 pods, and tons of update errors show up in my logs about the ReplicaSet having been modified and the change not being applied. This, I am going to assume, is the real problem.
@jstewart612 Yeah, I was going to say: there must be something going on in your cluster that is causing the canary to never go 100% healthy, which causes the outage. With the very slow scale down, it sounds like something else is also updating your ReplicaSets, preventing the rollouts controller from updating the replica counts.
@zachaller At the same time: in the dynamicStableScale true case, you shouldn't be scaling down the stable RS while under this condition, agreed (that being the action that causes the downtime)?
@zachaller Well, it turns out it's the Lacework Admissions Controller and Argo Rollouts conflicting with each other. I'll file a separate bug ticket with them about that. dynamicStableScale true is still broken as previously claimed, even with this patch, but false does fine when someone else isn't jumping in and modifying objects out from under it :)
What makes you say this patch does not fix the dynamicStableScale case?
@zachaller
1. The new canary RS was up with one or more pods.
2. The Istio VirtualService was adjusted to weight one hundred percent to the canaryService (which was set as the final step of this particular Rollout).
3. The stable RS then scaled down to zero.
4. The Istio VirtualService then switched weights back to the stableService.
5. The rollouts-pod-template-hash label selector on the stableService still pointed at the old pod template hash it had just scaled to 0. Outage.
@zachaller what is the risk of merging the code & releasing a patch?
I plan on releasing a patch next week with or without an update from Plaid. I am going to go through PRs and see if there are any other issues worth pulling in as well.
… single pod failures (#2441)
* fix(traficrouter): WIP on not setting weight if not available
* fix tests
* try bailing vs setting weight
* work with expirments that do not set any weights
* fix test by commenting out code
* lint
* simplify logic
* switch logic
* add more comments
* add more comments
* add more test
* refactor test
* refactor code to reduce duplication
* change comments a bit
* remove else
fixes: #2050
fixes: #2235