
fix(trafficrouting): Do not block the switch of service selectors for single pod failures #2441

Merged
merged 15 commits into argoproj:master on Dec 12, 2022

Conversation

@zachaller (Collaborator) commented Nov 29, 2022

fixes: #2050
fixes: #2235

@zachaller changed the title from "fix(traficrouter): WIP on not setting weight if not available" to "fix(trafficrouter): WIP on not setting weight if not available" on Nov 29, 2022
@zachaller changed the title from "fix(trafficrouter): WIP on not setting weight if not available" to "fix(trafficrouting): WIP on not setting weight if not available" on Nov 29, 2022
github-actions bot commented Nov 29, 2022

Go Published Test Results

1 807 tests: 1 807 ✔️ passed, 0 💤 skipped, 0 failed · 102 suites · 1 file · 2m 31s ⏱️

Results for commit a621f05.

♻️ This comment has been updated with latest results.

github-actions bot commented Nov 29, 2022

E2E Tests Published Test Results

2 files · 2 suites · 1h 36m 41s ⏱️
92 tests: 85 ✔️ passed, 3 💤 skipped, 4 failed
190 runs: 178 ✔️ passed, 6 💤 skipped, 6 failed

For more details on these failures, see this check.

Results for commit a621f05.

♻️ This comment has been updated with latest results.

codecov bot commented Nov 29, 2022

Codecov Report

Base: 81.58% // Head: 81.59% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (a621f05) compared to base (3342427).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2441      +/-   ##
==========================================
+ Coverage   81.58%   81.59%   +0.01%     
==========================================
  Files         124      124              
  Lines       18959    18973      +14     
==========================================
+ Hits        15467    15481      +14     
  Misses       2702     2702              
  Partials      790      790              
Impacted Files Coverage Δ
rollout/service.go 78.97% <100.00%> (+1.24%) ⬆️
utils/replicaset/replicaset.go 88.32% <100.00%> (+0.05%) ⬆️


@zachaller added this to the v1.4 milestone on Dec 1, 2022
@zachaller marked this pull request as ready for review on December 1, 2022 at 14:49
@zachaller changed the title from "fix(trafficrouting): WIP on not setting weight if not available" to "fix(trafficrouting): Do not block switch of service selectors for single pod failures" on Dec 1, 2022
@zachaller changed the title from "fix(trafficrouting): Do not block switch of service selectors for single pod failures" to "fix(trafficrouting): Do not block the switch of service selectors for single pod failures" on Dec 1, 2022

if checkRsAvailability && !replicasetutil.IsReplicaSetPartiallyAvailable(rs) {
	logCtx.Infof("delaying service switch from %s to %s: ReplicaSet has zero availability", currSelector, desiredSelector)
	return nil
}
A reviewer (Contributor) commented:

When the service switch is delayed, should it return an err, so that it will not move on to the next step, reconcileTrafficRouting?

@zachaller (Collaborator, Author) replied Dec 2, 2022:

Yes, that was actually one of the first things that was tried, and introducing that error state does also fix the issue. However, there are dependencies within rollouts that we still have to continue reconciling in order to keep progressing. There is quite a bit of background in #2187 as well as in #1777, where the initial return nil was introduced. It was introduced to prevent the adoption of a service from causing downtime, by making sure the service was fully available first. The change in this PR keeps that behavior for the adoption case but switches to only needing some availability in the cases where we have already adopted the service.
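For illustration, here is a minimal sketch of the two availability checks that this distinction implies. The function names come from the diff above and from utils/replicaset/replicaset.go (listed in the coverage report); the exact bodies are assumptions, not the verbatim implementation:

package replicaset

import appsv1 "k8s.io/api/apps/v1"

// IsReplicaSetAvailable (assumed semantics): every desired pod is available.
// Used when the Rollout first adopts a Service, so the selector is never
// switched onto a ReplicaSet that cannot serve any traffic (see #1777).
func IsReplicaSetAvailable(rs *appsv1.ReplicaSet) bool {
	if rs == nil || rs.Spec.Replicas == nil || *rs.Spec.Replicas == 0 {
		return false
	}
	return rs.Status.AvailableReplicas >= *rs.Spec.Replicas
}

// IsReplicaSetPartiallyAvailable (assumed semantics): at least one pod is
// available. Used once the Service is already managed by the Rollout, so a
// single crashing or unschedulable pod no longer blocks the selector switch.
func IsReplicaSetPartiallyAvailable(rs *appsv1.ReplicaSet) bool {
	return rs != nil && rs.Status.AvailableReplicas > 0
}

In other words: full availability is still required when adopting a Service, while partial availability is enough to switch selectors on a Service the Rollout already manages.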

(A further review comment on rollout/service.go was marked outdated and resolved.)
sonarcloud bot commented Dec 5, 2022

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
0 Code Smells (rating A)

No coverage information
0.6% duplication

@leoluz (Contributor) left a comment:

LGTM!
Thank you!

@jstewart612 commented:

@leoluz what else remains to get this merged? Any chance of it being considered a more urgent bug for a second point release, rather than having to wait all the way until 1.4?

This has caused my organization two major production outages, and I'd like to not make it a third :)

@jessesuen (Member) left a comment:

Great work!

@jstewart612 commented:

@jessesuen are you a more appropriate party to ask how quickly this can be fast-tracked? This is a really fatal bug and, to be quite honest, I'd like to see an emergency 1.3.x release for this, but I don't know how to go about asking for that here.

@leoluz (Contributor) commented Dec 8, 2022:

@jstewart612 I spoke with @zachaller about it and we can do a 1.3 patch release with this fix once we are ready.

If you can't wait for the official patch release, as an alternative, you could do an "in-house" release for your organization.

@jstewart612 commented:

@leoluz:

I appreciate the response.

That's the aforementioned monkey-patching that I'd really like to avoid doing... but will if I must. Any idea of the timetable that was discussed for getting that done?

@zachaller (Collaborator, Author) commented Dec 8, 2022:

@jstewart612 We are planning to follow the release timetable for new releases such as 1.4 as closely as possible. As for the 1.3.2 patch release, as soon as we hear back from the people at Plaid that the fix does indeed seem to solve the issue, we can do a 1.3.2 release.

@jstewart612 commented:

@zachaller I'm not them, much to my chagrin, but can confirm this resolves the issue in my local dev env testing.

@jstewart612 commented:

@raxod502-plaid @jandersen-plaid are you the aforementioned Plaid people?

@raxod502-plaid commented:

@MarkSRobinson was testing this. Deployments take some time and the issue was never possible to reproduce on demand, so confirmation may take some time.

@jstewart612 commented Dec 8, 2022:

@raxod502-plaid @MarkSRobinson considering argoproj official maintainers seem to be holding on your word, could you speed that up?

@jandersen-plaid commented Dec 8, 2022:

> @raxod502-plaid @MarkSRobinson considering argoproj official maintainers seem to be holding on your word, could you speed that up?

If you would like to test this change you are welcome to -- usually this situation arises when there is a lack of nodes or scheduling issues, so you will just have to simulate nodes failing to come up or pods failing to schedule and hope that you hit this bug.

In terms of Plaid "speeding things up": if you believe this change will solve the issue then you are welcome to fork the project and apply this PR as a patch or make whatever changes you would like to solve this particular issue (considering this is an open source project). If you want guarantees about timely fixes then you would have to talk to the argo team who may or may not be able to offer you a support contract. For Plaid specifically, we are working at our own pace given internal priorities and resourcing and offer no guarantees about when we will be finished testing the issue.

@jstewart612 commented:

@zachaller based on this #2441 (comment) it seems a bad idea to wait on resource-strapped reporters for QA who make no guarantees about their ability to follow up. What data do you require of me to validate the fix beyond my affirmation that it has done so in my environment?

@jstewart612 commented Dec 8, 2022:

And @jandersen-plaid, don't worry: I know what monkey-patching is, and my org is already doing it as a result of having to wait here. I would just like to stop doing it because of all the caveats of running in that state. As you rightly note, though, you are neither the one deciding to wait on your word nor presently possessed of the resources to do this in a timely fashion (empathy on that latter point!).

@zachaller (Collaborator, Author) commented Dec 8, 2022:

@jstewart612 I would be very curious how you tested this locally, because part of the reason I am waiting on their response is that I was never able to test this locally. This code deals with a very specific window of time at the end of a rollout that is quite hard, if not impossible, to replicate locally, hence I am curious how you tested it, because it could allow me to create an e2e test for this issue. It would also give me more confidence in knowing how you reproduced and tested it. I do not plan on waiting for their response in order for this change to make it into the 1.4 release, because I am still pretty confident that the fixed behavior is correct. This issue should also really mostly affect dynamicStableScale: true, so you could turn that off as well and probably not experience downtime.

@jandersen-plaid Sorry to hear about the layoffs; I hope they do not affect you or others who have helped the rollouts community.
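Since that end-of-rollout window is hard to reproduce in a live cluster, a unit-test style sketch can at least exercise the single-pod-failure case directly. This is a hypothetical test built on the assumed IsReplicaSetPartiallyAvailable helper sketched earlier, not a test from the PR:

package replicaset

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// Hypothetical test: the new ReplicaSet wants 4 pods but one never becomes
// available (e.g. unschedulable), which should no longer block the switch of
// the service selector.
func TestSelectorSwitchNotBlockedBySinglePodFailure(t *testing.T) {
	rs := &appsv1.ReplicaSet{
		ObjectMeta: metav1.ObjectMeta{Name: "canary-rs"},
		Spec:       appsv1.ReplicaSetSpec{Replicas: int32Ptr(4)},
		Status:     appsv1.ReplicaSetStatus{AvailableReplicas: 3},
	}
	if !IsReplicaSetPartiallyAvailable(rs) {
		t.Fatal("a single unavailable pod should not delay the service selector switch")
	}

	// Zero availability must still delay the switch (the safeguard from #1777).
	rs.Status.AvailableReplicas = 0
	if IsReplicaSetPartiallyAvailable(rs) {
		t.Fatal("zero availability should delay the service selector switch")
	}
}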

@jstewart612 commented:

@zachaller Any Rollout with dynamicStableScale: true. Change the image tag to trigger a rollout. It happens to us every time: the Rollout completes entirely, but the rollouts-pod-template-hash selector on the stableService keeps the old value indefinitely until we change it manually, and I don't see anything in the logs indicating a failure or an attempt to change it.

@jstewart612 commented Dec 8, 2022:

@zachaller I have also tested dynamicStableScale: false. Now I don't have downtime, but it takes over 35 minutes to scale down 25 pods, and my logs show lots of update errors about the ReplicaSet having been modified so the update cannot be applied. This, I am going to assume, is the real problem.

@zachaller (Collaborator, Author) commented Dec 8, 2022:

@jstewart612 Yeah, I was going to say there must be something going on in your cluster that is causing the canary to never go 100% healthy, which causes the outage. Given the very slow scale-down, it also sounds like something else is updating your ReplicaSets and preventing the rollouts controller from updating the replica counts.

@jstewart612 commented Dec 8, 2022:

@zachaller At the same time: in the dynamicStableScale: true case, the stable RS shouldn't be scaled down while under this condition, agreed? That is the action that causes the downtime.

@jstewart612 commented:

@zachaller Well, it turns out it's the Lacework admission controller and Argo Rollouts conflicting with each other. I'll file a separate bug ticket with them about that.

dynamicStableScale true is still broken as previously claimed, even with this patch, but false does fine when someone else isn't jumping in and modifying objects out from under it :)

@zachaller (Collaborator, Author) commented:

> dynamicStableScale true is still broken as previously claimed, even with this patch, but false does fine when someone else isn't jumping in and modifying objects out from under it :)

What makes you say this patch does not fix the dynamicStableScale: true case?

@jstewart612 commented:

@zachaller The new canary RS was up with one or more pods. The Istio VirtualService was adjusted to weight one hundred percent to the canaryService (which was set as the final step of this particular Rollout). The stable RS then scaled down to zero. The Istio VirtualService then switched weights back to the stableService. The rollouts-pod-template-hash label selector on the stableService was still pointed at the old pod template hash it had just scaled to 0. Outage.

@harikrongali (Contributor) commented:

@zachaller what is the risk of merging the code & releasing a patch?

@zachaller (Collaborator, Author) commented:

I plan on releasing a patch next week with or without an update from Plaid. I am going to go through PRs and see if there are any other issues worth pulling in as well.

@zachaller merged commit b4c1d9e into argoproj:master on Dec 12, 2022
zachaller added a commit that referenced this pull request Dec 13, 2022
… single pod failures (#2441)

* fix(traficrouter): WIP on not setting weight if not available
* fix tests
* try bailing vs setting weight
* work with expirments that do not set any weights
* fix test by commenting out code
* lint
* simplify logic
* switch logic
* add more comments
* add more comments
* add more test
* refactor test
* refactor code to reduce duplication
* change comments a bit
* remove else

Signed-off-by: zachaller <[email protected]>
tperdue321 pushed a commit to rallyhealth/argo-rollouts that referenced this pull request Jan 12, 2023
… single pod failures (argoproj#2441)

Signed-off-by: zachaller <[email protected]>
Signed-off-by: Travis Perdue <[email protected]>