Provide a way to specify a minimum number of pods per Replicaset #1779

Closed
ssanders1449 opened this issue Jan 16, 2022 · 11 comments
Labels
enhancement (New feature or request), needs-follow-up (Used when a maintainer needs to follow up), no-issue-activity

Comments

@ssanders1449
Contributor

ssanders1449 commented Jan 16, 2022

Summary

The issue that I was trying to solve with #1738 arises when a Horizontal Pod Autoscaler is controlling the number of replicas in the rollout. I want the number of canary pods to match the weight of the rollout, but I also want to ensure a minimum number of pods (e.g. 2) for high availability. Currently, I can only use setCanaryScale to set either a fixed number of pods or a percentage of the total replicas. I want to be able to set a percentage, but apply a minimum as well.
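For reference, a minimal sketch of the two existing setCanaryScale forms (assuming a traffic-routed canary, which setCanaryScale requires); neither lets you combine a percentage with a minimum pod count:

strategy:
  canary:
    steps:
    - setCanaryScale:
        weight: 5        # canary sized as a percentage of the rollout's replica count
    - setCanaryScale:
        replicas: 2      # canary pinned to a fixed pod count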

Also, if I am using Dynamic Stable Scale, I want a way to ensure that the stable ReplicaSet never goes below a minimum number of pods (e.g. 2), again to ensure high availability.

Use Cases

Assume a Horizontal Pod Autoscaler is controlling the number of replicas in the rollout, with a minimum of 10 pods and a maximum of 2000. I do not know up front how many pods the rollout will have. So, let's say I am running a step with a canary weight of 5 percent. If the total number of pods is less than 20, that would only give me one canary pod, and I would not have high availability. However, I can't use the 'replicas' option of setCanaryScale to set a fixed number of replicas, because if the HPA does scale up to 2000 pods, I want to have 100 canary pods.

A second, but related, use case is if I am using Dynamic Stable Scale, so the stable ReplicaSet scales down when the canary scales up. If we have 4 replicas and the canary is at 50%, we are OK (2 pods each for canary and stable). However, if we then go to 75% canary, we will be left with only one pod in the stable ReplicaSet and again lose high availability. Again, the total number of replicas may be controlled by an HPA, so we don't know up front, when creating this step, how many replicas there will be.
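For illustration, a sketch of this second case (dynamicStableScale and setWeight are existing canary fields; traffic routing is assumed to be configured, and the numbers mirror the example above):

spec:
  replicas: 4
  strategy:
    canary:
      dynamicStableScale: true   # stable ReplicaSet scales down as the canary scales up
      steps:
      - setWeight: 75            # leaves only 1 stable pod when replicas is 4
      - pause: {}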


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

ssanders1449 added the enhancement label Jan 16, 2022
ssanders1449 pushed a commit to ssanders1449/argo-rollouts that referenced this issue Mar 28, 2022
ssanders1449 changed the title from "Add 'min' option to setCanaryScale" to "Add 'minReplicas' option to setCanaryScale" Jul 27, 2022
@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.

@alexef
Member

alexef commented Nov 15, 2022

@zachaller so I see the reason not to add complexity to the controller and the Rollout spec in #1738 (comment)

Should a global flag (--canary-min-replicas) affecting all Rollouts be an acceptable compromise here?

@zachaller
Collaborator

zachaller commented Nov 15, 2022

You could use a Pod Disruption Budget to do the exact same thing; you should be able to use the spec.minAvailable option, based on https://kubernetes.io/docs/tasks/run-application/configure-pdb/#arbitrary-controllers-and-selectors

FYI, I'm not 100% sure it covers the use case, but would that work?
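For example, something along these lines (a sketch; the name and selector labels are placeholders, and per the linked docs only an integer minAvailable is guaranteed to work for arbitrary controllers):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-rollout-pdb          # placeholder name
spec:
  minAvailable: 2               # integer value, per the arbitrary-controllers note
  selector:
    matchLabels:
      app: my-rollout           # placeholder: must match the Rollout's pod labels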

@zachaller
Collaborator

@ssanders1449 any updates on whether you could use a PDB to accomplish the same thing or not?

@ssanders1449
Contributor Author

ssanders1449 commented Nov 19, 2022

@zachaller as far as I can see, setting a Pod Disruption Budget on a Rollout would count all of the pods in the Rollout (both stable and canary) when making that calculation - with no distinction between them. So, it really can't do what we want here.

This is actually one of the enhancements that I would like to see for argo-rollouts. Instead of defining a PDB separately from the Rollout (which will operate on all pods equally), we should add a 'pdb' section to the Rollout definition and have Argo create/delete a PDB with those parameters whenever it creates/deletes a ReplicaSet. In other words, we would have a separate PDB per ReplicaSet. This would allow us to specify, for example, a PDB with a 'maxUnavailable' percentage per ReplicaSet.

Even if we had 'PDB per ReplicaSet' support, I would still say that we need 'min pods per ReplicaSet' support, because a PDB using maxUnavailable will still allow eviction of a single pod.
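For illustration only, a hypothetical shape for such a section (no 'pdb' field exists in the Rollout spec today; the field name and placement here are made up):

spec:
  strategy:
    canary:
      pdb:                      # hypothetical field, not part of the current Rollout API
        maxUnavailable: 10%     # would be copied into a PDB created per ReplicaSet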

@ssanders1449
Contributor Author

I created a new issue for adding support for a PDB per ReplicaSet (#2419), but as I said above, it won't solve this issue if the PDB is defined using 'maxUnavailable'.

@zachaller
Collaborator

So I was also able to spend a little more time looking into this; here is a short summary.

With Rollouts' ability to adjust labels on individual pods via:

spec:
  replicas: 10
  strategy:
    canary:
      canaryMetadata:
        annotations:
          role: c
        labels:
          role: c

      stableMetadata:
        annotations:
          role: s
        labels:
          role: s

I am pretty sure this would allow you to use a PDB on the canary only; however, the limitation will then apply that you can only use .spec.minAvailable from the PDB. I would be curious whether that solves your immediate use case for minAvailable. That said, I do think there is some value in being able to use all of the PDB features if we move it to the ReplicaSet. This does seem to have to be implemented by the rollouts controller, though, and I would want to spend a little more time thinking about any downsides of having a PDB on the RS in regard to how rollouts works.
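For example, a sketch of a PDB scoped to the canary pods via the role: c label from canaryMetadata above (the name and minAvailable value are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: canary-pdb              # placeholder name
spec:
  minAvailable: 2               # placeholder value
  selector:
    matchLabels:
      role: c                   # label applied by canaryMetadata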

ssanders1449 changed the title from "Add 'minReplicas' option to setCanaryScale" to "Provide a way to specify a minimum number of pods per Replicaset" Nov 22, 2022
@ssanders1449
Contributor Author

ssanders1449 commented Nov 22, 2022

It seems that the premise about the PDB limitation was incorrect. We just performed a test on a Rollout where we defined a PDB with 'maxUnavailable: 10%' (which is what we used for most of our pre-Rollout Deployments), and it worked perfectly. Also, using the labels defined by canaryMetadata and stableMetadata, we were able to create 2 separate PDBs with 'maxUnavailable: 10%', and they both worked properly. So, I closed #2419.
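For reference, a sketch of the two PDBs from that test (names are placeholders; the labels come from canaryMetadata/stableMetadata):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: canary-pdb              # placeholder name
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      role: c                   # canaryMetadata label
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stable-pdb              # placeholder name
spec:
  maxUnavailable: 10%
  selector:
    matchLabels:
      role: s                   # stableMetadata label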

We could do your proposed workaround using 'minAvailable: 50%'. It will protect a single pod while allowing evictions when there is more than one pod. We cannot use a percentage higher than 50%, because otherwise it won't allow evictions even when there are two pods, which can prevent cluster scale-down and cost money.

However, if we use 'minAvailable: 50%' and have many pods, this can reduce capacity by as much as 50% when draining a lot of nodes, which might be a hard sell to our DevOps teams.

Therefore, we prefer to use 'maxUnavailable: 10%', because it scales the 'disruptionsAllowed' field in the PDB automatically as the number of pods scales up, while minimizing the impact on our SLA when a lot of nodes are drained at once.

The downside of 'maxUnavailable: 10%' is that it only has an effect when there are at least 2 pods; it still allows eviction when there is only a single pod. This is why we originally came up with the requirement for a way to specify a minimum number of pods per ReplicaSet. We have three proposed implementations:

  1. add rollout.Spec.Strategy.Canary.MinPodsPerRS ssanders1449/argo-rollouts#1
  2. feat: Implement Issue #1779: Add MinReplicas option to SetCanaryScale #1938
  3. feat: Implement Issue #1779: add --canary-min-replicas command-line flag #2410

@zachaller
Collaborator

Thank you for the great write-up on the reasoning. It is also good to know that the k8s docs about the PDB limitations do not apply here; I am guessing it's because the parent is still a ReplicaSet. Also, thanks for the 3 implementations; give me just a bit to think about which one makes the most sense.

zachaller added the needs-follow-up label Nov 27, 2022
@zachaller
Collaborator

@ssanders1449 I think going with your option 1 is something we would possibly be interested in. Could you open a PR for that against rollouts so that we can run e2e tests, etc.?

zachaller added a commit that referenced this issue Dec 19, 2022
…PerReplicaSet (#2448)

* add rollout.Spec.Strategy.Canary.MinPodsPerRS (for TrafficRoutedCanary only)
* fix lint and test
* fix codegen
* rename MinPodsPerRS to MinPodsPerReplicaSet
* fix lint and codegen
* update specification.md
* codegen
* codegen missed a file

Signed-off-by: Shlomo Sanders <[email protected]>
Signed-off-by: zachaller <[email protected]>
Co-authored-by: Shlomo Sanders <[email protected]>
Co-authored-by: zachaller <[email protected]>
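For reference, a sketch of how the field from #2448 might look in a Rollout spec (assuming the JSON name mirrors the renamed Go field; the value 2 is just an example, and per the PR it applies to traffic-routed canaries only):

spec:
  strategy:
    canary:
      minPodsPerReplicaSet: 2   # floor on the number of pods requested per ReplicaSet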
tperdue321 pushed a commit to rallyhealth/argo-rollouts that referenced this issue Jan 12, 2023
….MinPodsPerReplicaSet (argoproj#2448)
@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity.
