
Policy for Balancing Pods across topology domains #146

Closed
krmayankk opened this issue May 7, 2019 · 45 comments · Fixed by #413
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@krmayankk

The currently well-known topology domain labels on nodes are:

  • failure-domain.beta.kubernetes.io/region
  • failure-domain.beta.kubernetes.io/zone

The descheduler currently supports the following strategies: RemoveDuplicates, LowNodeUtilization, RemovePodsViolatingInterPodAntiAffinity, and RemovePodsViolatingNodeAffinity. None of these balance pods across topology domains. The closest approximation is LowNodeUtilization, which evicts pods from nodes whose utilization falls outside two configured thresholds, but that is suboptimal for spreading.

There is currently a KEP for even pod spreading, but it will take some time to land. In the meantime, we need a policy in the descheduler that does the following (a rough sketch follows the list):

  • forms a topology domain that is the combination of region and zone
  • groups nodes into buckets using the topology domain key from step 1
  • selects the pods this policy applies to based on a labelSelector
  • finds the bucket whose count of pods (selected by the labelSelector) exceeds that of any other bucket by the largest margin, and starts evicting pods from that bucket in the hope that they will be rescheduled onto nodes in other buckets
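
A minimal sketch of that bucketing and skew logic in Go, assuming hypothetical inputs (the node list and the per-node pod map are placeholders supplied by the caller, not descheduler APIs):

// Sketch only: illustrates the proposed bucketing and eviction-candidate
// selection; it is not the actual descheduler strategy.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// bucketKey combines a node's region and zone labels into one topology domain.
func bucketKey(node *v1.Node) string {
	return node.Labels["failure-domain.beta.kubernetes.io/region"] + "/" +
		node.Labels["failure-domain.beta.kubernetes.io/zone"]
}

// countPods counts pods matching the selector in each topology domain.
func countPods(nodes []*v1.Node, podsByNode map[string][]*v1.Pod, selector labels.Selector) map[string]int {
	counts := map[string]int{}
	for _, node := range nodes {
		key := bucketKey(node)
		if _, ok := counts[key]; !ok {
			counts[key] = 0 // make empty buckets visible in the result
		}
		for _, pod := range podsByNode[node.Name] {
			if selector.Matches(labels.Set(pod.Labels)) {
				counts[key]++
			}
		}
	}
	return counts
}

// mostCrowdedBucket returns the bucket whose count exceeds the least-populated
// bucket by the largest margin; pods there are the eviction candidates.
func mostCrowdedBucket(counts map[string]int) (string, int) {
	first := true
	var minCount, maxCount int
	var maxKey string
	for key, c := range counts {
		if first || c < minCount {
			minCount = c
		}
		if first || c > maxCount {
			maxCount, maxKey = c, key
		}
		first = false
	}
	return maxKey, maxCount - minCount
}

func main() {
	// Illustrative numbers only: here the strategy would evict from "us-west/us-west-2a".
	counts := map[string]int{"us-west/us-west-2a": 5, "us-west/us-west-2b": 1}
	bucket, skew := mostCrowdedBucket(counts)
	fmt.Printf("evict from bucket %s (skew %d)\n", bucket, skew)
}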
@krmayankk
Author

krmayankk commented May 7, 2019

/assign @krmayankk

@bsalamat

bsalamat commented May 7, 2019

@krmayankk As you know, we are adding even pod spreading in 1.15. Given that effort, I don't think it makes sense to add the same (or a very similar) feature in parallel to another component. If timing is a concern, you may want to fork a component and customize it for your use cases until we release even pod spreading.
That being said, it makes sense to add support for even pod spreading to the Descheduler. The scheduler enforces even pod spreading at the time of scheduling pods. Descheduler can enforce the feature at run-time. If we go with this approach, we should ensure that Descheduler uses the same API and does not add its own API for the feature.

@krmayankk
Author

I don't think it makes sense to add the same (or a very similar) feature in parallel to another component.

Not sure I understand. The descheduler will just evict pods if an imbalance is detected. Are you against adding this to the descheduler?

Descheduler can enforce the feature at run-time. If we go with this approach, we should ensure that Descheduler uses the same API and does not add its own API for the feature.

The descheduler would just do this at runtime based on a new policy. I'm not sure I understand the point about using the same API, since the descheduler cannot build on an API that is still being worked on. Maybe you are saying that the descheduler can later be modified to understand the pod topology fields being introduced by even pod spreading. Is my understanding correct?

@ravisantoshgudimetla
Contributor

Maybe you are saying that the descheduler can later be modified to understand the pod topology fields being introduced by even pod spreading. Is my understanding correct?

I think that's what @bsalamat meant, but I will wait for his comments. TBH, I am wondering whether we even need to use the API. We can obviously vendor the scheduler code and use it once it has functions that are specific to pod spreading.

@bsalamat

The descheduler would just do this at runtime based on a new policy. I'm not sure I understand the point about using the same API, since the descheduler cannot build on an API that is still being worked on. Maybe you are saying that the descheduler can later be modified to understand the pod topology fields being introduced by even pod spreading. Is my understanding correct?

Your understanding is right. We don't want to introduce a new API just for the Descheduler and then change it in the future.

@krmayankk
Author

krmayankk commented May 16, 2019

@bsalamat Since the new even pod spreading feature has an API that depends on fields in the pod spec, we cannot use it now. So we would have to introduce a new policy in the descheduler that is similar in semantics to the pod fields being introduced by even pod spreading. Once that feature is readily available, we can converge. The whole point of the descheduler being out of core is that we can adapt quickly rather than wait for the core changes, which in this case will take many months to even reach beta.

The new descheduler policy could still use the following API shape for determining which pods to delete; once the pod fields are available, we just need to make it use those fields.

// A single constraint: spread across hostnames, tolerating a skew of at most 1.
TopologySpreadConstraints: []api.TopologySpreadConstraint{
	{
		MaxSkew:           1,
		TopologyKey:       "kubernetes.io/hostname",
		WhenUnsatisfiable: api.DoNotSchedule,
	},
},
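
For context, the pod-level fields that even pod spreading introduces look roughly like the snippet below (a sketch based on the feature's API shape; the pod name, labels, image, and zone key are illustrative assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: server-pod
  labels:
    apptype: server
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: failure-domain.beta.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        apptype: server
  containers:
  - name: server
    image: example.com/server:latest   # placeholder image

Once these fields land, the descheduler policy could converge on reading them directly from running pods instead of carrying its own copy.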

@bsalamat

It is really up to the maintainers of the Descheduler to decide. Since our incubator projects have external users, you should be careful with API changes, IMO. At the least, you should clearly mark any such new API as experimental.

@krmayankk
Author

I have started working on this PR and will get it up by Friday.

@krmayankk
Author

For anyone following along, the initial API and implementation are in PR #154.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2019
@rhockenbury

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2019
@axot

axot commented Aug 30, 2019

Hi folks, we faced the same issue: pods do not balance across availability zones after a rolling update. Kubernetes 1.15 introduces the new Even Pods Spread feature to solve this; is there any solution that could be applied in an older version of Kubernetes?

@Huang-Wei

@axot The "EvenPodsSpread" feature is postponed to be available in 1.16 as an alpha feature.

BTW: is your requirement for even spreading at scheduling time, at runtime (i.e. after the initial scheduling decision was made), or both?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 3, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 2, 2020
@seanmalloy
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 2, 2020
@stephan2012
Contributor

stephan2012 commented Jan 17, 2020

After performing some availability tests (powering off an entire availability zone), we frequently end up with all Deployment Pods running in the surviving site (the third availability zone is master-only). So having topology-based descheduling would greatly help ensure application availability.

@seanmalloy
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 6, 2020
@Huang-Wei

@krmayankk Any update on this issue?

@Huang-Wei

BTW: PodTopologySpread (a.k.a. EvenPodsSpread) will be promoted to beta in 1.18 in k/k.

@seanmalloy
Member

/unassign @krmayankk

@krmayankk
Author

@seanmalloy @Huang-Wei are we taking the same approach as the PR I had started? Is there any work happening in k/k to solve runtime scheduling anytime soon?

@Huang-Wei

@krmayankk Yes, we will basically follow the same approach. It's just that some data structures and logic have been optimized, and those may be reusable here.

And there is no significant work on runtime scheduling yet.

@seanmalloy
Member

@damemi @aveshagarwal @ravisantoshgudimetla @Huang-Wei @krmayankk please let me know if you have any thoughts on the below proposed API changes for this strategy. Thanks!

Here is the original API proposed by @krmayankk. I noticed that there are parameters for namespace and label selector. At this point in time the descheduler does not have any options for selecting pods by namespace or label.

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "TopologySpreadConstraint":
     enabled: true
     params:
        namespacedtopologyspreadconstraints:
         - namespace: sam-system
           topologyspreadconstraints:
            - maxSkew: 1
              topologyKey: failure-domain.beta.kubernetes.io/zone
              labelSelector:
                matchLabels:
                  apptype: server

I propose removing the parameters for namespace and label selector. Something like this:

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "TopologySpreadConstraint":
     enabled: true
     params:
       topologyspreadconstraints:
       - maxSkew: 1
         topologyKey: topology.kubernetes.io/zone

I do believe that it would be useful for the descheduler to restrict the pods it considers for eviction by namespace and label selector, but in my opinion these should be handled separately. See #251 for namespaces and #195 for label selectors.

@seanmalloy
Member

Doing a bit more thinking, it might be possible to create this strategy without any parameters: just use the maxSkew, topologyKey, and labelSelector from the running pods (https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/).

So all that might be needed would be:

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "TopologySpreadConstraint":
     enabled: true
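
A rough Go sketch of how such a parameter-less strategy might evaluate the constraints already declared on running pods (this is not the implementation in #413; the nodeLabels map and the eviction step are assumptions left to the caller):

// Sketch only: returns the pods in the most crowded topology domain when a
// pod-level topologySpreadConstraint is violated. nodeLabels maps a node name
// to that node's labels and is assumed to be built from a node lister.
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// violatingPods computes, for one constraint, the per-domain pod counts and
// returns the pods of the most populated domain when the skew (max count
// minus min count) exceeds the constraint's maxSkew.
func violatingPods(pods []*v1.Pod, c v1.TopologySpreadConstraint, nodeLabels map[string]map[string]string) []*v1.Pod {
	selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
	if err != nil {
		return nil
	}
	byDomain := map[string][]*v1.Pod{}
	for _, pod := range pods {
		if !selector.Matches(labels.Set(pod.Labels)) {
			continue
		}
		domain := nodeLabels[pod.Spec.NodeName][c.TopologyKey]
		byDomain[domain] = append(byDomain[domain], pod)
	}
	minCount, maxCount, maxDomain := -1, -1, ""
	for domain, ps := range byDomain {
		if minCount == -1 || len(ps) < minCount {
			minCount = len(ps)
		}
		if len(ps) > maxCount {
			maxCount, maxDomain = len(ps), domain
		}
	}
	if maxCount < 0 || int32(maxCount-minCount) <= c.MaxSkew {
		return nil // constraint satisfied (or no matching pods); nothing to evict
	}
	return byDomain[maxDomain] // eviction candidates from the most crowded domain
}

The strategy could run this per namespace and per constraint, then hand the returned pods to the descheduler's existing eviction machinery.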

@Huang-Wei

@seanmalloy you're right. The old API was proposed in the context of co-existing with PodTopologySpread while it was still alpha. Now that PodTopologySpread is going to be GA, or at least beta, in 1.18, you can certainly leverage the existing canonical API fields.

@seanmalloy
Member

I believe this strategy should be named RemovePodsViolatingTopologySpreadConstraint instead of TopologySpreadConstraint.

---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
     enabled: true

@damemi
Contributor

damemi commented Jun 26, 2020

Can we still target this for the 1.19 descheduler, @seanmalloy? @Huang-Wei, PodTopologySpread graduated to beta, right?

@seanmalloy
Member

Can we still target this for the 1.19 descheduler, @seanmalloy?

@damemi yep ... I have a branch with this feature here: https://github.com/KohlsTechnology/descheduler/tree/evenpod. The code doesn't work yet, but it does compile. :-) I believe I can submit a PR for review in a few weeks.

@Huang-Wei

@damemi yes, it was beta in 1.18, and will be GA in 1.19.

BTW: is the descheduler going to release a 1.19 version when k8s v1.19.0 is out?

@seanmalloy
Member

BTW: is the descheduler going to release a 1.19 version when k8s v1.19.0 is out?

@Huang-Wei yes, that is the plan. Going forward, the descheduler should release a new version shortly after each k8s minor release.

@krmayankk
Author

@seanmalloy @Huang-Wei is this in progress? Can you link a PR here?

@seanmalloy
Member

@krmayankk I'm hoping to open a pull request for this by the end of the week. My intent is to have the new strategy available for descheduler release v0.19.0, which should ship sometime in August.

Here is the unfinished code that I'll be working towards completing this week:
https://github.com/KohlsTechnology/descheduler/tree/evenpod

@seanmalloy
Member

Still no pull request. I'll start working on this again the week of August 10th. Sorry for the delay.

@damemi
Contributor

damemi commented Aug 3, 2020

Still no pull request. I'll start working on this again the week of August 10th. Sorry for the delay.

1.19 GA is planned for August 25th, so I think as long as we have it before then, it can make our 1.19 cut.

@seanmalloy
Member

Sorry, but I don't have time to finish this. See details in WIP pull request #383.

/unassign
/help

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 28, 2020
@damemi
Contributor

damemi commented Aug 31, 2020

/assign
/remove-help

I think I can pick this one up. Thanks for starting the work @seanmalloy!

@k8s-ci-robot k8s-ci-robot removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 31, 2020
@damemi
Contributor

damemi commented Sep 30, 2020

New PR open here: #413. Please provide feedback to help optimize what I've got!

@seanmalloy
Member

This feature will ship as part of descheduler release v0.20.0.
