Policy for Balancing Pods across topology domains #146
/assign @krmayankk
@krmayankk As you know, we are adding even pod spreading in 1.15. Given that effort, I don't think it makes sense to add the same (or a very similar) feature in parallel to another component. If timing is a concern, you may want to fork a component and customize it for your use cases until we release even pod spreading.
Not sure I understand. The descheduler will just evict if a pod imbalance is detected. Are you against adding this to the descheduler?
The descheduler will just do this at runtime based on a new policy. Not sure I understand the point about the same API, since the descheduler cannot work on an API that is still being worked on. Maybe you are saying that the descheduler can later be modified to understand the pod topology fields being introduced by even pod spreading. Is my understanding correct?
I think that's what @bsalamat meant, but I will wait for his comments. Tbh I am wondering whether we need to use the API at all? We can obviously vendor the scheduler code and use it once we have functions that are specific to pod spreading.
Your understanding is right. We don't want to introduce a new API just for the descheduler and then change it in the future.
@bsalamat Since the new even-spreading feature has an API that depends on fields in the pod spec, we cannot use it now. So we would have to introduce a new policy in the descheduler that is similar in semantics to the pod fields being introduced by even pod spreading. Once that feature is readily available we can converge. The whole point of the descheduler being out of core is that we can adapt easily and not wait for the core changes, which in this case will take many months to even reach beta. The new descheduler policy could still use the following API for determining which pods to delete, and once the pod fields are available we just need to make it use those fields instead.
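Purely as an illustration of what such a policy could look like (the API block from the comment above is not reproduced here, and both the strategy name and the parameter names below are hypothetical), assuming the descheduler's v1alpha1 policy file format:

```yaml
# Hypothetical sketch only: a descheduler policy entry whose parameters mirror
# the semantics of the EvenPodsSpread pod fields (a topology key plus a maximum
# allowed skew between domains).
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "BalancePodsAcrossTopologyDomains":   # hypothetical strategy name
    enabled: true
    params:
      topologyKey: "topology.kubernetes.io/zone"  # domain to balance pods across
      maxSkew: 1                                  # max allowed pod-count difference between domains
```

Once the EvenPodsSpread pod fields land, the same strategy could drop its own parameters and read those fields directly, which is the convergence path discussed above.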
It is really up to the maintainers of the descheduler to decide. Since our incubator projects have external users, you should be careful with API changes IMO. At least, you should clearly mark such a new API as experimental.
I have started working on this PR and will get it up by Friday.
For anyone following along, the initial API and the initial implementation are in PR #154.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Hi folks, we faced the same issue: pods do not balance across availability zones after a rolling update. In 1.15, Kubernetes introduces the new Even Pods Spread feature to solve this; is there any solution that could apply to an older version of Kubernetes?
@axot The "EvenPodsSpread" feature has been postponed and will be available in 1.16 as an alpha feature. BTW: is your requirement even spread at scheduling time, at runtime (i.e. after the initial scheduling decision was made), or both?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
After performing some availability tests (powering off an entire availability zone) we frequently end up with all Deployment Pods running in the surviving site (the third availability zone is master-only). So having topology-based descheduling would greatly help in ensuring application availability.
/kind feature
@krmayankk Any update on this issue?
BTW: PodTopologySpread (a.k.a. EvenPodsSpread) will be promoted to beta in 1.18 in k/k.
/unassign @krmayankk
@seanmalloy @Huang-Wei are we taking the same approach as the PR I had started? Is there any work happening in k/k to solve runtime scheduling anytime soon?
@krmayankk Yes, we will basically be following the same approach. It's just that some data structures and logic have been optimized, which may be reusable here. And there is no significant work on runtime scheduling yet.
@damemi @aveshagarwal @ravisantoshgudimetla @Huang-Wei @krmayankk please let me know if you have any thoughts on the proposed API changes below for this strategy. Thanks! Here is the original API proposed by @krmayankk. I noticed that there are parameters for namespace and label selector. At this point in time the descheduler does not have any options for selecting pods by namespace or label.
I propose removing the parameters for namespace and label selector. Something like this:
I do believe it would be useful for the descheduler to restrict the pods it considers for eviction by namespace and label selector, but in my opinion these should be handled separately. See #251 for namespaces and #195 for label selectors.
Doing a bit more thinking, it might be possible to create this strategy without any parameters. Just use the pods' existing `topologySpreadConstraints`. So all that might be needed would be:
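As a rough sketch of that parameter-free shape, assuming the existing descheduler v1alpha1 policy format (the strategy name below is only a placeholder, since the final name had not been settled at this point in the thread):

```yaml
# Sketch only: no strategy-specific parameters; the strategy would read each
# candidate pod's own topologySpreadConstraints and evict pods whose current
# placement violates them.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodTopologySpread":   # placeholder strategy name
    enabled: true
```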
@seanmalloy you're right, the old API was proposed in the context of co-existing with PodTopologySpread, which at the time was still alpha. Now that PodTopologySpread is going to be GA, or at least beta, in 1.18, you can leverage the existing canonical API fields for sure.
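For reference, those canonical fields live on the pod spec itself. A minimal Deployment snippet using them might look like the following; the workload name, labels, and image are made up for the example:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                 # hypothetical workload name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      # Canonical PodTopologySpread fields: keep the pod count per zone within
      # a skew of 1, and leave pods pending rather than violating the constraint.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp
      containers:
      - name: myapp
        image: registry.k8s.io/pause:3.9   # placeholder image
```

A descheduler strategy could then evaluate the same constraints at runtime and evict pods whose placement no longer satisfies them.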
I believe this strategy should be named
Can we still target this for the 1.19 descheduler, @seanmalloy? @Huang-Wei, PodTopologySpread graduated to beta, right?
@damemi yep ... I have a branch with this feature here: https://github.com/KohlsTechnology/descheduler/tree/evenpod. The code doesn't work yet, but it does compile. :-) I believe I can submit a PR for review in a few weeks.
@damemi yes, it was beta in 1.18, and will be GA in 1.19. BTW: is the descheduler going to release a 1.19 version when k8s v1.19.0 is out?
@Huang-Wei yes, that is the plan. Going forward the descheduler should release a new version shortly after each k8s minor release.
@seanmalloy @Huang-Wei is this in progress? Can you link a PR here?
@krmayankk I'm hoping to open a pull request for this by the end of the week. My intent is to have the new strategy available for descheduler release v0.19.0, which should be released sometime in August. Here is the unfinished code that I'll be working towards completing this week:
Still no pull request. I'll start working on this again the week of August 10th. Sorry for the delay.
1.19 GA is planned for August 25th, so I think as long as we have it before then it can make our 1.19 cut.
Sorry but I don't have time to finish this. See details in WIP pull request #383.
/unassign
/assign
I think I can pick this one up. Thanks for starting the work @seanmalloy!
New PR open here: #413. Please provide feedback to help optimize what I've got!
This feature will ship as part of descheduler release v0.20.0.
The currently well-known topology domain labels on the nodes are:
The descheduler currently supports the following options: `RemoveDuplicates`, `LowNodeUtilization`, `RemovePodsViolatingInterPodAntiAffinity` and `RemovePodsViolatingNodeAffinity`. None of these options supports balancing pods across topology domains. The closest one to mimic this behavior would be `LowNodeUtilization`, which, based on two thresholds, evicts pods from nodes that don't fall within a range, but that is suboptimal. Currently there is a KEP for even pod spreading, but it will take some time to land. In the meantime, we need a policy in the descheduler which does the following:
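For context, a minimal sketch of the two-threshold `LowNodeUtilization` configuration referred to above, using the descheduler v1alpha1 policy format (the numbers are only illustrative):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:         # nodes below all of these are considered underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:   # nodes above any of these are candidates to evict from
          cpu: 50
          memory: 50
          pods: 50
```

Because these thresholds only look at aggregate node utilization, they say nothing about how the pods of a single workload are spread across zones, which is why this strategy is suboptimal for topology balancing.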