
Runtime Even Pod Spreading #154

Closed
wants to merge 3 commits

Conversation


@krmayankk krmayankk commented May 23, 2019

Fixes #146

API based on KEP

The descheduler policy is essentially a set of TopologySpreadConstraints per namespace. It is described as follows:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "TopologySpreadConstraint":
     enabled: true
     params:
        namespacedtopologyspreadconstraints:
         - namespace: sam-system
           topologyspreadconstraints:
            - maxSkew: 1
              topologyKey: failure-domain.beta.kubernetes.io/zone
              labelSelector:
                      matchLabels:
                              apptype: server
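For reference, a minimal Go sketch of the types this policy block implies; the field names mirror this PR's discussion (NamespacedTopologySpreadConstraint, MaxSkew, TopologyKey, LabelSelector) and are not the authoritative definitions, which live in the descheduler's api package.

// Sketch only: descheduler policy types implied by the YAML above.
package api

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NamespacedTopologySpreadConstraint groups spread constraints by namespace.
type NamespacedTopologySpreadConstraint struct {
	Namespace                 string
	TopologySpreadConstraints []TopologySpreadConstraint
}

// TopologySpreadConstraint carries the fields this strategy reads from the policy.
type TopologySpreadConstraint struct {
	// MaxSkew is the maximum allowed difference in the number of matching
	// pods between any two topology domains.
	MaxSkew int32
	// TopologyKey is the node label that defines a topology domain,
	// e.g. failure-domain.beta.kubernetes.io/zone.
	TopologyKey string
	// LabelSelector selects the pods the constraint applies to.
	LabelSelector *metav1.LabelSelector
}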

@k8s-ci-robot
Contributor

Welcome @krmayankk!

It looks like this is your first PR to kubernetes-incubator/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-incubator/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 23, 2019
@krmayankk
Author

FYI @ravisantoshgudimetla @bsalamat @Huang-Wei @aveshagarwal this PR is to initiate API discussion

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 23, 2019
@krmayankk krmayankk force-pushed the evenpod branch 2 times, most recently from c94a2a8 to 85b0603 Compare May 29, 2019 00:25
@krmayankk
Author

/assign @bsalamat @ravisantoshgudimetla @Huang-Wei

@k8s-ci-robot
Contributor

@krmayankk: GitHub didn't allow me to assign the following users: Huang-Wei.

Note that only kubernetes-incubator members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @bsalamat @ravisantoshgudimetla @Huang-Wei

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -0,0 +1,150 @@
/*
Copyright 2017 The Kubernetes Authors.
Author

TODO: Add UT once we reach consensus on the api

Contributor

Change it to 2019..

// scheduling it onto zone1(zone2) would make the ActualSkew(2) violate MaxSkew(1)
// - if MaxSkew is 2, incoming pod can be scheduled to any zone.
// It's a required value. Default value is 1 and 0 is not allowed.
MaxSkew int32
Contributor

@Huang-Wei - I believe this is in line with what you're proposing...

Yes.

@krmayankk if the upstream API is available, I think you can simply vendor that?

Author

Currently the descheduler looks at TopologySpreadConstraints as defined in the DeschedulerPolicy. Later these constraints will come directly from the Pod itself, so there is nothing to vendor; only the source of the TopologySpreadConstraints will change. @Huang-Wei

return
}

fmt.Printf("Found following parameters for TopologySpreadConstraint %v\n", strategy)
Contributor

Please change to klog or glog..

constraint api.NamespacedTopologySpreadConstraint) {

if len(constraint.TopologySpreadConstraints) != 1 {
glog.V(1).Infof("We currently only support 1 topology spread constraint per namespace")
Contributor

I think we need to explicitly document this, along with the reason.

Agree. This is a significant limitation.

I lean toward delivering a full implementation which respects all constraints.

Author

@Huang-Wei so per namespace, do an AND of all constraints?

@krmayankk Yes.
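A sketch of what that AND could look like when filtering candidate pods. The helper name and the descheduler api import path are assumptions; the diff above currently inspects only TopologySpreadConstraints[0].

// Sketch only: a pod is considered for eviction only if it matches the label
// selector of every constraint declared for its namespace.
package strategies

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"

	// assumed import path for the descheduler policy types
	"github.com/kubernetes-incubator/descheduler/pkg/api"
)

func podMatchesAllConstraints(pod *v1.Pod, nc api.NamespacedTopologySpreadConstraint) (bool, error) {
	for _, c := range nc.TopologySpreadConstraints {
		selector, err := metav1.LabelSelectorAsSelector(c.LabelSelector)
		if err != nil {
			return false, err
		}
		if !selector.Matches(labels.Set(pod.Labels)) {
			return false, nil
		}
	}
	return true, nil
}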

// does this pod labels match the constraint label selector
// TODO: This is intentional that it only looks at the first constraint
selector, err := metav1.LabelSelectorAsSelector(constraint.TopologySpreadConstraints[0].LabelSelector)
if err != nil {
Contributor

log and continue..

continue
}
if !selector.Matches(labels.Set(pod.Labels)) {
continue
Contributor

Same as above: if you think you're going to log heavily, please increase the log level, or append all the failures and log them at the end.

if int32(podsInTopo-minPodsForGivenTopo) >= constraint.TopologySpreadConstraints[0].MaxSkew {
//we need to evict maxSkew-(podsInTopo-minPodsForGivenTopo))
countToEvict := constraint.TopologySpreadConstraints[0].MaxSkew - int32(podsInTopo-minPodsForGivenTopo)
podsListToEvict := GetPodsToEvict(countToEvict, v)
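Side note on the arithmetic: under the if-condition above, MaxSkew - (podsInTopo - minPodsForGivenTopo) is zero or negative, while the stated intent is to bring each domain back to within maxSkew of the smallest one, which suggests the excess over minPods + maxSkew instead (the same formula proposed later in this thread, needEvictNum = len(podSet) - minMatch - maxSkew). A small sketch, with parameter names mirroring the diff rather than taken from it:

// Sketch only: how many pods to evict from one topology domain so that its
// matching-pod count ends up no more than maxSkew above the smallest domain.
func countToEvictForDomain(podsInTopo, minPodsForGivenTopo int, maxSkew int32) int32 {
	excess := int32(podsInTopo-minPodsForGivenTopo) - maxSkew
	if excess < 0 {
		return 0
	}
	return excess
}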
Contributor

Why is this public method? Do you want to expose it for testing?

}

// GetPodFullName returns a name that uniquely identifies a pod.
func GetPodFullName(pod *v1.Pod) string {
Contributor

Where are we using this function?

if !selector.Matches(labels.Set(pod.Labels)) {
continue
}
// TODO: Need to determine if the topokey already present in the node or not
Contributor

I think we need to address this as part of the PR.

@@ -60,11 +60,13 @@ func Run(rs *options.DeschedulerServer) error {
return nil
}

glog.V(1).Infof("Reached here \n")

TODO: remove this upon merging.


type topologyPairSet map[topologyPair]struct{}

// finnd all nodes
@Huang-Wei Huang-Wei Jun 14, 2019

Please reword the comments, as well as the following ones.

}

fmt.Printf("Found following parameters for TopologySpreadConstraint %v\n", strategy)
for _, topoConstraints := range strategy.Params.NamespacedTopologySpreadConstraints {
@Huang-Wei Huang-Wei Jun 14, 2019

The topoConstraints are sort of "aggregated" topologySpreadConstraints grouped by namespace? If so, I suggest renaming NamespacedTopologySpreadConstraints to AggTopologySpreadConstraintsByNs.

Author

I have described the DeschedulerPolicy in the description; here is what it looks like. Basically this loop is just reading the per-namespace constraints specified in the policy. Maybe TopologySpreadConstraintsPerNamespace?

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "TopologySpreadConstraint":
    enabled: true
    params:
      namespacedtopologyspreadconstraints:
        - namespace: sam-system
          topologyspreadconstraints:
            - maxSkew: 1
              topologyKey: failure-domain.beta.kubernetes.io/zone
              labelSelector:
                matchLabels:
                  apptype: server

@Huang-Wei

Please remove pkg/descheduler/strategies/.pod_antiaffinity.go.swp.

topologyPairToPods := make(map[topologyPair]podSet)
for _, node := range nodes {
glog.V(1).Infof("Processing node: %#v\n", node.Name)
pods, err := podutil.ListEvictablePodsOnNode(client, node, false)

I believe clientset is able to filter pods by namespace. We can pass in namespace hence don't need to check if pod.Namespace != constraint.Namespace later.
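For illustration, a namespace-scoped list with a label selector could look roughly like this; the function name is a sketch, and the pre-1.18 client-go List signature is shown (newer client-go versions take a context.Context as the first argument):

// Sketch only: fetch candidate pods once per namespace, letting the API server
// filter by namespace and label selector, instead of listing per node and
// filtering afterwards.
package strategies

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	clientset "k8s.io/client-go/kubernetes"
)

func listPodsForNamespace(client clientset.Interface, namespace string, selector labels.Selector) (*v1.PodList, error) {
	return client.CoreV1().Pods(namespace).List(metav1.ListOptions{
		LabelSelector: selector.String(),
	})
}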

Listing pods for each node will cause N API requests (N=len(nodes)). It's better to change the logic to:

for each topologySpreadConstraint
    pre-fetch all qualified pods globally (or use similar cache in descheduler if appropriate)
    process the pods list:
	(1) filter out the pods whose node doesn't have needed topologyKey
	(2) so that we can get a map[nodeName]podSet, and also we know the minimum match number
	for nodeName, podSet := range map[nodeName]podSet
		needEvictNum := len(podSet) - minMatch - maxSkew
		if needEvictNum > 0
			evict needEvictNum pods from this Node
			Note: this math is sort of brute force, we can come up with better math later.
			For example, for a 5/1/3 cluster, and maxSkew is 1; the brute force math above will
			evict 3/0/1 pods from each topology, but if we consider the math "dynamically", we
			should only evict 2/0/0 pods, so that eventually it can become 3/3/3.
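A rough Go sketch of the brute-force version of that flow; the helper name is illustrative, and the smarter "dynamic" math mentioned at the end (2/0/0 instead of 3/0/1 for the 5/1/3 example) is not attempted here:

// Sketch only: given the pods matching one constraint, grouped by the value of
// the topology key on their node, choose pods to evict so that every domain
// ends up within maxSkew of the smallest domain.
package strategies

import v1 "k8s.io/api/core/v1"

func podsToEvictPerDomain(maxSkew int32, podsByDomain map[string][]*v1.Pod) map[string][]*v1.Pod {
	if len(podsByDomain) == 0 {
		return nil
	}
	// Minimum number of matching pods in any topology domain.
	minMatch := -1
	for _, pods := range podsByDomain {
		if minMatch < 0 || len(pods) < minMatch {
			minMatch = len(pods)
		}
	}
	// Evict the excess over (minMatch + maxSkew) from each domain.
	toEvict := make(map[string][]*v1.Pod)
	for domain, pods := range podsByDomain {
		if n := len(pods) - minMatch - int(maxSkew); n > 0 {
			toEvict[domain] = pods[:n]
		}
	}
	return toEvict
}

Which pods within a domain get picked (pods[:n] here) is arbitrary in this sketch, much like GetPodsToEvict in the diff; a real implementation would likely prefer lower-priority or best-effort pods first.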

for _, node := range nodes {
glog.V(1).Infof("Processing node: %#v\n", node.Name)
pods, err := podutil.ListEvictablePodsOnNode(client, node, false)
if err != nil {

Incomplete? Should log and continue.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign ravisantoshgudimetla
You can assign the PR to them by writing /assign @ravisantoshgudimetla in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@krmayankk krmayankk changed the title from "Runtime Even Pod Spreadig: Api Discussion" to "Runtime Even Pod Spreading" Jul 17, 2019
@Huang-Wei

/assign

@Huang-Wei

@krmayankk I'm not sure I like the API design in policy "namespacedtopologyspreadconstraints".

I checked the existing descheduler policy examples, and all of them offer pretty simple parameters to evict pods violating that kind of policy.

IMO we should offer a neat policy for EvenPodsSpread, i.e. just enable or disable it, and probably additionally provide an option like AntiAffinity in case the upstream feature supports more v1.UnsatisfiableConstraintAction options:

https://github.com/kubernetes-incubator/descheduler/blob/9e28f0b362ea5afa6ef4ec15f95cd5fc7eaf108a/examples/node-affinity.yml#L4-L8

With the current API, users must explicitly specify every topologySpreadConstraint, which isn't practical.

So I want to stop reviewing here to get a consensus on the API first.

cc @ravisantoshgudimetla on the API design.

@bsalamat bsalamat left a comment

I only looked at the API. I think it is generally fine. As discussed yesterday in the SIG meeting, we should keep the API as an alpha version and wait for feedback from our users before committing to backward compatibility and longer term support.

@@ -48,6 +48,9 @@ type DeschedulerStrategy struct {
type StrategyParameters struct {
NodeResourceUtilizationThresholds NodeResourceUtilizationThresholds
NodeAffinityType []string
// TopologySpreadConstraints describes how a group of pods should be spread across topology
// domains. Descheduler will use these constraints to decide which pods to evict.
NamespacedTopologySpreadConstraints []NamespacedTopologySpreadConstraint

Isn't this name unnecessarily long? It feels a bit like adding part of its documentation to the name.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 11, 2019
@k8s-ci-robot
Contributor

@krmayankk: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 10, 2020
@rhockenbury

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 24, 2020
@seanmalloy
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 8, 2020
@seanmalloy
Member

@krmayankk are you planning to continue working on this pull request?

It would be great if you could rebase and resolve the merge conflicts. I believe the even pod spreading feature in the scheduler is being promoted to beta in k8s v1.18, so this will be a very useful feature once k8s v1.18 is released.

@Huang-Wei

@seanmalloy Correct, PodTopologySpread (feature gate EvenPodsSpread) is going to be beta in 1.18. So it makes great sense to vendor the 1.18 k/k codebase when implementing this PR.

@seanmalloy
Member

@krmayankk the master branch has been updated with the k/k v1.18 vendor dependencies. Are you planning on continuing to work on this pull request?

I'm willing to continue working on this feature and use your original commits as a starting point if you do not have time to complete this work.

Thanks!

@seanmalloy
Member

/close

I started working on the updated code based on this PR. The new branch is here: https://github.com/KohlsTechnology/descheduler/tree/evenpod. I'm hoping to submit a new PR in the next few weeks. See also my comment regarding possible API changes #146 (comment).

@k8s-ci-robot
Contributor

@seanmalloy: Closed this PR.

In response to this:

/close

I started working on the updated code based on this PR. The new branch is here: https://github.com/KohlsTechnology/descheduler/tree/evenpod. I'm hoping to submit a new PR in the next few weeks. See also my comment regarding possible API changes #146 (comment).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Policy for Balancing Pods across topology domains
8 participants