Add support for degradation #301
Conversation
The default behaviour is to not deschedule a Pod if it cannot be placed anywhere else. This is fine for the default case where nodes are more or less homogeneous, but it does not suit heterogeneous clusters where specific nodes may have their own unique hardware configurations that cannot be reproduced anywhere else within the cluster. In the case of heterogeneous accelerators, for example, there may be pods that have a hard dependency on a specific resource (e.g. including container runtimes geared at a specific accelerator), where allowing them to continue running would produce undesired and unpredictable behaviour. Consider the case of a Pod with a hard dependency on a USB-attached accelerator, which may disappear during the lifecycle of the Pod. This adds a new `--degradation-allowed` flag that specifically allows Pods to indicate when they should be descheduled after a label they depend on suddenly goes away.
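The wiring of the flag itself isn't shown in this thread; the following is only a rough sketch of how a boolean flag like this could be registered, assuming a pflag-based options struct (the type and method names are made up, only the flag name comes from this PR):

```go
package options

import (
	"github.com/spf13/pflag"
)

// DeschedulerOptions is an illustrative stand-in for the descheduler's real
// options struct; only the DegradationAllowed field relates to this PR.
type DeschedulerOptions struct {
	DegradationAllowed bool
}

// AddFlags registers the proposed flag; the help text paraphrases the PR description.
func (o *DeschedulerOptions) AddFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&o.DegradationAllowed, "degradation-allowed", false,
		"Allow Pods to be descheduled when a label they depend on goes away, even if they cannot be placed on any other node")
}
```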
Welcome @pmundt!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pmundt. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/assign @k82cn
@@ -61,6 +63,7 @@ func NewPodEvictor(
 		client:             client,
 		policyGroupVersion: policyGroupVersion,
 		dryRun:             dryRun,
+		DegradationAllowed: degradationAllowed,
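For context, a minimal sketch of the evictor type as this diff implies it; the unexported field names mirror the assignments above, while the field types (and anything else about the real struct) are assumptions:

```go
package evictions

import (
	clientset "k8s.io/client-go/kubernetes"
)

// Sketch only: PodEvictor with the new exported field. Exporting it is what
// lets strategies that hold a *PodEvictor read the flag directly.
type PodEvictor struct {
	client             clientset.Interface
	policyGroupVersion string
	dryRun             bool
	// DegradationAllowed relaxes the requirement that an evicted pod must
	// fit on some other node before it is descheduled.
	DegradationAllowed bool
}
```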
The DegradationAllowed field is not used anywhere inside PodEvictor methods. Is this PR still WIP?
No, it's used by the eviction strategy, which is passed the pod evictor. I opted to make this exportable so we could access the flag there, rather than having to modify every single eviction strategy callsite to add in another argument.
I see: podEvictor.DegradationAllowed. It's not a good practice.
rather than having to modify every single eviction strategy callsite to add in another argument
However, it's the right thing to do from the maintainability perspective. If you need to access DegradationAllowed through PodEvictor, introducing a new method is more practical than accessing DegradationAllowed directly. Also, DegradationAllowed is not related to evicting the pod itself, but rather to relaxing the constraint on selecting pods to be evicted.
I would argue that changing every function every time we add a new flag is worse from a maintainability point of view, but it's not my code base, so I'm happy to follow whatever convention people like. In terms of the degradation, you are correct, we are presently doing this through a global flag and applying it across all pods that satisfy the criteria. We could also consider pushing this down to the pod level through an additional annotation?
I would argue that changing every function every time we add a new flag
You are right. The strategyFunction data type is still evolving. I am fine with changing the signature [1] to:
type strategyFunction func(
ctx context.Context,
client clientset.Interface,
strategy api.DeschedulerStrategy,
nodes []*v1.Node,
opts Options,
podEvictor *evictions.PodEvictor,
)
where Options can be defined as:
type Options struct {
EvictLocalStoragePods bool
DegradationAllowed bool
}
as long as Options contains fields generic for any strategy.
[1] https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/descheduler.go#L63
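To make that concrete, a sketch of a strategy consuming the proposed Options value; the signature and strategy name follow the snippets in this thread, while the body, the local Options definition, and the import paths are illustrative:

```go
package strategies

import (
	"context"

	v1 "k8s.io/api/core/v1"
	clientset "k8s.io/client-go/kubernetes"

	"sigs.k8s.io/descheduler/pkg/api"
	"sigs.k8s.io/descheduler/pkg/descheduler/evictions"
)

// Options as proposed above: fields generic to any strategy.
type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed    bool
}

// RemovePodsViolatingNodeAffinity shows how a strategy matching the proposed
// strategyFunction signature could read Options instead of a PodEvictor field.
func RemovePodsViolatingNodeAffinity(
	ctx context.Context,
	client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	opts Options,
	podEvictor *evictions.PodEvictor,
) {
	if opts.DegradationAllowed {
		// Relax the "fits on another node" requirement when selecting pods.
	}
	// ... remaining strategy logic ...
}
```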
We could also consider pushing this down to the pod level through an additional annotation?
If I understand the intention correctly, DegradationAllowed is a strategy-level configuration. As such, the option needs to be passed into the strategy before it's run.
@pmundt if this is only used by the nodeAffinity strategy (and, in your new nodeSelector strategy), does it have to be a global setting? It could just be a StrategyParam field for those, right?
As noted in the other PR (sorry for the run-around, I mention it here for posterity), we would also plan to evaluate it within IsEvictable in cases where we know the pod can be explicitly degraded. I'm therefore not sure that the StrategyParam would be sufficient, unless we're also able to access this from the pod package somehow. In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar, the important thing is simply that we have a mechanism to degrade specific node-local Pods - I'll defer to you on whichever option you find more palatable! If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.
In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar, the important thing is simply that we have a mechanism to degrade specific node-local Pods
Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
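For illustration, a minimal sketch of a pod object carrying that annotation (the pod name and namespace here are made up):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// degradablePod opts a pod in to eviction via the existing annotation
// discussed above, so IsEvictable will not skip it.
func degradablePod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "usb-accelerator-inference", // made-up name
			Namespace: "edge",                      // made-up namespace
			Annotations: map[string]string{
				"descheduler.alpha.kubernetes.io/evict": "true",
			},
		},
	}
}
```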
If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.
With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.
Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.
Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
You're right, for some reason when I first took a look at the evict annotation I missed this. I think this would do the job, yes.
With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.
Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.
No problem, it's not always obvious what the preferred direction is when twiddling in someone else's code base, and as it turns out, I misread what the evict annotation actually does, so I'm happy for another set of eyes while I come to grips with things.
If you're happy with the StrategyParam direction, I'll give this a go with the evict annotation and see how it goes.
Yes let's do that -- I agree that there are cases when these affinity strategies may want to evict even if the pod can't fit anywhere else, so adding another Param for these strategies makes sense to me.
I'm also assuming that a user who enables that Param knows what they are doing, because as @ingvagabund pointed out, if you're not properly handling such an evict-at-all-costs case you will end up with stuck Pending pods. I don't think that's the descheduler's concern, though (as long as the possibility is clearly documented).
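A hypothetical sketch of what such a per-strategy parameter might look like; the field name and its placement in StrategyParameters (and the package) are assumptions, not the merged API:

```go
package api

// Hypothetical sketch: a per-strategy parameter rather than a global flag.
// Only the proposed field is shown; the real StrategyParameters type has
// other fields.
type StrategyParameters struct {
	// DegradationAllowed, when true, lets the node affinity / node selector
	// strategies evict a pod even if no other node can currently run it
	// (the evicted pod may then sit in Pending until a suitable node appears).
	DegradationAllowed bool
}
```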
/kind feature
One thing I noticed is that this PR does not update any of the end user documentation. I believe the new --degradation-allowed flag should be documented as well.
@seanmalloy good point, I've pushed out some documentation updates now.
The default behaviour is to not deschedule a Pod if it cannot be placed
anywhere else, this is fine for the default case where nodes are more
or less homogeneous, but it does not suit the case of heterogeneous
clusters where specific nodes may have their own unique hardware
configurations that cannot be reproduced anywhere else within the
cluster.
In the case of heterogeneous accelerators, for example, there may be
pods that have a hard dependency on a specific resource (e.g. including
container runtimes geared at a specific accelerator), where allowing
them to continue running would produce undesired and unpredictable
behaviour. Consider the case of a Pod with a hard dependency on a
USB-attached accelerator, which may disappear during the lifecycle of
the Pod.
IIUC, you have a set of pods which might lose one of necessary conditions for running properly (e.g. USB-attached accelerator disappeared) and you want descheduler to find such pods through e.g. pod violating node affinity/selector strategy with a promise the strategy will eventually pick all the pods and evict them. Though, given the chance of any of the pods violating the node affinity/selector is not high, you want to allow "degradation" to increase the likelihood. Is my understanding correct?
@@ -45,7 +45,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter
 	for _, pod := range pods {
 		if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
-			if !nodeutil.PodFitsCurrentNode(pod, node) && nodeutil.PodFitsAnyNode(pod, nodes) {
+			if !nodeutil.PodFitsCurrentNode(pod, node) && (nodeutil.PodFitsAnyNode(pod, nodes) || podEvictor.DegradationAllowed == true) {
You don't want to ignore nodeutil.PodFitsAnyNode, since it contains conditions that must hold for a pod to get descheduled. So even when you allow degradation, your pod will stay in a Pending state until a suitable node is found.
Yes, that's basically correct. In our case we are running these Pods across multiple (heterogeneous) Edge gateways (in a vehicle fleet), where we rely on feature discovery / labelling to determine the precise capabilities of the Gateway and to schedule the requisite Pods on it - in this case, DaemonSets for metric exporters and device plugins, and normal Deployments for Pods that are using a specific container runtime + inference model adapted for that specific accelerator.
In the event of a hardware change at the Gateway level we basically want to immediately deschedule Pods that had a tight coupling to the hardware and schedule replacement Pods that use an alternative hardware configuration (we handle this separately, but I mention it so you get a better idea of the scenario). The USB removal scenario is one example; the somewhat more frequent case we have to deal with is an accelerator suddenly becoming unavailable to the node because it's been tasked with a more "important" task (e.g. we may be running an inference model on a GPU for monitoring road conditions, but then someone in the backseat decides they want to play a game, so we need to free the resource up in a reasonable amount of time and shift on to something else that can take over the inference workload).
I wrote a separate label monitor that can immediately trigger the descheduler cronjob on label changes (as well as notify our orchestrator), and this, together with the degradation mode, seems to be working as we need it to.