
Add support for degradation #301

Closed

Conversation


@pmundt pmundt commented May 28, 2020

The default behaviour is to not deschedule a Pod if it cannot be placed
anywhere else. This is fine for the default case where nodes are more
or less homogeneous, but it does not suit heterogeneous clusters where
specific nodes may have their own unique hardware configurations that
cannot be reproduced anywhere else within the cluster.

In the case of heterogeneous accelerators, for example, there may be
pods that have a hard dependency on a specific resource (e.g. container
runtimes geared towards a specific accelerator), where allowing them to
continue running would produce undesired and unpredictable behaviour.
Consider the case of a Pod with a hard dependency on a USB-attached
accelerator, which may disappear during the lifecycle of the Pod.

This adds a new `--degradation-allowed` flag that specifically allows
Pods to indicate when they should be descheduled after a label they
depend on suddenly goes away.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 28, 2020
@k8s-ci-robot
Contributor

Welcome @pmundt!

It looks like this is your first PR to kubernetes-sigs/descheduler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/descheduler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 28, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pmundt
To complete the pull request process, please assign k82cn
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pmundt
Author

pmundt commented May 28, 2020

/assign @k82cn

@@ -61,6 +63,7 @@ func NewPodEvictor(
client: client,
policyGroupVersion: policyGroupVersion,
dryRun: dryRun,
DegradationAllowed: degradationAllowed,
Contributor

DegradationAllowed field is not used anywhere inside PodEvictor methods. Is this PR still WIP?

Author

No, it's used by the eviction strategy, which the pod evictor is passed into. I opted to make this field exported so we could access the flag there, rather than having to modify every single eviction strategy call site to add another argument.

Contributor

I see: podEvictor.DegradationAllowed. It's not good practice.

rather than having to modify every single eviction strategy call site to add another argument

However, it's the right thing to do from a maintainability perspective. If you need to access DegradationAllowed through PodEvictor, introducing a new method is more practical than accessing DegradationAllowed directly. Also, DegradationAllowed is not related to evicting the pod itself, but rather to relaxing the constraints for selecting which pods are evicted.
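For illustration, a minimal sketch of the accessor approach suggested here, assuming PodEvictor keeps the flag in an unexported field (the trimmed-down type and method below are hypothetical, not the descheduler's actual code):

package evictions

// PodEvictor is a trimmed-down, hypothetical stand-in used only to
// illustrate the accessor pattern.
type PodEvictor struct {
	dryRun             bool
	degradationAllowed bool
}

// DegradationAllowed reports whether strategies may relax the
// "pod fits on another node" requirement when selecting pods to evict.
func (pe *PodEvictor) DegradationAllowed() bool {
	return pe.degradationAllowed
}

Strategies would then call podEvictor.DegradationAllowed() instead of reading an exported field directly.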

Author

I would argue that changing every function every time we add a new flag is worse from a maintainability point of view, but it's not my code base, so I'm happy to follow whatever convention people like. In terms of the degradation, you are correct: we are presently doing this through a global flag and applying it across all pods that satisfy the criteria. We could also consider pushing this down to the pod level through an additional annotation?

Contributor

I would argue that changing every function every time we add a new flag

You are right. The strategyFunction data type is still evolving. I am fine with changing the signature [1] to:

type strategyFunction func(
	ctx context.Context,
	client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	opts Options,
	podEvictor *evictions.PodEvictor,
)

where Options can be defined as:

type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed bool
}

as long as Options contains fields generic for any strategy.

[1] https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/descheduler.go#L63
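For illustration, a self-contained sketch of how a strategy could consult such an Options value, with the descheduler-specific types reduced to plain booleans (everything below is hypothetical, not the project's actual code):

package main

import "fmt"

// Options carries settings that are generic to any strategy, as proposed above.
type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed    bool
}

// shouldEvict mirrors the core decision: a pod violating its node affinity is
// normally evicted only if it fits on another node, unless degradation is allowed.
func shouldEvict(fitsCurrentNode, fitsAnyNode bool, opts Options) bool {
	return !fitsCurrentNode && (fitsAnyNode || opts.DegradationAllowed)
}

func main() {
	opts := Options{DegradationAllowed: true}
	// The pod no longer fits its current node and fits nowhere else,
	// but degradation is allowed, so it is still selected for eviction.
	fmt.Println(shouldEvict(false, false, opts)) // true
}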

We could also consider pushing this down to the pod level through an additional annotation?

If I understand the intention correctly, DegradationAllowed is a strategy-level configuration. As such, the option needs to be passed into the strategy before it's run.

Contributor

@damemi damemi Jun 1, 2020

@pmundt if this is only used by nodeAffinity (and, in your new strategy, nodeSelector), does it have to be a global setting? It could just be a StrategyParam field for those, right?

Author

@pmundt pmundt Jun 2, 2020

As noted in the other PR (sorry for the run-around, I mention it here for posterity), we would also plan to evaluate it within IsEvictable in cases where we know the pod can be explicitly degraded. I'm therefore not sure that the StrategyParam would be sufficient, unless we're also able to access this from the pod package somehow. In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar; the important thing is simply that we have a mechanism to degrade specific node-local Pods. I'll defer to you on whichever option you find more palatable! If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.

Contributor

@damemi damemi Jun 2, 2020

In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar; the important thing is simply that we have a mechanism to degrade specific node-local Pods

Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
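For reference, a small Go sketch of a presence-based check for that annotation (the helper and pod below are hypothetical, and the descheduler's actual implementation may differ):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// evictAnnotation is the annotation referenced above.
const evictAnnotation = "descheduler.alpha.kubernetes.io/evict"

// hasEvictAnnotation returns true when the annotation is present on the pod,
// regardless of its value.
func hasEvictAnnotation(pod *v1.Pod) bool {
	_, found := pod.ObjectMeta.Annotations[evictAnnotation]
	return found
}

func main() {
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "usb-accelerator-pod",
			Annotations: map[string]string{evictAnnotation: ""},
		},
	}
	fmt.Println(hasEvictAnnotation(pod)) // true
}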

If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.

With the above annotation, I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.

Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.

Author

Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.

You're right, for some reason when I first took a look at the evict annotation I missed this. I think this would do the job, yes.

With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.

Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.

No problem, it's not always obvious what the preferred direction is when twiddling in someone else's code base. As it turns out, I misread what the evict annotation actually does, so I'm happy for another set of eyes while I come to grips with things.

If you're happy with the StrategyParam direction, I'll give this a go with the evict annotation and see how it goes.

Contributor

Yes, let's do that. I agree that there are cases where these affinity strategies may want to evict even if the pod can't fit anywhere else, so adding another Param for these strategies makes sense to me.

I'm also assuming that if you enable that Param the user must know what they are doing, because, as @ingvagabund pointed out, if you're not properly handling such an evict-at-all-costs case you will end up with stuck Pending pods. I don't think that's the descheduler's concern, though (as long as the possibility is clearly documented).

@seanmalloy
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 29, 2020
@seanmalloy
Member

One thing I noticed is that this PR does not update any of the end-user documentation. I believe the new `--degradation-allowed` CLI option should be documented in the file `docs/user-guide.md`.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 29, 2020
@pmundt
Author

pmundt commented May 29, 2020

@seanmalloy good point, I've pushed out some documentation updates now.

Contributor

@ingvagabund ingvagabund left a comment

The default behaviour is to not deschedule a Pod if it cannot be placed
anywhere else. This is fine for the default case where nodes are more
or less homogeneous, but it does not suit heterogeneous clusters where
specific nodes may have their own unique hardware configurations that
cannot be reproduced anywhere else within the cluster.

In the case of heterogeneous accelerators, for example, there may be
pods that have a hard dependency on a specific resource (e.g. container
runtimes geared towards a specific accelerator), where allowing them to
continue running would produce undesired and unpredictable behaviour.
Consider the case of a Pod with a hard dependency on a USB-attached
accelerator, which may disappear during the lifecycle of the Pod.

IIUC, you have a set of pods which might lose one of the necessary conditions for running properly (e.g. a USB-attached accelerator disappears), and you want the descheduler to find such pods through e.g. the "pods violating node affinity/selector" strategies, with the expectation that the strategies will eventually pick all of these pods and evict them. However, given that the chance of any of the pods violating the node affinity/selector is not high, you want to allow "degradation" to increase the likelihood. Is my understanding correct?

@@ -45,7 +45,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter

for _, pod := range pods {
if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
if !nodeutil.PodFitsCurrentNode(pod, node) && nodeutil.PodFitsAnyNode(pod, nodes) {
if !nodeutil.PodFitsCurrentNode(pod, node) && (nodeutil.PodFitsAnyNode(pod, nodes) || podEvictor.DegradationAllowed == true) {
Contributor

You don't want to ignore nodeutil.PodFitsAnyNode, since it contains conditions that must hold for a pod to be descheduled. So even when you allow degradation, your pod will stay in the Pending state until a suitable node is found.

@pmundt
Author

pmundt commented Jun 2, 2020

Yes, that's basically correct. In our case we are running these Pods across multiple (heterogeneous) Edge gateways (in a vehicle fleet), where we rely on feature discovery/labelling to determine the precise capabilities of the Gateway and to schedule the requisite Pods on it: in this case, DaemonSets for metric exporters and device plugins, and normal Deployments for Pods that use a specific container runtime + inference model adapted for that specific accelerator. In the event of a hardware change at the Gateway level, we basically want to immediately deschedule Pods that had a tight coupling to the hardware and schedule replacement Pods that use an alternative hardware configuration (we handle this separately, but I mention it so you get a better idea of the scenario).

The USB removal scenario is one example; the somewhat more frequent case we have to deal with is an accelerator suddenly becoming unavailable to the node because it has been claimed for a more "important" task (e.g. we may be running an inference model on a GPU for monitoring road conditions, but then someone in the back seat decides they want to play a game, so we need to free the resource up in a reasonable amount of time and shift to something else that can take over the inference workload). I wrote a separate label monitor that can immediately trigger the descheduler CronJob on label changes (as well as notify our orchestrator), and this, together with the degradation mode, seems to be working as we need it to.

@pmundt
Author

pmundt commented Jun 3, 2020

Based on discussion with @damemi, this PR has been superseded by #311.

@pmundt pmundt closed this Jun 3, 2020