Add support for degradation #301
Conversation
The default behaviour is to not deschedule a Pod if it cannot be placed anywhere else. This is fine for the default case where nodes are more or less homogeneous, but it does not suit heterogeneous clusters where specific nodes may have their own unique hardware configurations that cannot be reproduced anywhere else within the cluster. In the case of heterogeneous accelerators, for example, there may be pods that have a hard dependency on a specific resource (e.g. including container runtimes geared at a specific accelerator), where allowing them to continue running would produce undesired and unpredictable behaviour. Consider the case of a Pod with a hard dependency on a USB-attached accelerator, which may disappear during the lifecycle of the Pod. This adds a new `--degradation-allowed` flag that specifically allows Pods to indicate when they should be descheduled after a label they depend on suddenly goes away.
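The wiring of the flag itself isn't shown in this thread; the following is only a rough sketch of how a boolean flag like this could be registered, assuming a pflag-based options struct (the type and method names are made up, only the flag name comes from this PR):

```go
package options

import (
	"github.com/spf13/pflag"
)

// DeschedulerOptions is an illustrative stand-in for the descheduler's real
// options struct; only the DegradationAllowed field relates to this PR.
type DeschedulerOptions struct {
	DegradationAllowed bool
}

// AddFlags registers the proposed flag; the help text paraphrases the PR description.
func (o *DeschedulerOptions) AddFlags(fs *pflag.FlagSet) {
	fs.BoolVar(&o.DegradationAllowed, "degradation-allowed", false,
		"Allow Pods to be descheduled when a label they depend on goes away, even if they cannot be placed on any other node")
}
```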
Welcome @pmundt!
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pmundt. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/assign @k82cn
@@ -61,6 +63,7 @@ func NewPodEvictor(
 		client:             client,
 		policyGroupVersion: policyGroupVersion,
 		dryRun:             dryRun,
+		DegradationAllowed: degradationAllowed,
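For context, a minimal sketch of the evictor type as this diff implies it; the unexported field names mirror the assignments above, while the field types (and anything else about the real struct) are assumptions:

```go
package evictions

import (
	clientset "k8s.io/client-go/kubernetes"
)

// Sketch only: PodEvictor with the new exported field. Exporting it is what
// lets strategies that hold a *PodEvictor read the flag directly.
type PodEvictor struct {
	client             clientset.Interface
	policyGroupVersion string
	dryRun             bool
	// DegradationAllowed relaxes the requirement that an evicted pod must
	// fit on some other node before it is descheduled.
	DegradationAllowed bool
}
```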
The DegradationAllowed field is not used anywhere inside PodEvictor methods. Is this PR still WIP?
No, it's used by the eviction strategy, which is passed the pod evictor. I opted to make this exportable so we could access the flag there, rather than having to modify every single eviction strategy callsite to add in another argument.
I see: podEvictor.DegradationAllowed. It's not a good practice.
rather than having to modify every single eviction strategy callsite to add in another argument
However, it's the right thing to do from the maintainability perspective. If you need to access DegradationAllowed through PodEvictor, introducing a new method is more practical than accessing DegradationAllowed directly. Also, DegradationAllowed is not related to evicting the pod itself, but rather to relaxing the constraint on selecting pods to be evicted.
I would argue that changing every function every time we add a new flag is worse from a maintainability point of view, but it's not my code base, so I'm happy to follow whatever convention people like. In terms of the degradation, you are correct, we are presently doing this through a global flag and applying it across all pods that satisfy the criteria. We could also consider pushing this down to the pod level through an additional annotation?
I would argue that changing every function every time we add a new flag
You are right. The strategyFunction data type is still evolving. I am fine with changing the signature [1] to:
type strategyFunction func(
ctx context.Context,
client clientset.Interface,
strategy api.DeschedulerStrategy,
nodes []*v1.Node,
opts Options,
podEvictor *evictions.PodEvictor,
)
where Options can be defined as:
type Options struct {
EvictLocalStoragePods bool
DegradationAllowed bool
}
as long as Options contains fields generic for any strategy.
[1] https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/descheduler.go#L63
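To make that concrete, a sketch of a strategy consuming the proposed Options value; the signature and strategy name follow the snippets in this thread, while the body, the local Options definition, and the import paths are illustrative:

```go
package strategies

import (
	"context"

	v1 "k8s.io/api/core/v1"
	clientset "k8s.io/client-go/kubernetes"

	"sigs.k8s.io/descheduler/pkg/api"
	"sigs.k8s.io/descheduler/pkg/descheduler/evictions"
)

// Options as proposed above: fields generic to any strategy.
type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed    bool
}

// RemovePodsViolatingNodeAffinity shows how a strategy matching the proposed
// strategyFunction signature could read Options instead of a PodEvictor field.
func RemovePodsViolatingNodeAffinity(
	ctx context.Context,
	client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	opts Options,
	podEvictor *evictions.PodEvictor,
) {
	if opts.DegradationAllowed {
		// Relax the "fits on another node" requirement when selecting pods.
	}
	// ... remaining strategy logic ...
}
```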
We could also consider pushing this down to the pod level through an additional annotation?
If I understand the intention correctly, DegradationAllowed is a strategy-level configuration. As such, the option needs to be passed into the strategy before it's run.
@pmundt if this is only used by the nodeAffinity strategy (and, in your new nodeSelector strategy), does it have to be a global setting? It could just be a StrategyParam field for those, right?
As noted in the other PR (sorry for the run-around, I mention it here for posterity), we would also plan to evaluate it within IsEvictable in cases where we know the pod can be explicitly degraded. I'm therefore not sure that the StrategyParam would be sufficient, unless we're also able to access this from the pod package somehow. In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar, the important thing is simply that we have a mechanism to degrade specific node-local Pods - I'll defer to you on whichever option you find more palatable! If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.
In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar, the important thing is simply that we have a mechanism to degrade specific node-local Pods
Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
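For illustration, a minimal sketch of a pod object carrying that annotation (the pod name and namespace here are made up):

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// degradablePod opts a pod in to eviction via the existing annotation
// discussed above, so IsEvictable will not skip it.
func degradablePod() *v1.Pod {
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "usb-accelerator-inference", // made-up name
			Namespace: "edge",                      // made-up namespace
			Annotations: map[string]string{
				"descheduler.alpha.kubernetes.io/evict": "true",
			},
		},
	}
}
```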
If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.
With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.
Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.
Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
You're right, for some reason when I first took a look at the evict annotation I missed this. I think this would do the job, yes.
With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.
Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.
No problem, it's not always obvious what the preferred direction is when twiddling in someone else's code base, and as it turns out, I misread what the evict annotation actually does, so I'm happy for another set of eyes while I come to grips with things.
If you're happy with the StrategyParam direction, I'll give this a go with the evict annotation and see how it goes.
Yes let's do that -- I agree that there are cases when these affinity strategies may want to evict even if the pod can't fit anywhere else, so adding another Param for these strategies makes sense to me.
I'm also assuming that a user who enables that Param knows what they are doing, because as @ingvagabund pointed out, if you're not properly handling such an evict-at-all-costs case you will end up with stuck Pending pods. I don't think that's the descheduler's concern, though (as long as the possibility is clearly documented).
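A hypothetical sketch of what such a per-strategy parameter might look like; the field name and its placement in StrategyParameters (and the package) are assumptions, not the merged API:

```go
package api

// Hypothetical sketch: a per-strategy parameter rather than a global flag.
// Only the proposed field is shown; the real StrategyParameters type has
// other fields.
type StrategyParameters struct {
	// DegradationAllowed, when true, lets the node affinity / node selector
	// strategies evict a pod even if no other node can currently run it
	// (the evicted pod may then sit in Pending until a suitable node appears).
	DegradationAllowed bool
}
```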
/kind feature
One thing I noticed is that this PR does not update any of the end user documentation. I believe the new --degradation-allowed flag should be documented as well.
@seanmalloy good point, I've pushed out some documentation updates now.
The default behaviour is to not deschedule a Pod if it cannot be placed
anywhere else, this is fine for the default case where nodes are more
or less homogeneous, but it does not suit the case of heterogeneous
clusters where specific nodes may have their own unique hardware
configurations that cannot be reproduced anywhere else within the
cluster.
In the case of heterogeneous accelerators, for example, there may be
pods that have a hard dependency on a specific resource (e.g. including
container runtimes geared at a specific accelerator), where allowing
them to continue running would produce undesired and unpredictable
behaviour. Consider the case of a Pod with a hard dependency on a
USB-attached accelerator, which may disappear during the lifecycle of
the Pod.
IIUC, you have a set of pods which might lose one of necessary conditions for running properly (e.g. USB-attached accelerator disappeared) and you want descheduler to find such pods through e.g. pod violating node affinity/selector strategy with a promise the strategy will eventually pick all the pods and evict them. Though, given the chance of any of the pods violating the node affinity/selector is not high, you want to allow "degradation" to increase the likelihood. Is my understanding correct?
@@ -45,7 +45,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter
 	for _, pod := range pods {
 		if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
-			if !nodeutil.PodFitsCurrentNode(pod, node) && nodeutil.PodFitsAnyNode(pod, nodes) {
+			if !nodeutil.PodFitsCurrentNode(pod, node) && (nodeutil.PodFitsAnyNode(pod, nodes) || podEvictor.DegradationAllowed == true) {
You don't want to ignore nodeutil.PodFitsAnyNode, since it contains conditions that must hold for a pod to get descheduled. So even when you allow degradation, your pod will stay in a Pending state until a suitable node is found.
Yes, that's basically correct. In our case we are running these Pods across multiple (heterogeneous) Edge gateways (in a vehicle fleet), where we rely on feature discovery / labelling to determine the precise capabilities of the Gateway and to schedule the requisite Pods on it - in this case, DaemonSets for metric exporters and device plugins, and normal Deployments for Pods that are using a specific container runtime + inference model adapted for that specific accelerator.
In the event of a hardware change at the Gateway level we basically want to immediately deschedule Pods that had a tight coupling to the hardware and schedule replacement Pods that use an alternative hardware configuration (we handle this separately, but I mention it so you get a better idea of the scenario). The USB removal scenario is one example; the somewhat more frequent case we have to deal with is an accelerator suddenly becoming unavailable to the node because it's been tasked with a more "important" task (e.g. we may be running an inference model on a GPU for monitoring road conditions, but then someone in the backseat decides they want to play a game, so we need to free the resource up in a reasonable amount of time and shift on to something else that can take over the inference workload).
I wrote a separate label monitor that can immediately trigger the descheduler cronjob on label changes (as well as notify our orchestrator), and this, together with the degradation mode, seems to be working as we need it to.