Add support for degradation #301

Closed
22 changes: 18 additions & 4 deletions README.md
@@ -11,7 +11,7 @@ pod can or can not be scheduled, are guided by its configurable policy which com
rules, called predicates and priorities. The scheduler's decisions are influenced by its view of
a Kubernetes cluster at that point of time when a new pod appears for scheduling.
As Kubernetes clusters are very dynamic and their state changes over time, there may be a desire
to move already running pods to some other nodes for various reasons:
to move already running pods to some other nodes, or to allow the pods to be terminated outright:

* Some nodes are under- or over-utilized.
* The original scheduling decision does not hold true any more, as taints or labels are added to
@@ -20,9 +20,12 @@ or removed from nodes, pod/node affinity requirements are not satisfied any more
* New nodes are added to clusters.

Consequently, there might be several pods scheduled on less desirable nodes in a cluster.
Descheduler, based on its policy, finds pods that can be moved and evicts them. Please
note, in current implementation, descheduler does not schedule replacement of evicted pods
but relies on the default scheduler for that.
Descheduler, based on its policy, finds pods that can be moved and evicts them. By default,
Descheduler aims to ensure that there is no service degradation across the cluster. In the case
where a pod can no longer run on its node and no suitable rescheduling candidate can be found,
Descheduler can optionally terminate the problematic pod, provided service degradation is allowed.
Please note that, in its current implementation, descheduler does not schedule replacements for
evicted or terminated pods but instead relies on the default scheduler for that.

## Quick Start

@@ -238,6 +241,17 @@ never evicted because these pods won't be recreated.
Pods subject to a Pod Disruption Budget (PDB) are not evicted if descheduling violates the PDB. The pods
are evicted by using the eviction subresource to handle the PDB.

### Degradation

By default, pods marked for eviction are evicted only if a suitable node can be found to reschedule
them onto. This ensures that, when no rescheduling candidate exists, the pod keeps running on its
current node.

In certain cases, such as when a pod was scheduled based on labelling criteria that are no longer satisfied,
it can be preferable (and, at times, essential) to terminate the running pod even if it has no
rescheduling candidate. This behaviour can be enabled by running Descheduler with degradation allowed,
activated via the `--degradation-allowed` CLI flag (see the sketch below).
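
To make the behaviour concrete, here is a minimal sketch (illustrative only, not the project's actual code; `shouldEvict` and its boolean inputs stand in for the checks performed by helpers such as `nodeutil.PodFitsCurrentNode` and `nodeutil.PodFitsAnyNode`):

```go
package main

import "fmt"

// shouldEvict sketches the relaxed decision: a pod that no longer fits its
// current node is normally evicted only when some other node can take it;
// with degradation allowed it is evicted (and effectively terminated) anyway.
func shouldEvict(fitsCurrentNode, fitsAnyNode, degradationAllowed bool) bool {
	if fitsCurrentNode {
		return false // the pod can keep running where it is
	}
	return fitsAnyNode || degradationAllowed
}

func main() {
	// No node fits the pod any more: it is evicted only if degradation is allowed.
	fmt.Println(shouldEvict(false, false, false)) // false
	fmt.Println(shouldEvict(false, false, true))  // true
}
```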

## Compatibility Matrix
The below compatibility matrix shows the k8s client package (client-go, apimachinery, etc) versions that descheduler
is compiled with. At this time descheduler does not have a hard dependency on a specific k8s release. However a
1 change: 1 addition & 0 deletions cmd/descheduler/app/options/options.go
@@ -52,6 +52,7 @@ func (rs *DeschedulerServer) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&rs.KubeconfigFile, "kubeconfig", rs.KubeconfigFile, "File with kube configuration.")
fs.StringVar(&rs.PolicyConfigFile, "policy-config-file", rs.PolicyConfigFile, "File with descheduler policy configuration.")
fs.BoolVar(&rs.DryRun, "dry-run", rs.DryRun, "execute descheduler in dry run mode.")
fs.BoolVar(&rs.DegradationAllowed, "degradation-allowed", rs.DegradationAllowed, "Allow descheduling of Pods that have no rescheduling candidates")
// node-selector query causes descheduler to run only on nodes that matches the node labels in the query
fs.StringVar(&rs.NodeSelector, "node-selector", rs.NodeSelector, "Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)")
// max-no-pods-to-evict limits the maximum number of pods to be evicted per node by descheduler.
1 change: 1 addition & 0 deletions docs/user-guide.md
@@ -25,6 +25,7 @@ Available Commands:
Flags:
--add-dir-header If true, adds the file directory to the header
--alsologtostderr log to standard error as well as files
--degradation-allowed Allow descheduling of Pods that have no rescheduling candidates
--descheduling-interval duration Time interval between two consecutive descheduler executions. Setting this value instructs the descheduler to run in a continuous loop at the interval specified.
--dry-run execute descheduler in dry run mode.
--evict-local-storage-pods Enables evicting pods using local storage by descheduler
3 changes: 3 additions & 0 deletions pkg/apis/componentconfig/types.go
@@ -40,6 +40,9 @@ type DeschedulerConfiguration struct {
// Dry run
DryRun bool

// Degradation allowed
DegradationAllowed bool

// Node selectors
NodeSelector string

3 changes: 3 additions & 0 deletions pkg/apis/componentconfig/v1alpha1/types.go
@@ -40,6 +40,9 @@ type DeschedulerConfiguration struct {
// Dry run
DryRun bool `json:"dryRun,omitempty"`

// Degradation allowed
DegradationAllowed bool `json:"degradationAllowed,omitempty"`

// Node selectors
NodeSelector string `json:"nodeSelector,omitempty"`

3 changes: 2 additions & 1 deletion pkg/descheduler/descheduler.go
@@ -87,7 +87,7 @@ func RunDeschedulerStrategies(ctx context.Context, rs *options.DeschedulerServer
return
}

if len(nodes) <= 1 {
if len(nodes) <= 1 && !rs.DegradationAllowed {
klog.V(1).Infof("The cluster size is 0 or 1 meaning eviction causes service disruption or degradation. So aborting..")
close(stopChannel)
return
@@ -97,6 +97,7 @@ func RunDeschedulerStrategies(ctx context.Context, rs *options.DeschedulerServer
rs.Client,
evictionPolicyGroupVersion,
rs.DryRun,
rs.DegradationAllowed,
rs.MaxNoOfPodsToEvictPerNode,
nodes,
)
3 changes: 3 additions & 0 deletions pkg/descheduler/evictions/evictions.go
@@ -40,6 +40,7 @@ type PodEvictor struct {
client clientset.Interface
policyGroupVersion string
dryRun bool
DegradationAllowed bool
maxPodsToEvict int
nodepodCount nodePodEvictedCount
}
@@ -48,6 +49,7 @@ func NewPodEvictor(
client clientset.Interface,
policyGroupVersion string,
dryRun bool,
degradationAllowed bool,
maxPodsToEvict int,
nodes []*v1.Node,
) *PodEvictor {
@@ -61,6 +63,7 @@
client: client,
policyGroupVersion: policyGroupVersion,
dryRun: dryRun,
DegradationAllowed: degradationAllowed,
Contributor

The DegradationAllowed field is not used anywhere inside PodEvictor methods. Is this PR still a WIP?

Author

No, it's used by the eviction strategy, which is passed the pod evictor. I opted to make the field exported so we could access the flag there, rather than having to modify every single eviction strategy call site to add another argument.

Contributor

I see: podEvictor.DegradationAllowed. It's not a good practice.

rather than having to modify every single eviction strategy callsite to add in another argument

However, it's the right thing to do from a maintainability perspective. If you need to access DegradationAllowed through PodEvictor, introducing a new method is more practical than accessing the field directly. Also, DegradationAllowed is not related to evicting the pod itself, but rather to relaxing the constraints for selecting pods to be evicted.
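
For reference, a minimal sketch of the accessor-style alternative suggested here (the unexported field and method name are illustrative and not part of this PR):

```go
package evictions

// PodEvictor is trimmed to the field relevant to this discussion; the real
// struct also carries the client, policy group version, dry-run flag, etc.
type PodEvictor struct {
	degradationAllowed bool
}

// DegradationAllowed exposes the setting through a method rather than a
// public field, so strategies query the evictor without touching its state.
func (pe *PodEvictor) DegradationAllowed() bool {
	return pe.degradationAllowed
}
```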

Author

I would argue that changing every function every time we add a new flag is worse from a maintainability point of view, but it's not my code base, so I'm happy to follow whatever convention people like. In terms of the degradation, you are correct, we are presently doing this through a global flag and applying it across all pods that satisfy the criteria. We could also consider pushing this down to the pod level through an additional annotation?

Contributor

I would argue that changing every function every time we add a new flag

You are right. The strategyFunction data type is still evolving. I am fine with changing the signature [1] to:

```go
type strategyFunction func(
	ctx context.Context,
	client clientset.Interface,
	strategy api.DeschedulerStrategy,
	nodes []*v1.Node,
	opts Options,
	podEvictor *evictions.PodEvictor,
)
```

where Options can be defined as:

```go
type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed    bool
}
```

as long as Options contains fields generic for any strategy.

[1] https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/descheduler.go#L63
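
A rough sketch of how such options could flow from the parsed flags into a strategy call (toy types only; the real strategyFunction also receives the context, client, strategy, nodes and pod evictor):

```go
package main

import "fmt"

// Options carries strategy-agnostic settings, following the suggestion above.
type Options struct {
	EvictLocalStoragePods bool
	DegradationAllowed    bool
}

// strategyFunc is a toy stand-in for the descheduler's strategyFunction type.
type strategyFunc func(opts Options)

func main() {
	// In the descheduler these values would come from the CLI flags
	// (--evict-local-storage-pods, --degradation-allowed).
	opts := Options{DegradationAllowed: true}

	strategies := map[string]strategyFunc{
		"RemovePodsViolatingNodeAffinity": func(opts Options) {
			fmt.Println("degradation allowed:", opts.DegradationAllowed)
		},
	}
	for name, run := range strategies {
		fmt.Println("running", name)
		run(opts)
	}
}
```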

We could also consider pushing this down to the pod level through an additional annotation?

If I understand the intention correctly, DegradationAllowed is a strategy-level configuration. As such, the option needs to be passed into the strategy before it's run.

Contributor @damemi (Jun 1, 2020)

@pmundt if this is only used by the nodeAffinity strategy (and, in your new nodeSelector strategy), does it have to be a global setting? It could just be a StrategyParam field for those, right?

Author @pmundt (Jun 2, 2020)

As noted in the other PR (sorry for the run-around, I mention it here for posterity), we would also plan to evaluate it within IsEvictable in cases where we know the pod can be explicitly degraded. I'm therefore not sure that the StrategyParam would be sufficient, unless we're also able to access this from the pod package somehow. In terms of the global setting, I don't mind if we get rid of it and use a Pod annotation or similar; the important thing is simply that we have a mechanism to degrade specific node-local Pods - I'll defer to you on whichever option you find more palatable! If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.

Contributor @damemi (Jun 2, 2020)

In terms of the global setting, I don't mind if we get rid of this and use a Pod annotation or similar, the important thing is simply that we have a mechanism to degrade specific node-local Pods

Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.
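
For context, a small sketch of how that annotation is attached to a pod and looked up (the `hasEvictAnnotation` helper is illustrative; per the comment above, in the project the check lives inside IsEvictable):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// The annotation discussed above: its presence asks the descheduler to treat
// the pod as evictable even where the usual safeguards would skip it.
const evictAnnotation = "descheduler.alpha.kubernetes.io/evict"

// hasEvictAnnotation reports whether the bypass has been requested for a pod.
func hasEvictAnnotation(pod *v1.Pod) bool {
	_, ok := pod.Annotations[evictAnnotation]
	return ok
}

func main() {
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:        "node-local-worker",
		Annotations: map[string]string{evictAnnotation: "true"},
	}}
	fmt.Println(hasEvictAnnotation(pod)) // true
}
```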

If we leave it as an annotation, we could presumably also leave it as a StrategyParam, as the annotation could be tested independently within IsEvictable.

With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.

Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.

Author

Is this something that the existing descheduler.alpha.kubernetes.io/evict annotation could solve? It is already checked in IsEvictable and effectively bypasses that check for specific pods.

You're right, for some reason when I first took a look at the evict annotation I missed this. I think this would do the job, yes.

With the above annotation^ I think this is what you want. In the other PR thread you mentioned wanting to specifically bypass DaemonSets too, which that annotation supports.

Sorry if it seems like I'm being difficult, I just don't yet see the need to pass this information to every strategy.

No problem, it's not always obvious what the preferred direction is when twiddling in someone else's code base, and as it turns out, I misread what the evict annotation actually does, so I'm happy for another set of eyes while I come to grips with things.

If you're happy with the StrategyParam direction, I'll give this a go with the evict annotation and see how it goes.

Contributor

Yes let's do that -- I agree that there are cases when these affinity strategies may want to evict even if the pod can't fit anywhere else, so adding another Param for these strategies makes sense to me.

I'm also assuming that if you enable that Param, the user must know what they are doing, because, as @ingvagabund pointed out, if you're not properly handling such an evict-at-all-costs case you will end up with stuck Pending pods. I don't think that's the descheduler's concern, though (as long as the possibility is clearly documented).

maxPodsToEvict: maxPodsToEvict,
nodepodCount: nodePodCount,
}
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/duplicates_test.go
@@ -197,6 +197,7 @@ func TestFindDuplicatePods(t *testing.T) {
fakeClient,
"v1",
false,
false,
testCase.maxPodsToEvict,
[]*v1.Node{node},
)
2 changes: 2 additions & 0 deletions pkg/descheduler/strategies/lownodeutilization_test.go
@@ -355,6 +355,7 @@ func TestLowNodeUtilization(t *testing.T) {
fakeClient,
"v1",
false,
false,
test.expectedPodsEvicted,
nodes,
)
@@ -628,6 +629,7 @@ func TestWithTaints(t *testing.T) {
&fake.Clientset{Fake: *fakePtr},
"policy/v1",
false,
false,
item.evictionsExpected,
item.nodes,
)
2 changes: 1 addition & 1 deletion pkg/descheduler/strategies/node_affinity.go
@@ -45,7 +45,7 @@ func RemovePodsViolatingNodeAffinity(ctx context.Context, client clientset.Inter

for _, pod := range pods {
if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil && pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
if !nodeutil.PodFitsCurrentNode(pod, node) && nodeutil.PodFitsAnyNode(pod, nodes) {
if !nodeutil.PodFitsCurrentNode(pod, node) && (nodeutil.PodFitsAnyNode(pod, nodes) || podEvictor.DegradationAllowed) {
Contributor

You don't want to ignore nodeutil.PodFitsAnyNode, since it contains conditions that must hold for a pod to be descheduled. So even when you allow degradation, your pod will stay in the Pending state until a suitable node is found.

klog.V(1).Infof("Evicting pod: %v", pod.Name)
if _, err := podEvictor.EvictPod(ctx, pod, node); err != nil {
klog.Errorf("Error evicting pod: (%#v)", err)
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/node_affinity_test.go
@@ -155,6 +155,7 @@ func TestRemovePodsViolatingNodeAffinity(t *testing.T) {
fakeClient,
"v1",
false,
false,
tc.maxPodsToEvict,
tc.nodes,
)
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/node_taint_test.go
@@ -168,6 +168,7 @@ func TestDeletePodsViolatingNodeTaints(t *testing.T) {
fakeClient,
"v1",
false,
false,
tc.maxPodsToEvict,
tc.nodes,
)
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/pod_antiaffinity_test.go
@@ -82,6 +82,7 @@ func TestPodAntiAffinity(t *testing.T) {
fakeClient,
"v1",
false,
false,
test.maxPodsToEvict,
[]*v1.Node{node},
)
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/pod_lifetime_test.go
@@ -155,6 +155,7 @@ func TestPodLifeTime(t *testing.T) {
fakeClient,
"v1",
false,
false,
tc.maxPodsToEvict,
[]*v1.Node{node},
)
1 change: 1 addition & 0 deletions pkg/descheduler/strategies/toomanyrestarts_test.go
@@ -169,6 +169,7 @@ func TestRemovePodsHavingTooManyRestarts(t *testing.T) {
fakeClient,
"v1",
false,
false,
tc.maxPodsToEvict,
[]*v1.Node{node},
)
1 change: 1 addition & 0 deletions test/e2e/e2e_test.go
@@ -130,6 +130,7 @@ func startEndToEndForLowNodeUtilization(ctx context.Context, clientset clientset
clientset,
evictionPolicyGroupVersion,
false,
false,
0,
nodes,
)