
Permit descheduling of critical pods #378

Closed
roobert opened this issue Aug 18, 2020 · 28 comments · Fixed by #523
Labels: good first issue · help wanted · kind/feature

@roobert commented Aug 18, 2020

We scale our dev and test GKE clusters down to 0 nodes for roughly 8 hours a day; over a month this works out to roughly a 50% cost saving.

The problem we have is that when we scale the clusters back up, kube-dns often gets multiple instances scheduled onto the same node, which can prevent our other services from being scheduled properly. This is because we use the smallest possible nodes, again for cost-saving reasons.

This is a known issue (kubernetes/kubernetes#52193) that was at one point solved by adding antiAffinity rules; however, the rules caused performance issues and were eventually removed.

Since the changes have been removed from Kubernetes core and we can't modify the configuration of kube-system pods in GKE ourselves, the only other option seems to be to use the descheduler. However, the descheduler docs say under "Pod Evictions" that critical pods can never be evicted: https://github.com/kubernetes-sigs/descheduler#pod-evictions

I understand why this is a sensible default, but I was wondering whether the project would be open to a patch that makes this configurable?

@roobert added the kind/feature label Aug 18, 2020
@lixiang233 (Contributor)

Currently, the only way to evict system-critical pods is to add the descheduler.alpha.kubernetes.io/evict annotation to the pods you want evicted. Since, as you mentioned, you can't modify the configuration of kube-system pods in GKE yourselves, I'm not sure whether this is helpful.
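For illustration, opting a specific pod in for eviction with that annotation might look roughly like the fragment below (a minimal sketch; the pod name is hypothetical, and GKE's addon manager may revert edits to managed kube-system workloads):

apiVersion: v1
kind: Pod
metadata:
  name: kube-dns-example            # hypothetical pod name
  namespace: kube-system
  annotations:
    # the descheduler checks only for the presence of this annotation;
    # "true" is just a conventional value
    descheduler.alpha.kubernetes.io/evict: "true"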

I think it's reasonable to add another flag, evict-system-critical-pods, to the descheduler, just like evict-local-storage-pods. WDYT @damemi @ingvagabund @seanmalloy?

@seanmalloy (Member)

@roobert and @lixiang233 this seems reasonable to me. @damemi and @ingvagabund, what do you think?

@ingvagabund (Contributor) commented Aug 19, 2020

@roobert by a critical pod, do you mean a pod whose priority class is system-critical (2000000000)? Or one identified by IsCriticalPod (which includes mirror/static pods)?

@ingvagabund (Contributor)

I would like to avoid introducing a new flag. At the same time, it would be bothersome to have to set the option for every strategy. I don't see it as a blocking issue, though eventually it will be more practical to move all the relevant flags (node selector, evict with local storage) under a versioned config. If you need this really soon (as part of the 1.19 release), I am fine with going ahead and adding the flag, though there's a high chance the flag will get deprecated soon (e.g. in 1.20).

@ingvagabund (Contributor)

Or, you might put the new flag under DeschedulerPolicy. E.g.:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictSystemCriticalPods: true  # <-- the new field here
strategies:
  "RemoveDuplicates":
     enabled: true
     params:
       removeDuplicates:
         excludeOwnerKinds:
         - "ReplicaSet"

The same could be done for the node selector and evict-local-storage-pods bits.
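A rough sketch of how those might look as policy fields (field names here are illustrative and may differ from what #380 actually introduces):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: "node-role.kubernetes.io/worker="   # illustrative replacement for --node-selector
evictLocalStoragePods: true                       # illustrative replacement for --evict-local-storage-pods
evictSystemCriticalPods: true                     # the new field proposed above
strategies:
  "RemoveDuplicates":
    enabled: true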

@ingvagabund (Contributor)

#380 for promoting some flags into v1alpha1 descheduler policy fields

@roobert (Author) commented Aug 19, 2020

> @roobert by a critical pod, do you mean a pod whose priority class is system-critical (2000000000)? Or one identified by IsCriticalPod (which includes mirror/static pods)?

Hi @ingvagabund and all, thank you for the feedback!

I've included the output from kubectl get pods --namespace=kube-system kube-dns-XXX --output yaml (GKE: 1.15.12-gke.2 - this is the "stable" track in GCP):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  creationTimestamp: "2020-08-18T23:00:25Z"
  generateName: kube-dns-5c446b66bd-
  labels:
    k8s-app: kube-dns
    pod-template-hash: 5c446b66bd
  name: kube-dns-5c446b66bd-75x45
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kube-dns-5c446b66bd
    uid: 71773671-3d53-4293-a7fe-b139a2e5270f
  resourceVersion: "1054522"
  selfLink: /api/v1/namespaces/kube-system/pods/kube-dns-5c446b66bd-75x45
  uid: 86c94bba-af29-4c75-a3ae-22aecfa1f573
spec:
  containers:
  - args:
    - --domain=cluster.local.
    - --dns-port=10053
    - --config-dir=/kube-dns-config
    - --v=2
    env:
    - name: PROMETHEUS_PORT
      value: "10055"
    image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthcheck/kubedns
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: kubedns
    ports:
    - containerPort: 10053
      name: dns-local
      protocol: UDP
    - containerPort: 10053
      name: dns-tcp-local
      protocol: TCP
    - containerPort: 10055
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /readiness
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      limits:
        memory: 170Mi
      requests:
        cpu: 100m
        memory: 70Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /kube-dns-config
      name: kube-dns-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - args:
    - -v=2
    - -logtostderr
    - -configDir=/etc/k8s/dns/dnsmasq-nanny
    - -restartDnsmasq=true
    - --
    - -k
    - --cache-size=1000
    - --no-negcache
    - --dns-forward-max=1500
    - --log-facility=-
    - --server=/cluster.local/127.0.0.1#10053
    - --server=/in-addr.arpa/127.0.0.1#10053
    - --server=/ip6.arpa/127.0.0.1#10053
    image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthcheck/dnsmasq
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: dnsmasq
    ports:
    - containerPort: 53
      name: dns
      protocol: UDP
    - containerPort: 53
      name: dns-tcp
      protocol: TCP
    resources:
      requests:
        cpu: 150m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/k8s/dns/dnsmasq-nanny
      name: kube-dns-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - args:
    - --v=2
    - --logtostderr
    - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,SRV
    - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,SRV
    image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /metrics
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: sidecar
    ports:
    - containerPort: 10054
      name: metrics
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - command:
    - /monitor
    - --source=kubedns:http://localhost:10054?whitelisted=probe_kubedns_latency_ms,probe_kubedns_errors,dnsmasq_misses,dnsmasq_hits
    - --stackdriver-prefix=container.googleapis.com/internal/addons
    - --api-override=https://monitoring.googleapis.com/
    - --pod-id=$(POD_NAME)
    - --namespace-id=$(POD_NAMESPACE)
    - --v=2
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: k8s.gcr.io/prometheus-to-sd:v0.4.2
    imagePullPolicy: IfNotPresent
    name: prometheus-to-sd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  dnsPolicy: Default
  enableServiceLinks: true
  nodeName: gke-cluster0-nodepool0-fc744471-ppp6
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: kube-dns
  serviceAccountName: kube-dns
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: components.gke.io/gke-managed-components
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
---8<--snip--8<---

So we can see:
metadata.annotations.scheduler.alpha.kubernetes.io/critical-pod = ""
spec.priority = 2000000000
spec.priorityClassName = system-cluster-critical

So I think I need to configure the priorityClass and also use the evict-system-critical-pods option I've started adding in #379?

I have no preference for how this functionality is provided, so I am happy to go along with any suggestions. However, if this type of option is already provided (evict-local-storage-pods), then maybe it'd be OK to merge my PR and treat changing this group of options as a separate PR?

I've created a PR which essentially copies the evict-local-storage-pods pattern; if you're happy for us to proceed down this path, I'll add some tests.

@ingvagabund (Contributor)

> I have no preference for how this functionality is provided, so I am happy to go along with any suggestions. However, if this type of option is already provided (evict-local-storage-pods), then maybe it'd be OK to merge my PR and treat changing this group of options as a separate PR?

As long as both PRs are merged in the same release.

Based on the pod manifest I see you are running k8s v1.15.8. Do you need this merged soon so it can land in 1.19, or would 1.20 be acceptable?

@roobert (Author) commented Aug 19, 2020

@ingvagabund - no rush from my perspective!

@damemi (Contributor) commented Aug 20, 2020

This is a reasonable use case, but as was mentioned, it could be dangerous to cluster stability if it's not carefully implemented and the effects aren't made clear to users: this will evict kube-system pods (which is really the only reason this is needed).

Given that risk and the fact that @roobert says there's no rush, I think it would be best to get this in for 1.20 (especially since we are trying to release 1.19 sometime next week). This way we can make sure there is thorough review and get input from more parties.

@roobert (Author) commented Aug 24, 2020

Hey, I've now had time to test this locally and found that another change is necessary for this to work. Specifically, the maximum thresholdPriority value is 2000000000, which is the same as the kube-dns pod's priority, so the pod always gets marked as non-evictable.

I'm not sure what the preferred way to work around this would be. Should we increase the maximum thresholdPriority value to 2000000001 or similar? Or would a flag to disable thresholdPriority checking be better? I tried setting thresholdPriorityClass to some fake class as a workaround, but that failed for another reason.
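To make the conflict concrete, here's a minimal policy sketch (assuming the per-strategy thresholdPriority parameter): as far as I can tell, a pod is only evictable when its priority is strictly lower than the threshold, so a kube-dns pod at priority 2000000000 can never fall below the maximum allowed threshold of 2000000000.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      # 2000000000 is both the kube-dns pod's priority and the maximum
      # accepted threshold, so "priority < threshold" is never true for it
      thresholdPriority: 2000000000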

@ingvagabund (Contributor)

If you are allowed to evict a critical pod, you are allowed to evict any pod. So setting --evict-system-critical-pods will effectively disable the check against any priority threshold.
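In other words, a policy along these lines (using the evictSystemCriticalPods field name suggested earlier; the final shape may differ) would simply bypass any configured priority threshold:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictSystemCriticalPods: true        # proposed: priority-threshold checks are skipped entirely
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      thresholdPriority: 2000000000  # would no longer block eviction once the option above is set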

@roobert (Author) commented Aug 24, 2020

@ingvagabund - is that right? I'll rerun some tests to double check my original findings!

@ingvagabund (Contributor)

> @ingvagabund - is that right? I'll rerun some tests to double check my original findings!

I am talking about the future implementation as the expected solution, unless the error you are describing is produced by the Evict call itself.

@ingvagabund (Contributor) commented Sep 20, 2020

@roobert are you still interested in implementing the feature?

@ingvagabund (Contributor)

/good-first-issue

@k8s-ci-robot (Contributor)

@ingvagabund:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the good first issue and help wanted labels Sep 20, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 19, 2020
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 21, 2020
@RyanDevlin (Contributor)

If no one is still working on this I can give it a shot.

@ingvagabund (Contributor)

Feel free to pick it up :)

@RyanDevlin (Contributor)

/assign

@RyanDevlin (Contributor)

@ingvagabund you had mentioned in #380 "Keeping descheduling-interval flag as it is since it's not related to a particular strategy but to the overall controller configuration".

Following this logic and the conversation above, should the solution to this only be implemented as part of the descheduler policy configuration? Or should it be both a command line flag and part of the policy?

@ingvagabund (Contributor)

The fewer command-line flags, the better. The problem with flags is that they are not versioned; the policy is.

@RyanDevlin (Contributor)

Good point, in that case I'll implement it as a policy configuration.

@RyanDevlin (Contributor) commented Mar 1, 2021

Should enabling evictSystemCriticalPods cause mirror pods, static pods, daemonsets, or pods without ownerRefs to be evicted? Currently, the IsEvictable() function hard-codes some of these checks into the descheduler.

Or are we to assume that the only use case for this feature is to evict pods that have a critical priority, but also have ownerRefs, are not part of a daemonset, and are not mirror/static pods?

@ingvagabund (Contributor) commented Mar 2, 2021

Every evictable pod has to be re-creatable, so a non-empty ownerRefs is a must. Static/mirror pods get scheduled to the same node (as they are stored locally on each node under the /etc/kubernetes/manifests directory), so there's no point evicting those. Also, in many cases these pods are core pods or pods supporting core pods. The same holds for pods owned by a daemon set, which will get scheduled back to the same node they were evicted from. None of these cases are worth descheduling; they are better suited to the priority and preemption policies enforced by the scheduler, which evicts lower-priority pods to create space for higher-priority pods. The descheduler's use case is to load balance pods to normalize resource usage among nodes (and some other cases which are not covered by priority and preemption policies).

> Should enabling evictSystemCriticalPods cause mirror pods, static pods, daemonsets, or pods without ownerRefs to be evicted?

So the answer is no.

@RyanDevlin (Contributor)

@ingvagabund I've completed the work on this feature and opened a pull request here: #522

This is my first pull request to Kubernetes, so I'm happy to change things if I've submitted something incorrectly.
