
Permit descheduling of critical pods #378

Closed
roobert opened this issue Aug 18, 2020 · 28 comments · Fixed by #523
Labels: good first issue · help wanted · kind/feature

@roobert commented Aug 18, 2020

We scale our dev and test GKE clusters down to 0 nodes for roughly 8 hours a day; over a month this works out to roughly a 50% cost saving.

The problem we have is that when we scale the clusters back up, kube-dns often gets multiple instances scheduled onto the same node, which can prevent our other services from being scheduled properly. This is because we use the smallest possible nodes, again for cost-saving reasons.

This is a known issue (kubernetes/kubernetes#52193) that was at one point solved by adding antiAffinity rules; however, the rules caused performance issues and were eventually removed.

Since the changes have been removed from Kubernetes core and we can't modify the configuration of kube-system pods in GKE ourselves, the only other option seems to be to use the descheduler. However, the descheduler docs say under "Pod Evictions" that critical pods can never be evicted: https://github.com/kubernetes-sigs/descheduler#pod-evictions

I understand why this is a sensible default, but I was wondering whether the project would be open to a patch that makes this configurable?

@roobert added the kind/feature label Aug 18, 2020
@lixiang233 (Contributor)

Currently, the only way to evict system-critical pods is to add the descheduler.alpha.kubernetes.io/evict annotation to the pods you want evicted. Since, as you mentioned, you can't modify the configuration of kube-system pods in GKE yourselves, I'm not sure whether this is helpful.
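For illustration, opting a specific pod in for eviction with that annotation might look roughly like the fragment below (a minimal sketch; the pod name is hypothetical, and GKE's addon manager may revert edits to managed kube-system workloads):

apiVersion: v1
kind: Pod
metadata:
  name: kube-dns-example            # hypothetical pod name
  namespace: kube-system
  annotations:
    # the descheduler checks only for the presence of this annotation;
    # "true" is just a conventional value
    descheduler.alpha.kubernetes.io/evict: "true"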

I think it's reasonable to add another flag, evict-system-critical-pods, to the descheduler, just like evict-local-storage-pods. WDYT @damemi @ingvagabund @seanmalloy?

@seanmalloy (Member)

@roobert and @lixiang233 this seems reasonable to me. @damemi and @ingvagabund, what do you think?

@ingvagabund (Contributor) commented Aug 19, 2020

@roobert by a critical pod, do you mean a pod whose priority class is system-critical (2000000000)? Or one identified by IsCriticalPod (which includes mirror/static pods)?

@ingvagabund (Contributor)

I would like to avoid introducing a new flag. At the same time, it would be bothersome to have to set the option for every strategy. I don't see it as a blocking issue, though eventually it will be more practical to move all the relevant flags (node selector, evict with local storage) under a versioned config. If you need this really soon (as part of the 1.19 release), I am fine with going ahead and adding the flag, though there's a high chance the flag will get deprecated soon (e.g. in 1.20).

@ingvagabund (Contributor)

Or, you might put the new flag under DeschedulerPolicy. E.g.:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictSystemCriticalPods: true  # <-- the new field here
strategies:
  "RemoveDuplicates":
     enabled: true
     params:
       removeDuplicates:
         excludeOwnerKinds:
         - "ReplicaSet"

The same could be done for the node selector and evict-local-storage-pods bits.
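A rough sketch of how those might look as policy fields (field names here are illustrative and may differ from what #380 actually introduces):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
nodeSelector: "node-role.kubernetes.io/worker="   # illustrative replacement for --node-selector
evictLocalStoragePods: true                       # illustrative replacement for --evict-local-storage-pods
evictSystemCriticalPods: true                     # the new field proposed above
strategies:
  "RemoveDuplicates":
    enabled: true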

@ingvagabund (Contributor)

#380 for promoting some flags into v1alpha1 descheduler policy fields

@roobert (Author) commented Aug 19, 2020

> @roobert by a critical pod, do you mean a pod whose priority class is system-critical (2000000000)? Or one identified by IsCriticalPod (which includes mirror/static pods)?

Hi @ingvagabund and all, thank you for the feedback!

I've included the output from kubectl get pods --namespace=kube-system kube-dns-XXX --output yaml (GKE: 1.15.12-gke.2 - this is the "stable" track in GCP):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
    seccomp.security.alpha.kubernetes.io/pod: docker/default
  creationTimestamp: "2020-08-18T23:00:25Z"
  generateName: kube-dns-5c446b66bd-
  labels:
    k8s-app: kube-dns
    pod-template-hash: 5c446b66bd
  name: kube-dns-5c446b66bd-75x45
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kube-dns-5c446b66bd
    uid: 71773671-3d53-4293-a7fe-b139a2e5270f
  resourceVersion: "1054522"
  selfLink: /api/v1/namespaces/kube-system/pods/kube-dns-5c446b66bd-75x45
  uid: 86c94bba-af29-4c75-a3ae-22aecfa1f573
spec:
  containers:
  - args:
    - --domain=cluster.local.
    - --dns-port=10053
    - --config-dir=/kube-dns-config
    - --v=2
    env:
    - name: PROMETHEUS_PORT
      value: "10055"
    image: k8s.gcr.io/k8s-dns-kube-dns-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthcheck/kubedns
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: kubedns
    ports:
    - containerPort: 10053
      name: dns-local
      protocol: UDP
    - containerPort: 10053
      name: dns-tcp-local
      protocol: TCP
    - containerPort: 10055
      name: metrics
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /readiness
        port: 8081
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    resources:
      limits:
        memory: 170Mi
      requests:
        cpu: 100m
        memory: 70Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /kube-dns-config
      name: kube-dns-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - args:
    - -v=2
    - -logtostderr
    - -configDir=/etc/k8s/dns/dnsmasq-nanny
    - -restartDnsmasq=true
    - --
    - -k
    - --cache-size=1000
    - --no-negcache
    - --dns-forward-max=1500
    - --log-facility=-
    - --server=/cluster.local/127.0.0.1#10053
    - --server=/in-addr.arpa/127.0.0.1#10053
    - --server=/ip6.arpa/127.0.0.1#10053
    image: k8s.gcr.io/k8s-dns-dnsmasq-nanny-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthcheck/dnsmasq
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: dnsmasq
    ports:
    - containerPort: 53
      name: dns
      protocol: UDP
    - containerPort: 53
      name: dns-tcp
      protocol: TCP
    resources:
      requests:
        cpu: 150m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/k8s/dns/dnsmasq-nanny
      name: kube-dns-config
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - args:
    - --v=2
    - --logtostderr
    - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,SRV
    - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,SRV
    image: k8s.gcr.io/k8s-dns-sidecar-amd64:1.15.8
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /metrics
        port: 10054
        scheme: HTTP
      initialDelaySeconds: 60
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    name: sidecar
    ports:
    - containerPort: 10054
      name: metrics
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  - command:
    - /monitor
    - --source=kubedns:http://localhost:10054?whitelisted=probe_kubedns_latency_ms,probe_kubedns_errors,dnsmasq_misses,dnsmasq_hits
    - --stackdriver-prefix=container.googleapis.com/internal/addons
    - --api-override=https://monitoring.googleapis.com/
    - --pod-id=$(POD_NAME)
    - --namespace-id=$(POD_NAMESPACE)
    - --v=2
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: k8s.gcr.io/prometheus-to-sd:v0.4.2
    imagePullPolicy: IfNotPresent
    name: prometheus-to-sd
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-dns-token-jhn8f
      readOnly: true
  dnsPolicy: Default
  enableServiceLinks: true
  nodeName: gke-cluster0-nodepool0-fc744471-ppp6
  priority: 2000000000
  priorityClassName: system-cluster-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: kube-dns
  serviceAccountName: kube-dns
  terminationGracePeriodSeconds: 30
  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: components.gke.io/gke-managed-components
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
---8<--snip--8<---

So we can see:
metadata.annotations.scheduler.alpha.kubernetes.io/critical-pod = ""
spec.priority = 2000000000
spec.priorityClassName = system-cluster-critical

So I think I need to configure the priorityClass and also use the evict-system-critical-pods option I've started adding in #379?

I have no preference for how this functionality is provided, so I am happy to go along with any suggestions. However, if this type of option is already provided (evict-local-storage-pods), then maybe it'd be OK to merge my PR and treat changing this group of options as a separate PR?

I've created a PR which essentially copies the evict-local-storage-pods pattern; if you're happy for us to proceed down this path, I'll add some tests.

@ingvagabund (Contributor)

> I have no preference for how this functionality is provided, so I am happy to go along with any suggestions. However, if this type of option is already provided (evict-local-storage-pods), then maybe it'd be OK to merge my PR and treat changing this group of options as a separate PR?

As long as both PRs are merged in the same release.

Based on the pod manifest I see you are running k8s v1.15.8. Do you need this merged soon so it can land in 1.19, or would 1.20 be acceptable?

@roobert (Author) commented Aug 19, 2020

@ingvagabund - no rush from my perspective!

@damemi (Contributor) commented Aug 20, 2020

This is a reasonable use case, but as was mentioned, it could be dangerous to cluster stability if it's not carefully implemented and the effects aren't made clear to users: this will evict kube-system pods (which is really the only reason this is needed).

Given that risk and the fact that @roobert says there's no rush, I think it would be best to get this in for 1.20 (especially since we are trying to release 1.19 sometime next week). This way we can make sure there is thorough review and get input from more parties.

@roobert (Author) commented Aug 24, 2020

Hey, I've now had time to test this locally and found that another change is necessary for this to work. Specifically, the maximum thresholdPriority value is 2000000000, which is the same as the kube-dns pod's priority, so the pod always gets marked as non-evictable.

I'm not sure what the preferred way to work around this would be. Should we increase the maximum thresholdPriority value to 2000000001 or similar? Or would a flag to disable thresholdPriority checking be better? I tried setting thresholdPriorityClass to some fake class as a workaround, but that failed for another reason.
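To make the conflict concrete, here's a minimal policy sketch (assuming the per-strategy thresholdPriority parameter): as far as I can tell, a pod is only evictable when its priority is strictly lower than the threshold, so a kube-dns pod at priority 2000000000 can never fall below the maximum allowed threshold of 2000000000.

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      # 2000000000 is both the kube-dns pod's priority and the maximum
      # accepted threshold, so "priority < threshold" is never true for it
      thresholdPriority: 2000000000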

@ingvagabund (Contributor)

If you are allowed to evict a critical pod, you are allowed to evict any pod. So setting --evict-system-critical-pods will effectively disable the check against any priority threshold.
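In other words, a policy along these lines (using the evictSystemCriticalPods field name suggested earlier; the final shape may differ) would simply bypass any configured priority threshold:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictSystemCriticalPods: true        # proposed: priority-threshold checks are skipped entirely
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      thresholdPriority: 2000000000  # would no longer block eviction once the option above is set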

@roobert (Author) commented Aug 24, 2020

@ingvagabund - is that right? I'll rerun some tests to double check my original findings!

@ingvagabund (Contributor)

> @ingvagabund - is that right? I'll rerun some tests to double check my original findings!

I am talking about the future implementation as the expected solution, unless the error you are describing is produced by the Evict call itself.

@ingvagabund (Contributor) commented Sep 20, 2020

@roobert are you still interested in implementing the feature?

@ingvagabund (Contributor)

/good-first-issue

@k8s-ci-robot (Contributor)

@ingvagabund:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to this:

/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the good first issue and help wanted labels Sep 20, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 19, 2020
@seanmalloy (Member)

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 21, 2020
@RyanDevlin (Contributor)

If no one is still working on this I can give it a shot.

@ingvagabund (Contributor)

Feel free to pick it up :)

@RyanDevlin (Contributor)

/assign

@RyanDevlin (Contributor)

@ingvagabund you had mentioned in #380 "Keeping descheduling-interval flag as it is since it's not related to a particular strategy but to the overall controller configuration".

Following this logic and the conversation above, should the solution to this only be implemented as part of the descheduler policy configuration? Or should it be both a command line flag and part of the policy?

@ingvagabund (Contributor)

The fewer command-line flags, the better. The problem with flags is that they are not versioned; the policy is.

@RyanDevlin (Contributor)

Good point, in that case I'll implement it as a policy configuration.

@RyanDevlin (Contributor) commented Mar 1, 2021

Should enabling evictSystemCriticalPods cause mirror pods, static pods, daemonsets, or pods without ownerRefs to be evicted? Currently, the IsEvictable() function hard-codes some of these checks into the descheduler.

Or are we to assume that the only use case for this feature is to evict pods that have a critical priority, but also have ownerRefs, are not part of a daemonset, and are not mirror/static pods?

@ingvagabund (Contributor) commented Mar 2, 2021

Every evictable pod has to be re-creatable, so a non-empty ownerRefs is a must. Static/mirror pods get scheduled to the same node (as they are stored locally on each node under the /etc/kubernetes/manifests directory), so there's no point evicting those. Also, in many cases these pods are core pods or pods supporting core pods. The same holds for pods owned by a daemon set, which will get scheduled back to the same node they were evicted from. None of these cases are worth descheduling; they are better suited to the priority and preemption policies enforced by the scheduler, which evicts lower-priority pods to create space for higher-priority pods. The descheduler's use case is to load balance pods to normalize resource usage among nodes (and some other cases which are not covered by priority and preemption policies).

> Should enabling evictSystemCriticalPods cause mirror pods, static pods, daemonsets, or pods without ownerRefs to be evicted?

So the answer is no.

@RyanDevlin (Contributor)

@ingvagabund I've completed the work on this feature and opened a pull request here: #522

This is my first pull request to Kubernetes, so I'm happy to change things if I've submitted something incorrectly.
