Permit descheduling of critical pods #378
Comments
Currently, the only way to evict system-critical pods is by adding an annotation. I think it's reasonable to add another flag for this.
@roobert and @lixiang233, this seems reasonable to me. @damemi and @ingvagabund, what do you think?
@roobert, by a critical pod do you mean a pod whose priority class is system-critical (2000000000)? Or one identified by IsCriticalPod (including mirror/static pods)?
I would like to avoid introducing a new flag. At the same time, it would be bothersome to have to set the option for every strategy. I don't see it as a blocking issue, though eventually it will be more practical to move all the relevant flags (node selector, evict with local storage) under a versioned config. If you need this really soon (as part of the 1.19 release), I am fine with going ahead and adding the flag, though there's a high chance the flag will get deprecated soon (e.g. in 1.20).
Or, you might put the new flag under DeschedulerPolicy. E.g.:
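For illustration only, a minimal sketch of what that might look like, assuming the option becomes a top-level policy field (the name `evictSystemCriticalPods` here is just a placeholder, not a settled name):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# Hypothetical top-level opt-in; critical pods would stay protected by default.
evictSystemCriticalPods: true
strategies:
  RemoveDuplicates:
    enabled: true
```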
The same could be done for the node selector and evict-local-storage-pods bits.
#380 for promoting some flags into v1alpha1 descheduler policy fields.
Hi @ingvagabund and all, thank you for the feedback! I've included the output from … below.
So we can see: … So I think I need to configure the priorityClass and also use the … option. I have no preference for how this functionality is provided, so I am happy to comply with any suggestions; however, if this type of option is already provided (…). I've created a PR which essentially copies the … flag.
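The command output referenced above is not preserved here. Purely as an illustration (not the reporter's actual output), the relevant fields of a typical kube-dns pod look roughly like this, which is why it falls under the critical-pod protection:

```yaml
# Illustrative fragment only, assuming a standard GKE kube-dns pod:
spec:
  priorityClassName: system-cluster-critical
  priority: 2000000000   # system-level priority, above any user-defined PriorityClass
```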
As long as both PRs are merged in the same release. Based on the pod manifest, I see you are running k8s v1.15.8. Do you need this merged soon so it lands in 1.19, or would 1.20 be acceptable?
@ingvagabund - no rush from my perspective!
This is a reasonable use case, but as mentioned, it could be dangerous to cluster stability if it's not carefully implemented and the effects aren't made clear to users: this will evict kube-system pods (really the only reason this is needed). Given that risk and the fact that @roobert says there's no rush, I think it would be best to get this in for 1.20 (especially since we are trying to release 1.19 sometime next week). This way we can make sure there is thorough review and get input from more parties.
Hey, I've now had time to test this locally and found that another change is necessary for this to work; specifically, the maximum … I'm not sure what the preferred way to work around this would be. Should we increase the maximum …?
If you are allowed to evict a critical pod, you are allowed to evict any pod. So setting …
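A rough sketch of the point being made, assuming the priority cutoff were exposed as a per-strategy `thresholdPriority` parameter (the exact knob and its cap in the version under discussion may differ):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  RemoveDuplicates:
    enabled: true
    params:
      # User-defined PriorityClasses top out at 1000000000, while system-critical
      # pods sit at 2000000000 and above. A threshold high enough to include
      # critical pods therefore includes every pod, which is why a dedicated
      # opt-in is preferable to raising the maximum allowed threshold.
      thresholdPriority: 2000000000
```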
@ingvagabund - is that right? I'll rerun some tests to double check my original findings!
I am talking about the future implementation as the expected solution. Unless the error you are describing is produced by the …
@roobert are you still interested in implementing the feature?
/good-first-issue
@ingvagabund: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
If no one is still working on this, I can give it a shot.
Feel free to pick it up :)
/assign
@ingvagabund you had mentioned in #380: "Keeping …". Following this logic and the conversation above, should the solution to this only be implemented as part of the descheduler policy configuration? Or should it be both a command line flag and part of the policy?
The fewer command line flags, the better. The problem with flags is that they are not versioned; the policy is.
Good point; in that case I'll implement it as a policy configuration.
Should enabling this also allow eviction of critical pods that are mirror/static pods, are part of a daemonset, or have no ownerRefs? Or are we to assume that the only use case for this feature is to evict pods that have a critical priority, but also have ownerRefs, are not part of a daemonset, and are not mirror/static pods?
Every evictable pod has to be also re-creatable, thus non-empty ownerRefs are still required.
So the answer is no. |
@ingvagabund I've completed the work on this feature and opened a pull request here: #522. This is my first pull request to Kubernetes, so I'm happy to change things if I've submitted something incorrectly.
We scale down our dev and test GKE clusters to 0 nodes for roughly 8 hours a day; over a month this amounts to about a 50% cost saving.
The problem is that when we scale the clusters back up, kube-dns often gets multiple instances scheduled to the same node, which can prevent our other services from being scheduled properly. This is because we use the smallest possible nodes, again for cost-saving reasons.
This is a known issue (kubernetes/kubernetes#52193) that was at one point solved by adding antiAffinity rules; however, the rules caused performance issues and were eventually removed.
Since the changes have been removed from kubernetes core and we can't modify the configuration of kube-system pods in GKE ourselves, the only other option seems to be to use the descheduler. However, the descheduler docs, under "Pod Evictions", say that critical pods can never be evicted: https://github.com/kubernetes-sigs/descheduler#pod-evictions
I can understand why this is a sensible default, but I was wondering if the project would be open to a patch to make this configurable?