scheduling: Add new pod priority class #369
Conversation
cc @bbrowning I mentioned Knative here as per the discussion on Slack. Feel free to leave suggestions for any other use cases you would have, or to say if it was not clear enough. Thanks!
workloads; the problem with that is that users can create priority classes that would schedule their pods in favour of the OpenShift workloads.

[1]: https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class
Note: while the 4.4 docs do show the logging class (I linked to 3.11 as it was more true to what we have), this is not true out of the box on 4.6 at least; I did not verify earlier versions. https://docs.openshift.com/container-platform/4.4/nodes/pods/nodes-pods-priority.html#admin-guide-priority-preemption-priority-class_nodes-pods-priority
I am not sure if we should remove that once this is cleared up?
nits, looks great to me
cc @openshift/openshift-architects please take a look, thanks!
## Design Details

New priority class would be created by the component that creates the two existing classes.
i think this provides a nice default. if users require further differentiation among user-critical workloads, we could explore an operator exposing an override for the default; this is only needed if demand is strong enough.
a few additional questions:

- do we want to reserve a prefix like `openshift-` for priority class names that we control?
- do we want to restrict the set of namespaces this priority can be used in?

upstream we have support for quota by priority class: https://kubernetes.io/docs/concepts/policy/resource-quotas/#resource-quota-per-priorityclass

it's possible we could restrict usage of a priority class without explicit quota in order to prevent consumption; see: https://kubernetes.io/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default

it's worth enumerating some pros/cons of the above as part of this design. (both mechanisms are sketched below.)
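For reference, the per-PriorityClass quota from the first link looks roughly like this. A minimal sketch: the namespace name and the `pods` limit are illustrative, and `user-critical` is the class name proposed in this PR.

```yaml
# Sketch: a quota that only counts pods requesting the proposed class.
# Namespace and limits are illustrative, not part of the proposal.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: user-critical-quota
  namespace: example-namespace
spec:
  hard:
    pods: "10"
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["user-critical"]
```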
Thanks for the suggestions!

- Adding a prefix of `openshift-` to the `user-critical` class name would make it clear to users that this is reserved for user workloads that are managed by OpenShift, but it goes against the current pattern, as none of our existing classes have a reserved prefix. Changing that would require changing approximately 170+ instances of the existing class names. It might make sense here, though: since we would name it `user-critical`, the prefixed form is `openshift-user-critical`, so I am happy to go that route just for this one.
- It seems we do not limit the scope of namespaces for the two existing classes, but this should be an easier fix, and I think we should do it for all three (the two existing and the new proposed one; see the sketch after this comment). The pro is that users can't use and abuse this reserved class. The cons are that we might break things for users that already consume this class, and potentially for some of our Red Hat components that are not installed in openshift-* namespaces. Are we okay with doing that?

My vote would be to not change the existing classes but apply the above to the new class, as we do not know what it might impact. Sounds good to you?

Will add to the document once we agree on this.
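One concrete mechanism for the namespace restriction discussed above is the limit-priority-class-consumption admission configuration from the second link in the previous comment: pods requesting a listed priority class are rejected in any namespace that lacks a matching ResourceQuota. A minimal sketch, assuming the un-prefixed `user-critical` name still under discussion at this point:

```yaml
# Sketch: API-server admission configuration per the upstream
# "limit priority class consumption by default" docs. Pods that set
# priorityClassName: user-critical are rejected unless their namespace
# has a ResourceQuota whose scopeSelector covers that class.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["user-critical"]
```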
@lilic I am confused by your comment.
Kubernetes reserves the system-* prefix for Kubernetes usage (see: https://github.com/kubernetes/kubernetes/blob/5238e1c80fea02891e1804b346d24faa7c13da07/pkg/apis/scheduling/validation/validation.go#L37)
The question I was asking is if OpenShift should reserve a similar openshift-* prefix for names that are unique to the distribution. We obviously shouldn't change names that are reserved upstream.
I am inclined to reserve the openshift-* name prefix for use by OpenShift.
As discussed with Derek out of band, we think the best place would be openshift-apiserver or openshift-controller-manager.
was cluster-config-operator explicitly rejected in that conversation?
These are not OpenShift-API related, so neither openshift-apiserver nor openshift-controller-manager makes sense. I also lean towards cluster-config-operator.
Thanks for the reply! The problem is that not all environments install cluster-config-operator, so it would be a problem if a core component like monitoring uses this class but it's not there in some environments.
@lilic if you get stuck on this i suggest putting it in the CMO itself for now and re-homing it when someone else wants to use it.
@bparees Sounds great, will do, thanks!
introducing a third priority class: `user-critical`. This would be used by any pods that are important for user-facing OpenShift features but are not deemed system critical. Examples of such pods include user workload monitoring and a user's Knative Service.
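For illustration, the proposed class could be declared roughly as follows. This is only a sketch: the `value` is an assumption, chosen to sit below the built-in `system-cluster-critical` (2000000000) while still outranking unprioritized user pods, and neither the name nor the value is fixed by this PR.

```yaml
# Sketch of the proposed class; name and value are not final.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: user-critical     # "openshift-user-critical" is also discussed in this thread
value: 1000000000          # assumption: below system-cluster-critical (2000000000)
globalDefault: false
description: "For pods backing user-facing OpenShift features that are not system critical."
```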
I'd replace "user's Knative Service" here with "the OpenShift Serverless control and data planes". We don't want OpenShift Serverless control or data planes to be system-*-critical because it's an optional component and not critical to the core functioning of the cluster itself. But, we do want the OpenShift Serverless control and data plane pods to be a higher priority than user workloads because if OpenShift Serverless control or data plane pods get evicted then that degrades the functionality of all Knative workloads in the cluster.
As a general comment, I'd expect many optional operators will want to take advantage of this new priority. I don't think eviction is something widely tested today, but a number of optional operators (Service Mesh, Pipelines, CodeReady Workspaces, and so on) would end up in a bad place if certain operator pods got evicted before user workloads consuming features of those optional operators. |
@bbrowning Agreed, I do believe this is something that is missing; will add this to the proposal notes.
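For an optional operator opting in, the change amounts to a single field on the pod template. A minimal sketch in which the Deployment name, labels, and image are purely illustrative, and `openshift-user-critical` is the prefixed name agreed earlier in the thread:

```yaml
# Sketch: an optional operator's operand opting into the new class.
# All names and the image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-control-plane
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-control-plane
  template:
    metadata:
      labels:
        app: example-control-plane
    spec:
      priorityClassName: openshift-user-critical  # the class proposed in this PR
      containers:
      - name: controller
        image: registry.example.com/controller:latest
```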
ping @openshift/openshift-architects can we have an lgtm or are there open questions still?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting `/remove-lifecycle stale`. If this issue is safe to close now please do so with `/close`.
/lifecycle stale
@bparees thanks for the ping on Slack, I updated the PR; this should be ready for review and merge :) 🎉
/approve thanks for your persistence/patience on this @lilic :)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bparees.
Thank you Ben! 🎉
This proposal sums up the discussion we had on Slack about introducing a new priority class. As outlined in the proposal, the reason we would like to have this is to make sure cluster monitoring pods get scheduled in favour of user workload monitoring pods. With user workload monitoring possibly going GA in 4.6, it would be good to solve this in 4.6.