KEP: Pod Scheduling Readiness #3522
Conversation
This is really useful! A few comments/clarifications.
As an advanced scheduler developer, I want to compose a series of scheduler PreEnqueue plugins to guard the schedulability of my Pods.
Can we list other real cases that would also benefit from this extension point?
We propose a new field `.spec.schedulePaused` to the Pod API. The field is defaulted to `false`.
I was thinking (and I suggested it to @ahg-g offline just now) that it might be worth making this a list of strings, in a similar fashion to finalizers.
Once the list is empty, the pod is allowed to be scheduled.
This allows multiple controllers/webhooks to decide whether a pod can be scheduled based on different criteria.
It might be useful to have an endpoint just for this field (and disallow editing via general patch/update) so that there are fewer chances that a controller accidentally removes all the strings. Or maybe we can just have a simple update validation: only one string can be removed at a time.
Although, for finalizers we don't do that. The only related thing is an RBAC rule that allows controllers to just modify finalizers but nothing else.
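To make the suggestion concrete, here is a minimal sketch of the finalizers-like shape; the field name `SchedulingReadinessGates` is illustrative, not a settled API name:

```go
package api

// Sketch of the finalizers-like shape suggested above; the field name is
// illustrative, not a settled API name.
type PodSpec struct {
	// SchedulingReadinessGates must be empty before the scheduler will
	// consider the Pod. Each controller removes only its own entry, the
	// same pattern finalizers use to gate deletion.
	SchedulingReadinessGates []string
}
```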
I think it is a good idea, and I am wondering if we have any "regrets" related to the finalizers pattern that we need to consider when evaluating this API format, @liggitt @smarterclayton any thoughts?
@pwschuurman this should help simplify #3336
Thanks @alculquicondor for this idea. This looks more extensible, and we may name it `ScheduleReadinessGates []string`. I'm interested to learn from API experts if there are any traps in adopting this finalizers-like API pattern.
It's great: then each controller can have its own enqueue plugin if it needs one.
We could also have creation default the pod's `PodScheduled` condition to false with reason `WaitingForGate`.
That's fair. Just to clarify: the condition "PodScheduled=WaitingForGate" is applied when the Pod goes through the lightweight scheduling process, rather than being defaulted by `pkg/apis/core/<something>`.
No, we can default it at pod creation.
It seems we don't have a precedent for defaulting a condition? (I checked pkg/registry/core/pod/strategy.go.) But if API folks are fine with the pattern, it at least saves the scheduler some effort.
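As a sketch of what the creation-time defaulting being discussed might look like, assuming the gates field lands as `pod.Spec.SchedulingGates`; the reason string `WaitingForGate` is the one floated in this thread, not necessarily final:

```go
package defaulting

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultPodScheduledCondition seeds the PodScheduled condition at pod
// creation when the pod carries scheduling gates. Illustrative only.
func defaultPodScheduledCondition(pod *v1.Pod) {
	if len(pod.Spec.SchedulingGates) == 0 {
		return
	}
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "WaitingForGate", // reason name from this thread, not final
		LastTransitionTime: metav1.Now(),
	})
}
```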
We're adding a Kube feature, and defaulting can do many things; while it's novel, it's certainly worth trying. If we come up with a reason to remove it before beta, that's low cost, and we do have precedent for defaulting status (namespace status phase defaulting). I think it's a nice usability win, and we can record "will look for implications of novel condition defaulting for GA". We may only be able to do the defaulting during creation (we can't do it during update), so that's a possible wrinkle; we could simply list it as a run-at for mitigation.
I think it's a nice usability win
+1. Already added it in the monitoring section.
I have a use case as described in #3335, which could alternatively be achieved through this KEP. A user wants to migrate a StatefulSet across Kubernetes clusters (or namespaces). Since a StatefulSet object is a namespaced resource, in order to maintain at-most-one semantics during migration, if a pod is migrated to the new namespace, it shouldn't be able to start in the old namespace. In #3335, the approach is to modify the StatefulSet API to achieve this: an orchestrator that was aware of resources cross-cluster or cross-namespace would be responsible for manipulating which pods should be available to each StatefulSet. With this KEP, however, this behavior could be achieved by having the orchestrator set the proposed `schedulePaused` field.
@pwschuurman thanks for sharing the use case. In your migration flow, do you need to manipulate the old, already-scheduled StatefulSet to update its scheduledPaused to true? Or do you just create the new StatefulSet with scheduledPaused=true and migrate one by one?
For the migration use case, I think the new StatefulSet can be created with scheduledPaused=true. Only the new pods go through the true->false transition for scheduledPaused. For the old, already-scheduled StatefulSet, the […]. To support rollback (reverse migration), though, pods will need to go through a false->true transition. As discussed in the KEP, there isn't a strong use case to support direct modification from false->true on an existing pod. To support this scenario for our use case, though, the pod can be deleted (and […]).
Good to know this scenario.
SG, this fits the current consensus (#3522 (comment)) to prevent the false->true transition when updating an existing Pod (you have to recreate the pod).
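A sketch of the orchestrator side of this migration flow, assuming the list-of-gates API shape the thread later converges on rather than the boolean; the gate name is hypothetical:

```go
package migration

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// releasePod drops the orchestrator's scheduling gate from a pod once the
// replica is confirmed gone from the old cluster/namespace, making the new
// pod schedulable. Illustrative only.
func releasePod(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	const gate = "example.com/statefulset-migration" // hypothetical gate name

	pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gate {
			kept = append(kept, g) // keep every other controller's gate intact
		}
	}
	pod.Spec.SchedulingGates = kept
	_, err = cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```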
Force-pushed from 949c54e to 27398cd.
Force-pushed from 77019cf to 4f01ee3.
@ahg-g @smarterclayton Updated the doc based on the comments.
// This option will need to be accompanied by a yet-to-be-implemented fine-grained
// permission (https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)
// and be consistent with how finalizers/liens work today.
SchedulingGates []string
@smarterclayton it seems that now you are leaning towards this option vs. a flag, correct? If so, we also need to discuss the format; we now have two:
- similar to finalizers: actors add to the list and later remove from the same list
- similar to readinessGates: actors add to the list in the spec, and later add to a parallel list in status

Regarding option 2, do we have concerns that it will require controllers to have update permissions on pod status? I guess not, if we allow updating the list only through a dedicated endpoint, since we want to enable fine-grained permissions.
I like option 2 because it preserves the history of who decided that a pod is ready, which is useful for debugging and tracking feature usage.
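For concreteness, a sketch of the two shapes under discussion; all type and field names here are illustrative:

```go
package api

// Option 1, finalizers-like: actors add entries at creation, and each later
// removes its own entry to unblock scheduling.
type PodSpecOption1 struct {
	SchedulingGates []string
}

// Option 2, readinessGates-like: the spec lists the gates, and each actor
// later reports readiness via a matching condition in pod status, which
// preserves a record of who signed off.
type PodSpecOption2 struct {
	SchedulingGates []PodSchedulingGateOption2
}

type PodSchedulingGateOption2 struct {
	// ConditionType names the pod condition that must become True before
	// the pod may be scheduled.
	ConditionType string
}
```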
Personally, I'm in favor of readinessGates, as (a) we have pod readiness gates as a precedent, and (b) each controller doesn't need to worry about updating the same API field, which keeps factors like ordering out of the picture. However, two aspects are still unresolved (#3522 (comment)):
- Permission. It's still unclear to me. The pod readiness gates KEP didn't mention permission, so I really hope @smarterclayton and @liggitt can shed some light: if we think permission is mandatory here, we have to get it right from day 1.
- Restrictive condition transition. I suppose it's doable to enforce that the condition transitions only from NotReady to Ready, but I'm reposting here for confirmation.
I need to read the arguments around pod readiness gates and the whole status thing. Scheduling happens before ownership is transitioned to a node, so I would probably argue that, like finalizers, all mutations should happen within spec (I view metadata as generic spec in this scenario).
Anyone who can set spec.nodeName is already as powerful as someone who can set the proposed scheduling gates. OpenShift and some others had admission checks that limited who could mutate that field to effectively cluster admins (roughly, mutating that field required an RBAC check: you had to have the binding subresource permission of a scheduler).
Subresource permission is supported today, so adding a subresource is the easiest way, but we'd probably have to special-case what characteristic carries the permission (i.e. a check on `readiness/<name>`); still, it's not terribly difficult today to make the subresource `pods/readiness/<name>`. The more we start bulk-updating gates (for instance, if the scheduler implements a set of gates via plugins and we want to bulk submit, vs. letting individual actors coordinate), the closer we get to just performing field-level checks in an admission handler.
How would we model multiple gates coordinating in a single scheduler? I would very much dislike having multiple individual writes to the object, but anything that happens in a single process and is fairly quick should probably just be a single gate (even if it's implemented as a composite of steps inside the plugin).
Ultimately subresources work well when the use case is narrow and you don't need to subdivide clients excessively - this is starting to cross that line for binding if we expect lots of gates from a single client being set in a row.
The more we start bulk updating gates (for instance if the scheduler implements a set of gates via plugins and we want to bulk submit, vs letting individual actors coordinate) the more we get closer to just performing field level checks in an admission handler.
I don't think the scheduler itself will have plugins that update gates, the scheduler will only respect those gates (i.e., if they are present, it will simply not attempt to schedule the pod).
The gates will be updated by external controllers, like a queueing controller or a statefulset migration controller, those by design will likely be controllers that don't share memory and run as separate processes. I think the proper design pattern should be that a controller only handles/updates a single gate.
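A sketch of the scheduler-side behavior described here: the scheduler never manages gates, it only refuses to enqueue a pod while any gate remains. The method shape follows the PreEnqueue extension point this KEP proposes; the details are illustrative, not the shipped plugin:

```go
package gating

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// SchedulingGatesPlugin blocks gated pods at the scheduling-queue entrance.
type SchedulingGatesPlugin struct{}

func (pl *SchedulingGatesPlugin) Name() string { return "SchedulingGates" }

// PreEnqueue admits the pod into the scheduling queue only when no gates
// remain on its spec.
func (pl *SchedulingGatesPlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if len(p.Spec.SchedulingGates) == 0 {
		return nil // no gates: the pod may enter the scheduling queue
	}
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		fmt.Sprintf("waiting on %d scheduling gate(s)", len(p.Spec.SchedulingGates)))
}
```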
No matter whether the field is a single boolean or a list of strings, a Pod's creation is the only chance to set the complete gating condition(s). For a single boolean, `true` represents the condition; for a list of strings, it's a highly customized list. After a Pod's creation, only "semantical subtraction" can be allowed: for the boolean, it's true -> false; for a list of strings, it's "delete a particular element from the list" (it just behaves differently for a finalizer-like list vs. a readiness-gates-like list).
I would very much dislike having multiple individual writes to the object
For the design of a list of strings, both options (finalizer-like and readiness-gate-like) will inevitably yield N individual writes to a Pod: each controller needs to issue an API call to relax its readiness gate anyway (either by deleting one entry from the finalizer-like list, or by updating its particular .status.conditions[x] to ready/true).
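A sketch of the "semantical subtraction" rule above for the list-of-strings shape: an update may only remove entries, never add them. This is illustrative validation logic, not the code the KEP ships:

```go
package validation

import "fmt"

// validateGateUpdate rejects any update that introduces a gate absent from
// the old pod spec, so gates can only shrink after creation.
func validateGateUpdate(oldGates, newGates []string) error {
	old := make(map[string]bool, len(oldGates))
	for _, g := range oldGates {
		old[g] = true
	}
	for _, g := range newGates {
		if !old[g] {
			return fmt.Errorf("scheduling gate %q cannot be added after pod creation", g)
		}
	}
	return nil
}
```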
Hrm. In essence, we have a spectrum:
- Initializers / finalizers: `[]string`, removal only, additional data stored on the object or at worst in meta, extremely unlikely we add supporting data
- schedulingGates: removal only, additional data stored on pod spec, potentially other data in the future but unlikely, users unlikely to need the status of each gate, integrators likely able to edit spec
- readinessGates: `[]struct` + conditions, readiness is inherently reactionary and two-way, gates say they are one-way but are not (and probably can't safely change conditions), I think it is possible to lose all node state in such a way that gates have to be re-executed (but we don't clearly define that anywhere), integrators SHOULD NOT be able to edit spec, users want to see the status of each gate

I think the argument I'm getting at is that nothing requires conditions, so why are they part of the flow? Consistency is one argument, but I think the use case is closer to initializers / finalizers, and since we don't have a need, I'd argue YAGNI. If we do get feedback that users want a lot of condition state from gates, we could reassess in beta (the structure wouldn't change for our spec).
Thoughts?
I agree with option 2, and I think even now integrators are free to add conditions to the pod if they wish to offer more status information, aren't they? So there is no need to tie that to our feature.
Right, good argument.
Option 2 should work for most cases. If some case arises, like re-scheduling a pod without deleting it, the shape is also extensible / backward-compatible enough to add fields.
updated.
Force-pushed from 089dc31 to db63c8c.
Force-pushed from c82765e to e3b7d3e.
@smarterclayton @Huang-Wei I guess all unresolved issues are now addressed; I think we settled on the API and the overall semantics. The deadline is in a couple of days :)
@ahg-g yup, I think all comments have been resolved. Once @smarterclayton and @wojtek-t give a green light, I will squash the commits.
Doing a final pass now.
or not. This enables scheduler plugin developers to try out the feature even in Alpha, by crafting different enqueue plugins and wiring them up with custom fields or conditions.

- **New column in kubectl:** To provide better UX, we're going to add a column "scheduling paused"
We already have a lot of columns; I just expected this to show up in the phase column (where we summarize the state of the pod). I don't want to add another column unless we have a great reason, and this column would be empty for 99% of pods for 99.999% of their lifecycle. Can we update the text?
that's fair. Updated.
type PodSchedulingGate struct {
	// Name of the scheduling gate.
	Name string
Clarify here that name must be unique among gates? You say map earlier, but we don't clearly describe here that this must be unique among other gates.
Updated.
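For reference, a sketch of the shape this exchange converges on, with the uniqueness requirement spelled out in the field comment. The `+listType=map` / `+listMapKey=name` markers are the usual way the API machinery enforces map-like uniqueness on the key field; exact wording is the KEP's to settle:

```go
package api

type PodSpec struct {
	// SchedulingGates, if non-empty, blocks scheduling of the Pod until
	// every gate has been removed.
	// +listType=map
	// +listMapKey=name
	SchedulingGates []PodSchedulingGate
}

type PodSchedulingGate struct {
	// Name of the scheduling gate. Must be unique among the Pod's gates.
	Name string
}
```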
@Huang-Wei - the PRR will require more for Beta, but it's good enough for Alpha.
/approve PRR
/label tide/merge-method-squash
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Huang-Wei, wojtek-t.
@ahg-g @smarterclayton - please LGTM it once you're ready.
[fine-grained permissions](https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4))
- Gather feedback from developers and out-of-tree plugins.
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- Identify whether gates can be added post-creation
updated.
- Feature disabled by default.
- Unit and integration tests completed and passed.
- API strategy test (`pkg/registry/*`) to verify disabled fields when the feature gate is on/off.
- Additional tests are in Testgrid and linked in KEP
- Determine whether any additional state is required per gate
updated.
/lgtm
In case there are any last comments.
lgtm - excellent minimal proposal which can unlock extensibility and composition for schedulers and workloads.
+1
/hold cancel
Thanks to all reviewers for your dedicated time and valuable comments! Without them, I couldn't have made this KEP go through. I particularly appreciate the primary reviewers, @ahg-g and @smarterclayton, for your continuous, insightful comments that helped shape the final KEP.
* First version of KEP: Pod Schedule Readiness
* address comments including:
  - reword `schedulePaused` to `schedulingPaused`
  - support set spec.nodeName without clearing schedulingPaused
  - add `schedulingGates` into UNRESOLVED
  - add a story to support needs of array instead of a single boolean
  - update the version skew section
  - change the status from `provisional` to `implementable`
* fixup: rewordings
* fixup: major update
* minor: polish wordings
* address PRR comments
* address PRR comments (cont.)
* some rewordings
* address comments
/sig scheduling