KEP: Pod Scheduling Readiness #3522
Conversation
This is really useful! A few comments/clarifications.
As an advanced scheduler developer, I want to compose a series of scheduler PreEnqueue plugins to guard the schedulability of my Pods.
Can we list other real cases that would also benefit from this extension point?
We propose a new field `.spec.schedulePaused` to the Pod API. The field is defaulted to `false`.
I was thinking (and I suggested it to @ahg-g offline just now) that it might be worth making this a list of strings, in a similar fashion to finalizers.
Once the list is empty, the pod is allowed to be scheduled.
This allows multiple controllers/webhooks to decide whether a pod can be scheduled based on different criteria.
It might be useful to have an endpoint just for this field (and disallow editing via general patch/update) so that there are fewer chances that a controller accidentally removes all the strings. Or maybe we can just have a simple update validation: only one string can be removed at a time.
Although, for finalizers we don't do that. The only related thing is an RBAC rule that allows controllers to just modify finalizers but nothing else.
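To make the suggestion concrete, here is a minimal sketch of the finalizers-like shape; the field name `SchedulingReadinessGates` is illustrative, not a settled API name:

```go
package api

// Sketch of the finalizers-like shape suggested above; the field name is
// illustrative, not a settled API name.
type PodSpec struct {
	// SchedulingReadinessGates must be empty before the scheduler will
	// consider the Pod. Each controller removes only its own entry, the
	// same pattern finalizers use to gate deletion.
	SchedulingReadinessGates []string
}
```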
I think it is a good idea, and I am wondering if we have any "regrets" related to the finalizers pattern that we need to consider when evaluating this API format, @liggitt @smarterclayton any thoughts?
@pwschuurman this should help simplify #3336
Thanks @alculquicondor for this idea. This looks more extensible, and we may name it `ScheduleReadinessGates []string`. I'm interested to learn from API experts if there are any traps in adopting this finalizers-like API pattern.
It's great: then each controller can have its own enqueue plugin if it needs one.
We could also have creation default the pod's `PodScheduled` condition to false with reason `WaitingForGate`.
That's fair. Just to clarify: the condition "PodScheduled=WaitingForGate" is applied when the Pod goes through the lightweight scheduling process, rather than being defaulted by `pkg/apis/core/<something>`.
No, we can default it at pod creation.
It seems we don't have a precedent for defaulting a condition? (I checked pkg/registry/core/pod/strategy.go.) But if API folks are fine with the pattern, it at least saves the scheduler some effort.
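As a sketch of what the creation-time defaulting being discussed might look like, assuming the gates field lands as `pod.Spec.SchedulingGates`; the reason string `WaitingForGate` is the one floated in this thread, not necessarily final:

```go
package defaulting

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultPodScheduledCondition seeds the PodScheduled condition at pod
// creation when the pod carries scheduling gates. Illustrative only.
func defaultPodScheduledCondition(pod *v1.Pod) {
	if len(pod.Spec.SchedulingGates) == 0 {
		return
	}
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "WaitingForGate", // reason name from this thread, not final
		LastTransitionTime: metav1.Now(),
	})
}
```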
We're adding a Kube feature, and defaulting can do many things; while it's novel, it's certainly worth trying. If we come up with a reason to remove it before beta, that's low cost, and we do have precedent for defaulting status (namespace status phase defaulting). I think it's a nice usability win, and we can record "will look for implications of novel condition defaulting for GA". We may only be able to do the defaulting during creation (we can't do it during update), so that's a possible wrinkle; we could simply list it as a run-at for mitigation.
I think it's a nice usability win
+1. Already added it in the monitoring section.
I have a use case as described in #3335, which could alternatively be achieved through this KEP. A user wants to migrate a StatefulSet across Kubernetes clusters (or namespaces). Since a StatefulSet object is a namespaced resource, in order to maintain at-most-one semantics during migration, if a pod is migrated to the new namespace, it shouldn't be able to start in the old namespace. In #3335, the approach is to modify the StatefulSet API to achieve this: an orchestrator that was aware of resources cross-cluster or cross-namespace would be responsible for manipulating which pods should be available to each StatefulSet. With this KEP, however, this behavior could be achieved by having the orchestrator set the proposed `schedulePaused` field.
@pwschuurman thanks for sharing the use case. In your migration flow, do you need to manipulate the old, already-scheduled StatefulSet to update its scheduledPaused to true? Or do you just create the new StatefulSet with scheduledPaused=true and migrate one by one?
For the migration use case, I think the new StatefulSet can be created with scheduledPaused=true. Only the new pods go through the true->false transition for scheduledPaused. For the old, already-scheduled StatefulSet, the […]. To support rollback (reverse migration), though, pods will need to go through a false->true transition. As discussed in the KEP, there isn't a strong use case to support direct modification from false->true on an existing pod. To support this scenario for our use case, though, the pod can be deleted (and […]).
Good to know this scenario.
SG, this fits the current consensus (#3522 (comment)) to prevent the false->true transition when updating an existing Pod (you have to recreate the pod).
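A sketch of the orchestrator side of this migration flow, assuming the list-of-gates API shape the thread later converges on rather than the boolean; the gate name is hypothetical:

```go
package migration

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// releasePod drops the orchestrator's scheduling gate from a pod once the
// replica is confirmed gone from the old cluster/namespace, making the new
// pod schedulable. Illustrative only.
func releasePod(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	const gate = "example.com/statefulset-migration" // hypothetical gate name

	pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gate {
			kept = append(kept, g) // keep every other controller's gate intact
		}
	}
	pod.Spec.SchedulingGates = kept
	_, err = cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```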
Force-pushed from 949c54e to 27398cd.
Force-pushed from 77019cf to 4f01ee3.
@ahg-g @smarterclayton Updated the doc based on the comments.
// This option will need to be accompanied by a yet-to-be-implemented fine-grained
// permission (https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)
// and be consistent with how finalizers/liens work today.
SchedulingGates []string
@smarterclayton it seems that now you are leaning towards this option vs. a flag, correct? If so, we also need to discuss the format; we now have two:
- similar to finalizers: actors add to the list and later remove from the same list
- similar to readinessGates: actors add to the list in the spec, and later add to a parallel list in status

Regarding option 2, do we have concerns that it will require controllers to have update permissions on pod status? I guess not, if we allow updating the list only through a dedicated endpoint, since we want to enable fine-grained permissions.
I like option 2 because it preserves the history of who decided that a pod is ready, which is useful for debugging and tracking feature usage.
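For concreteness, a sketch of the two shapes under discussion; all type and field names here are illustrative:

```go
package api

// Option 1, finalizers-like: actors add entries at creation, and each later
// removes its own entry to unblock scheduling.
type PodSpecOption1 struct {
	SchedulingGates []string
}

// Option 2, readinessGates-like: the spec lists the gates, and each actor
// later reports readiness via a matching condition in pod status, which
// preserves a record of who signed off.
type PodSpecOption2 struct {
	SchedulingGates []PodSchedulingGateOption2
}

type PodSchedulingGateOption2 struct {
	// ConditionType names the pod condition that must become True before
	// the pod may be scheduled.
	ConditionType string
}
```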
Personally, I'm in favor of readinessGates, as (a) we have pod readiness gates as a precedent, and (b) each controller doesn't need to worry about updating the same API field, which keeps factors like ordering out of the picture. However, two aspects are still unresolved (#3522 (comment)):
- Permission. It's still unclear to me. The pod readiness gates KEP didn't mention permission, so I really hope @smarterclayton and @liggitt can shed some light: if we think permission is mandatory here, we have to get it right from day 1.
- Restrictive condition transition. I suppose it's doable to enforce that the condition transitions only from NotReady to Ready, but I'm reposting here for confirmation.
I need to read the arguments around pod readiness gates and the whole status thing. Scheduling happens before ownership is transitioned to a node, so I would probably argue that, like finalizers, all mutations should happen within spec (I view metadata as generic spec in this scenario).
Anyone who can set spec.nodeName is already as powerful as someone who can set the proposed scheduling gates. OpenShift and some others had admission checks that limited who could mutate that field to effectively cluster admins (roughly, mutating that field required an RBAC check: you had to have the binding subresource permission of a scheduler).
Subresource permission is supported today, so adding a subresource is the easiest way, but we'd probably have to special-case what characteristic carries the permission (i.e. a check on `readiness/<name>`); still, it's not terribly difficult today to make the subresource `pods/readiness/<name>`. The more we start bulk-updating gates (for instance, if the scheduler implements a set of gates via plugins and we want to bulk submit, vs. letting individual actors coordinate), the closer we get to just performing field-level checks in an admission handler.
How would we model multiple gates coordinating in a single scheduler? I would very much dislike having multiple individual writes to the object, but anything that happens in a single process and is fairly quick should probably just be a single gate (even if it's implemented as a composite of steps inside the plugin).
Ultimately subresources work well when the use case is narrow and you don't need to subdivide clients excessively - this is starting to cross that line for binding if we expect lots of gates from a single client being set in a row.
The more we start bulk updating gates (for instance if the scheduler implements a set of gates via plugins and we want to bulk submit, vs letting individual actors coordinate) the more we get closer to just performing field level checks in an admission handler.
I don't think the scheduler itself will have plugins that update gates, the scheduler will only respect those gates (i.e., if they are present, it will simply not attempt to schedule the pod).
The gates will be updated by external controllers, like a queueing controller or a statefulset migration controller, those by design will likely be controllers that don't share memory and run as separate processes. I think the proper design pattern should be that a controller only handles/updates a single gate.
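A sketch of the scheduler-side behavior described here: the scheduler never manages gates, it only refuses to enqueue a pod while any gate remains. The method shape follows the PreEnqueue extension point this KEP proposes; the details are illustrative, not the shipped plugin:

```go
package gating

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// SchedulingGatesPlugin blocks gated pods at the scheduling-queue entrance.
type SchedulingGatesPlugin struct{}

func (pl *SchedulingGatesPlugin) Name() string { return "SchedulingGates" }

// PreEnqueue admits the pod into the scheduling queue only when no gates
// remain on its spec.
func (pl *SchedulingGatesPlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if len(p.Spec.SchedulingGates) == 0 {
		return nil // no gates: the pod may enter the scheduling queue
	}
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		fmt.Sprintf("waiting on %d scheduling gate(s)", len(p.Spec.SchedulingGates)))
}
```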
No matter whether the field is a single boolean or a list of strings, a Pod's creation is the only chance to set the complete gating condition(s). For a single boolean, `true` represents the condition; for a list of strings, it's a highly customized list. After a Pod's creation, only "semantical subtraction" can be allowed: for the boolean, it's true -> false; for a list of strings, it's "delete a particular element from the list" (it just behaves differently for a finalizer-like list vs. a readiness-gates-like list).
I would very much dislike having multiple individual writes to the object
For the design of a list of strings, both options (finalizer-like and readiness-gate-like) will inevitably yield N individual writes to a Pod: each controller needs to issue an API call to relax its readiness gate anyway (either by deleting one entry from the finalizer-like list, or by updating its particular .status.conditions[x] to ready/true).
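A sketch of the "semantical subtraction" rule above for the list-of-strings shape: an update may only remove entries, never add them. This is illustrative validation logic, not the code the KEP ships:

```go
package validation

import "fmt"

// validateGateUpdate rejects any update that introduces a gate absent from
// the old pod spec, so gates can only shrink after creation.
func validateGateUpdate(oldGates, newGates []string) error {
	old := make(map[string]bool, len(oldGates))
	for _, g := range oldGates {
		old[g] = true
	}
	for _, g := range newGates {
		if !old[g] {
			return fmt.Errorf("scheduling gate %q cannot be added after pod creation", g)
		}
	}
	return nil
}
```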
Hrm. In essence, we have a spectrum:
- Initializers / finalizers: `[]string`, removal only, additional data stored on the object or at worst in meta, extremely unlikely we add supporting data
- schedulingGates: removal only, additional data stored on pod spec, potentially other data in the future but unlikely, users unlikely to need the status of each gate, integrators likely able to edit spec
- readinessGates: `[]struct` + conditions, readiness is inherently reactionary and two-way, gates say they are one-way but are not (and probably can't safely change conditions), I think it is possible to lose all node state in such a way that gates have to be re-executed (but we don't clearly define that anywhere), integrators SHOULD NOT be able to edit spec, users want to see the status of each gate

I think the argument I'm getting at is that nothing requires conditions, so why are they part of the flow? Consistency is one argument, but I think the use case is closer to initializers / finalizers, and since we don't have a need, I'd argue YAGNI. If we do get feedback that users want a lot of condition state from gates, we could reassess in beta (the structure wouldn't change for our spec).
Thoughts?
I agree with option 2, and I think even now integrators are free to add conditions to the pod if they wish to offer more status information, aren't they? So there is no need to tie that to our feature.
Right, good argument.
Option 2 should work for most cases. If some case arises, like re-scheduling a pod without deleting it, the shape is also extensible / backward-compatible enough to add fields.
updated.
Force-pushed from 089dc31 to db63c8c.
Force-pushed from c82765e to e3b7d3e.
@smarterclayton @Huang-Wei I guess all unresolved issues are now addressed; I think we settled on the API and the overall semantics. The deadline is in a couple of days :)
@ahg-g yup, I think all comments have been resolved. Once @smarterclayton and @wojtek-t give a green light, I will squash the commits.
Doing a final pass now.
or not. This enables scheduler plugin developers to try out the feature even in Alpha, by crafting different enqueue plugins and wiring them up with custom fields or conditions.

- **New column in kubectl:** To provide better UX, we're going to add a column "scheduling paused"
We already have a lot of columns; I just expected this to show up in the phase column (where we summarize the state of the pod). I don't want to add another column unless we have a great reason, and this column would be empty for 99% of pods for 99.999% of their lifecycle. Can we update the text?
that's fair. Updated.
type PodSchedulingGate struct {
	// Name of the scheduling gate.
	Name string
Clarify here that name must be unique among gates? You say map earlier, but we don't clearly describe here that this must be unique among other gates.
Updated.
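For reference, a sketch of the shape this exchange converges on, with the uniqueness requirement spelled out in the field comment. The `+listType=map` / `+listMapKey=name` markers are the usual way the API machinery enforces map-like uniqueness on the key field; exact wording is the KEP's to settle:

```go
package api

type PodSpec struct {
	// SchedulingGates, if non-empty, blocks scheduling of the Pod until
	// every gate has been removed.
	// +listType=map
	// +listMapKey=name
	SchedulingGates []PodSchedulingGate
}

type PodSchedulingGate struct {
	// Name of the scheduling gate. Must be unique among the Pod's gates.
	Name string
}
```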
@Huang-Wei - the PRR will require more for Beta, but it's good enough for Alpha.
/approve PRR
/label tide/merge-method-squash
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Huang-Wei, wojtek-t.
@ahg-g @smarterclayton - please LGTM it once you're ready.
[fine-grained permissions](https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4))
- Gather feedback from developers and out-of-tree plugins.
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.
- Identify whether gates can be added post-creation
updated.
- Feature disabled by default.
- Unit and integration tests completed and passed.
- API strategy test (`pkg/registry/*`) to verify disabled fields when the feature gate is on/off.
- Additional tests are in Testgrid and linked in KEP
- Determine whether any additional state is required per gate
updated.
/lgtm
In case there are any last comments.
lgtm - excellent minimal proposal which can unlock extensibility and composition for schedulers and workloads.
+1
/hold cancel
Thanks to all reviewers for your dedicated time and valuable comments! Without them, I couldn't have made this KEP go through. I particularly appreciate the primary reviewers, @ahg-g and @smarterclayton, for your continuous, insightful comments that helped shape the final KEP.
* First version of KEP: Pod Schedule Readiness
* address comments including:
  - reword `schedulePaused` to `schedulingPaused`
  - support set spec.nodeName without clearing schedulingPaused
  - add `schedulingGates` into UNRESOLVED
  - add a story to support needs of array instead of a single boolean
  - update the version skew section
  - change the status from `provisional` to `implementable`
* fixup: rewordings
* fixup: major update
* minor: polish wordings
* address PRR comments
* address PRR comments (cont.)
* some rewordings
* address comments
/sig scheduling