-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MultiKueue] AdmissionCheck conditions are confusing #3094
Comments
Unfortunately, at this point, the conditions are not just about MK, there could be any (external) admission check that uses them. Granted, the current messages read as if there is a problem, when there is none. We could start by providing better messages. Another question comes to mind: what if there is one admission check that is ok with flavor independence and one that is not? Wouldn't both try to change the condition, rendering it unusable? |
Could we deprecate them, and drop as we move the API to Beta? I think it is ok as Alpha API doesn't guarantee backwards compatibility.
Exactly, this is what the user was concerned about -was searching if the CQ configuration is wrong being confused by the status, while the issue was with Kueue not restarted after JobSet install (another issue). |
Probably yes, but I haven't checked yet, can check, but kinda hoping we can drop them all together. |
The constants are in the beta API package though. Forgetting about backwards compatibility for a second: @trasc, could we move the conditions to the AdmissionCheckStatus instead? |
Right, but even Beta does not need to be backwards-compatible if we have a justification, looking at this: https://kubernetes.io/docs/reference/using-api/#api-versioning: "The schema and/or semantics of objects may change in incompatible ways in a subsequent beta or stable API version. When this happens, migration instructions are provided. Adapting to a subsequent beta or stable API version may require editing or re-creating API objects, and may not be straightforward. The migration may require downtime for applications that rely on the feature." On top of that, I highly doubt any users would want to use them outside of MultiKueue, and MK is still alpha as a feature. |
The intention with them (at least for SingleInstance) was to allow the cluster queue to detect the miss configuration and mark it'self as Inactive. We could in theory do it but I belive it's harder for the user to understand what is actually happening. The queue will look healthy but none of it's workload get admitted and need to check at least one workload to get the reason. (depending on when we do the AdmissionCheckStatus update we can also end up putting extra pressure on the scheduler to reserve quota just to be given back ). |
/cc @mwielgus |
It would relieve the confusion a bit, so +1 if we need to keep them in some form, but I would suggest dropping entirely.
When the CQ is mis-configured already convey this information via the ClusterQueue "Active=False" condition reason (which is determined here, and currently redundant in case of mis-configuration. It would be nice to expand the human-readable message though here. |
Opened #3099 to follow up on the discussion and improve debuggability. |
Just to make sure we are in the same page, I'm talking about https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/#kueue-x-k8s-io-v1beta1-AdmissionCheckStatus The way I'm thinking about is that the AdmissionCheckStatus would have the |
Status not State :), they should be there, I think the description of the issue is wrong. |
That is possible, I got the snippet from a confused user asking if this is indication of a problem and I assumed it is in CQ. I will double check and update the description |
I see, in that case, maybe the solution is simpler: instead of making it a "condition", they can be dedicated fields, maybe called "properties" or something along those lines. |
Custom field was one of the proposals back then, but we chose the condition to avoid API changes. However if we have concensus, it sure can be done. |
True, I remember some details. |
Updated, these are AdmissionCheck status conditions. I'm not sure the user wants to see the properties too. I think this was just redherring as it looks suspicious in the current form. IIRC they were introduced to provide a mechanism to express "don't set two instances of MultiKueue AC in a single CQ". I would suggest we drop them entirely and validate the conditions based on hard-coded rules (for the built-in ACs). if we want to have the dependencies configurable I think we need some form of API for it, and thus probably KEP update, so I'm hoping we can defer this for later, when working on "Allow validation of custom rules for AdmissionChecks". EDIT: conceptually, I think, the declarative API, would rather be on AdmissionCheck type (like MultiKueue) rather than on a concrete instance of AdmissionCheck. One place we could express this is a global configuration. Maybe there we could have some "properties" like |
It was added to say "don't set two instances of ACs managed by the same controller in a single CQ" and lately we have "don't set this AC at flavor level".
It should be about the AC controller and we should add an API object containing this, it was also proposed and rejected one of the reason being the need of user configuration. |
Right, assuming we need to support the validation rule for custom admission checks. My suggestion is that we could start with support for built-in ACs, and support custom ones in the future KEP, where we plan it properly. |
+1 on @mimowo as a lower effort solution (from a process' perspective) |
This evolving features step-by-step sound good to me. Anyway, explaining the step and planning in the KEP would be worth it. |
In my opinion restricting the usage for a hard-coded list of built-in ACCs is a step back, I'd prefer to start a discussion around a list of "proprieties" as mentioned here: #3094 (comment) . |
I believe we should proceed with deleting them for now, especially since, we remove them anyway in the ideal solution (replacing with the declarative API). I'm happy to open a dedicated issue to support some declarative API to drive validation rules for external controllers. However, given our current workstreams, it's unlikely we'll be able to properly review this before the end of November. Additionally, we haven't received any user or customer feedback requesting support for validation rules for external jobs, so I think its priority remains low until we hear otherwise. Therefore, I don't think we should block the graduation of MultiKueue on this. We can incrementally add the API later. It would be great if we could promote MultiKueue in version 0.9. Opened: #3106 |
Why should we have distinctive approaches for internal and external ACCs? |
This is not a MultiKueue problem is an AdmissionCheck one. (we should probably retitle the issue). |
Due to amount of work needed to support external vs internal ones. Note that, for the regular Jobs (no multiKueue), Kueue was only supporting built-in Jobs (just batch/Job) for a couple of first releases. Another differentiating factor is priority based on user feedback / requirements.
Is is only MultiKueue which uses them. The users reported the issue when experimenting with MultiKueue, I think it is fair to stay this way. |
Today @mimowo spent 20 minutes explaining this thing too me. I finally got what was the point of the conditions, but i'm afraid the end user will have even bigger problems to understand them. Given that, I would suggest deprecation and removal of these conditions. |
Another motivation to drop the condition noticed while working on the messages in #3127 (comment). I find it very hard to construct good, informative messages, with the current mechanism. To move forward for dropping them I see two possibilities:
While (2.) is backwards compatible, I’m leaning towards option (1.) because I believe it is highly unlikely anyone would use them by now. Second, they only support validation, so the happy path isn’t affected. |
@trasc I don't have a huge expertise in these area but both @mwielgus and @mimowo explained it to me and @alculquicondor seems to also be on-board so if my voice has any weight at all I'm +1 for what @mimowo's proposing. |
@mimowo I would recommend introducing the feature gate for those conditions. As the K8s deprecation guide, we need to provide the proper migration ways, and the AdmissionCheck API is assumed to be used by third-party AdmissionCheck controllers. So, my recommendation to mitigate the painful points for the existing users is to introduce the dedicated feature gate and disable the feature by default. Then, we take the proper deprecation steps. For example, we keep the feature gate for a while like a year, or bump AdmissionCheck API version to v1beta2. |
@tenzen-y IIUC the approach (3.) you are suggesting is to introduce a feature gate which controls the logic for the conditions: adding them and supporting, disabled by default and deprecated. I think we could drop the feature gate regardless of moving to v1beta2, because the conditions don't impact the schema, so we could say "drop the feature gate in 0.10 or 0.11, or so). EDIT: the release note could say: "ACTION REQUIRED: the Admission check status conditions “FlavorIndependent” and “SingleInstanceInClusterQueue” are no longer supported by default, If you were using any of these conditions for your external AdmissionCheck you need to enable the This sounds very reasonable to me, for me this is equally preferred as (1.). Let's hear what others think. |
If the rephrase of the conditions messages it not sufficient we could:
In a follow-up:
|
Yes, that is what I wanted to say. Thank you for expanding. It might be better to deprecate and disable it by default in v0.9 and then completely remove it in v0.10.0. Considering planning to release v0.9 in October, we can give them a reprieve for 5 or 6 months (about half a year). WDYT? |
If we decide to release v0.10 more early, we want to postpone the removal of the feature to v0.11.0 so that we can give them a reprieve for half a year. |
sgtm |
/assign |
What happened:
I got feedback from a user running into an issue with MultiKueue, and getting distracted by confusing AdmissionCheck conditions, and wasting time during investigations.
What you expected to happen:
No user-facing conditions used by MultiKueue for its inner-workings.
How to reproduce it (as minimally and precisely as possible):
Setup MultiKueue according to https://kueue.sigs.k8s.io/docs/tasks/manage/setup_multikueue/.
Anything else we need to know?:
The mechanism was introduced as part of Single instance admission check.(#1635), but the solution based on status conditions was contested and deferred to be revisted on future releases #1635 (comment).
I propose to:
The text was updated successfully, but these errors were encountered: