Add KEP for job tracking in implementable status #2308
Conversation
f6b9b56 to 100a5be
Looks good, I think the downgrade scenario is the only important comment.
// UncountedPodUIDs holds UIDs of Pods that have finished but haven't been
// accounted in Job status counters.
type UncountedPodUIDs struct {
// Enabled indicates whether Job tracking uses this struct.
Perhaps just make the `UncountedPodUIDs` field optional, then you don't need this?
something like the following?
type JobStatus struct {
// +optional
UncountedPodUIDs *UncountedPodUIDs
}
Would that differentiate `nil` from empty struct?
Yes. But after some more thought, using an explicit "Enabled" field is probably clearer, so I guess let's keep it how you have it.
What I don't like about the boolean is that it will eventually be useless. And the struct is only supposed to be handled by the Job controller, so I think I'm fine with the pointer if we can differentiate between `nil` and empty struct.
👍 for pointer, this is only meant for tracking status and when not used it'll be just empty which is sufficient.
changed to pointer.
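For illustration, a minimal sketch (simplified stand-in types, not the final batch/v1 definition; the field's name changes later in the review) of how the pointer form distinguishes "tracking enabled" from "nothing pending":

package jobapi

// Simplified stand-ins for the types discussed in this thread.
type UncountedPodUIDs struct {
	Succeeded []string
	Failed    []string
}

type JobStatus struct {
	// +optional
	UncountedPodUIDs *UncountedPodUIDs
}

// trackedWithFinalizers reports whether this Job uses the new tracking:
// nil means legacy tracking; a non-nil (possibly empty) struct means the
// finalizer-based tracking is in use.
func trackedWithFinalizers(s JobStatus) bool {
	return s.UncountedPodUIDs != nil
}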
them afterwards. The controller sends a status update.

In the case where a user or another controller removes a Pod, which sets a
deletion timestamp, the Job controller removes the finalizer. This is along with
I'm not sure I follow what you're trying to say here? It's not wrong, I just don't understand why it needs to be said here.
Moved to its own section and added an unresolved note.
might not have a finalizer. We can:
1. Prevent adoption if the Pod doesn't have the finalizer. This is a breaking
change.
1. Add the finalizer in the same patch request that modifies the owner
Should go with this option.
+1 - I don't see any disadvantages of this option.
I'm thinking of incremental changes. 1 would be easier, so I can do it in the first implementation PR. If I have time until the codefreeze, I can do 2. But if I miss the deadline, then I think it's fine to leave it for beta. WDYT?
I don't think this should be much harder than 1.
Removed the unresolved note, taking option 2.
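For option 2, a hypothetical sketch of building a single Pod patch that both claims ownership and adds the tracking finalizer (the field values and the exact patch type are assumptions; the real controller derives them from the Job object):

package jobcontroller

import "encoding/json"

// adoptionPatch returns one patch body that sets the controller owner
// reference and adds the job-tracking finalizer, so adoption and finalizer
// addition can't be separated by a controller crash.
func adoptionPatch(jobName, jobUID string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": []string{"batch.kubernetes.io/job-completion"},
			"ownerReferences": []map[string]interface{}{{
				"apiVersion":         "batch/v1",
				"kind":               "Job",
				"name":               jobName,
				"uid":                jobUID,
				"controller":         true,
				"blockOwnerDeletion": true,
			}},
		},
	}
	// The caller would send this to the Pod, e.g. as a strategic merge patch
	// via client-go's Pods(ns).Patch(...).
	return json.Marshal(patch)
}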
`batch.kubernetes.io/job-completion` finalizer. It would be hard to add the
finalizer to all Pods while preventing race conditions.

We use `.status.uncountedPodUIDs.enabled` to indicate whether the Job was
What happens if we roll back to a version of the job controller that doesn't know about the finalizer?
And what happens when we upgrade again :)
That would cause many issues. There is nothing that can remove the finalizers.
So my answer would be: do not enable the feature gate in production clusters while it's in alpha. Disabling the feature gate once the feature is enabled by default should be fine, as the controller (and an N-1 version of it) knows how to remove finalizers.
Is this acceptable?
Unfortunately, I think you might need to roll out in two releases; the first release will remove the finalizer if it is present, but doesn't add it. The second release can then start adding the finalizer and implement the rest of the feature.
The alternative is to do both of a) turning off the feature flag won't disable the finalizer removal logic and b) give users a command/script which they can use to manually remove all finalizers in the event of a downgrade.
If you're going to have an alpha release anyway, then I think it's fine to omit b) since the vast majority of users won't be in this situation.
I'm fine with the alternative. Is there a directory where we already have scripts like the one described?
No, the release notes is probably the place to put it. But if it's just alpha in the first release, probably no need.
@liggitt may have an opinion
> turning off the feature flag won't disable the finalizer removal logic

I haven't read the full context, but auto-added finalizers not being removed on rollback, failed HA upgrade, or feature disable is a really nasty failure mode. Ungated logic to clean up a finalizer starting in alpha stage could be reasonable. This is similar to what was done with the `kubernetes.io/pvc-protection` finalizer in kubernetes/kubernetes#61324 / kubernetes/kubernetes#61370
> Ungated logic to clean up a finalizer starting in alpha stage could be reasonable.

That is what's being proposed.
Agree with Daniel - that looks reasonable to me.
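A rough sketch of what the ungated cleanup amounts to (a pure in-memory helper under assumed names; issuing the actual Pod update is left to the surrounding sync code):

package jobcontroller

const jobTrackingFinalizer = "batch.kubernetes.io/job-completion"

// removeTrackingFinalizer returns the finalizer list without the tracking
// finalizer and reports whether an API update is needed. The controller would
// run this even with the feature gate off, so a rollback, failed HA upgrade,
// or gate flip never leaves Pods stuck behind the finalizer.
func removeTrackingFinalizer(finalizers []string) ([]string, bool) {
	out := make([]string, 0, len(finalizers))
	removed := false
	for _, f := range finalizers {
		if f == jobTrackingFinalizer {
			removed = true
			continue
		}
		out = append(out, f)
	}
	return out, removed
}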
18ff62b to b7c89e8
3. The Job controller removes the `batch.kubernetes.io/job-completion` finalizer
from all Pods on Succeeded or Failed phase. We can assume that all of them
have been added to the lists if the previous step succeeded. The controller
keeps note of the Pod updates that succeeded.
Not all, since you are planning to do them in batches. We will remove the ones we added to the lists in the previous step.
adjusted
- had no finalizer at the beginning of the sync, or
- had the patch in step 3 succeed, or
this is implementation detail, the algorithm is: "count pods listed in uncountedPodUIDs that either have no finalizer or are deleted"
I don't see harm in the detail.
Sometimes the detail adds confusion rather than clarity. As a reader, I paused a little reading this, wondering why we are distinguishing between the two. When I read a proposal, I expect to read about the semantics or behavior we are trying to achieve; if there is uncertainty on how to achieve that behavior, then I expect to see an implementation details section, for example. This is a nit, so take it for what it is worth.
The high-level overview is in Proposal and Notes. This is the section for nitty-gritty details.
The whole doc is a proposal (it is a key enhancement proposal) :)
there are limits to nitty-gritty details, there are ones that are useful and ones that add confusion because they don't add to the understanding of what is being proposed, my feedback is that this added confusion... again, I am replying not because I am insisting on the change, but just as a response to the justification you provided.
I agree with @ahg-g here, I'd rather have a high-level algorithm described here and we can put all the details in the actual implementation. This should reflect the high-level flow.
I added this as a request from API reviewer to be explicit about continuing to process the Pods that succeed.
Moved it as a note a paragraph below.
This allows for deleted Pods to not count towards the failed counter, like it
is today.
not clear to me which part is "like it is today", is this a change of behavior? if not, I would s/like it is today/and so this doesn't change the current behavior
it's not a change of behavior. Adjusting wording
Note that step 2 doesn't pay attention to deletion timestamps. So the Job
controller might account the Pod in the lists and counters if the Pods
enter Failed or Succeeded phase before or while being deleted.
out of curiosity, why is this worth noting?
This is how the controller works today. Except that once the Pods are completely removed, they don't count anymore.
So maybe we should just stop counting pods with a termination timestamp altogether? That's the discussion for this UNRESOLVED section.
I would appreciate your input here. There is one case where I think it might be worth reconsidering the semantics. Let's say a Node becomes unavailable and gets removed. In this case, the garbage collection also removes the Pod. Should this be considered a failure or not? Today, it's not.
I think users ideally want failures caused by non-zero exit code (a problem within the user's control) to be counted separately from failures out of their control (preemption, node failure).
Does Failure today include both or just the former? If it is just the former, we should try to keep it that way. If it is a mixture, then needs a bit more thought.
Today, it only includes failures from non-zero exit code. This is because other failures get the Pod removed.
Then, it sounds like we should be skipping Pods that have a termination timestamp from any calculations.
Wait, but there's a problem with the above:
The Pod GC doesn't take into account finalizers. That means that Pods that this controller marks for deletion would also not be counted towards calculations.
Then I think the solution would be: not to count Pods that have a termination timestamp and modify the Pod GC to skip Pods with finalizers.
The impression I've gathered from what I've seen from user reports is that users get quite grumpy when a pod starts and doesn't complete for some reason, and would like to know about that. Would they like the two classes of failures conflated? I'm less sure about the answer to that.
Like I said above, controller-created pods that failed (removed, evicted, non-zero exit) should all be treated as failed. The fact that the current controller doesn't do it is one of the limitations this proposal is trying to address. IOW, with improved status calculation we'll get a more accurate signal of failure. For example, if there's a bad actor in the system consistently removing pods from a job, an increased failure rate on a job might make the job author alarmed, and that's a good thing, imo.
I agree that this can be seen as a limitation of the legacy tracking.
Only when the Job controller itself removes the Pod should it NOT be considered a failure. This happens when parallelism is reduced or when the Job controller created extra unnecessary Pods. Added a note.
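To make the agreed semantics concrete, a hedged sketch (PodState is an invented stand-in, not the real controller code) of which terminated or deleted Pods count as failed:

package jobcontroller

// PodState is an invented stand-in for the bits of v1.Pod the decision needs.
type PodState struct {
	Phase            string // "Succeeded", "Failed", "Running", ...
	HasDeletionStamp bool   // deletionTimestamp set
	HasJobFinalizer  bool   // still carries batch.kubernetes.io/job-completion
}

// countsAsFailed reflects the discussion above: a Pod that failed with a
// non-zero exit counts as failed (as today), and a tracked Pod deleted by a
// user or another controller also counts as failed. When the Job controller
// itself deletes a Pod (e.g. parallelism reduced), it strips the finalizer
// first, so that Pod never reaches this path with HasJobFinalizer set.
func countsAsFailed(p PodState) bool {
	if p.Phase == "Failed" {
		return true
	}
	return p.HasDeletionStamp && p.HasJobFinalizer
}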
If a Job with `.status.uncountedPodUIDs.enabled = true` can adopt a Pod, this
might not have a finalizer. We can:
I think this doesn't read well. Also, isn't this possible only if the status is not yet set to Failed/Succeeded? if so, please clarify that.
The qualifications for Pod adoption don't change from current behavior. Clarified.
Does the current qualification criteria include pods that failed/succeeded? If the pod was in finished/succeeded status, the absence of the finalizer could be because the pod was already processed by the controller in a previous cycle.
In that scenario, the Pod would already have an updated owner reference.
what guarantees that? is it possible that an owner reference be cleared after the pod was counted for?
Actually there is currently no criteria other than the Job selector, which is mutable.
In theory, a Pod could finish, be accounted for Job A, lose its finalizer and then be claimed by another Job B. At this point, I think it's fair to adopt this Pod and account for it in Job B.
But what if Job A claims the Pod again? Then we have no way of adopting it without counting it again.
Perhaps we should actually only adopt a Pod if it has the finalizer.
You should really not have this problem, because only one job can have the "controller reference"; you should not clear that, and no one else should either.
right, users should not, but they could. While unlikely, I think this is worth a note because in practice this is a regression compared to current situation.
Are you referring to this being documented as unexpected behavior? https://kubernetes.io/docs/concepts/workloads/controllers/job/#specifying-your-own-pod-selector
Although I'm proposing a scenario where ownership is transferred from Job A to Job B and then back to Job A by carefully updating the selectors so that there are no conflicts at a given time. But maybe we can document this as unexpected behavior too.
Steps 2 to 4 might deal with a potentially big number of Pods. Thus, status
updates can potentially stress the kube-apiserver. For this reason, the Job
controller repeats steps 2 to 4, dealing with up to 100 Pods each time, until
the controller processes all the Pods.
I think you added this in response to Daniel's comment.
@lavalamp , were you worried about too frequent of updates (not batching) or about the size of the diff from a single update (which the above seems concerned with).
I don't think I'm concerned about either. I do think it's probably good to limit the size of the chunks we process just to guarantee progress (partial update is better than never completing a huge update). But I personally would probably have a time limit-- process all the pods you can in 1m. I would say 15s but we have to consider the possibility of apiserver taking the maximum amount of time to respond to a request (currently 60s). The tests probably need to test under the "apiserver is really slow" condition.
> process all the pods you can in 1m.
Uhmm. That would have to look like:
- repeatedly update K Pods in parallel while 1m hasn't passed.
- setup a channel and K workers, continue processing for 1m.
So then we have to decide on K. Which exact method we use can probably be left for code reviews.
cc @wojtek-t since you were asking about batching updates in the other KEP
Hmm - I don't understand why updating the status (a) with N pods and (b) with M pods is different from the kube-apiserver perspective.
I think what we should do is: (a) take time into account, (b) have a cap on the number of pods added.
We need a limit to ensure that the object won't grow super large, e.g. if the job controller was down for some time and in the meantime gazillions of pods finished.
So I take this as: we need to cap on time and number. Updated the KEP
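A non-authoritative sketch of the capped loop this settles on (the callbacks stand in for the real API calls; the exact caps are deliberately left open in the KEP):

package jobcontroller

import "time"

// trackFinishedPods processes finished Pods in chunks bounded by both a count
// (keeps the uncounted lists, and thus the Job object, small) and a wall-clock
// budget (guarantees progress and throttles status updates). Leftovers are
// picked up on the next sync.
func trackFinishedPods(
	finishedUIDs []string,
	maxPerBatch int,
	budget time.Duration,
	appendUncounted func([]string) error, // step 2: add UIDs to the Job status lists
	removeFinalizers func([]string) error, // step 3: patch the Pods
	flushCounters func([]string) error, // step 4: move UIDs into the counters
) error {
	deadline := time.Now().Add(budget)
	for len(finishedUIDs) > 0 && time.Now().Before(deadline) {
		n := maxPerBatch
		if n > len(finishedUIDs) {
			n = len(finishedUIDs)
		}
		batch := finishedUIDs[:n]
		if err := appendUncounted(batch); err != nil {
			return err
		}
		if err := removeFinalizers(batch); err != nil {
			return err
		}
		if err := flushCounters(batch); err != nil {
			return err
		}
		finishedUIDs = finishedUIDs[n:]
	}
	return nil
}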
- were removed from the system.
The counts increment the `.status.failed` and `.status.succeeded` and clears
counted Pods from `.status.uncountedPodsUIDs` lists. The controller sends a
status update.
Talk about whether there are changes to the logic where the controller decides whether to start new pods.
Does it look at the current `.status.succeeded` and compare that to `.spec.completions`? Or does it look at `.status.succeeded + len(status.uncountedPodUIDs.succeeded)`? The list could be non-empty due to the throttling mentioned below. Or does it block starting new pods if there are any uncounted pods?
Added to step 1.
I agree that the correct name should have "terminated" in it. What about `uncountedTerminatedPodUIDs`?
This is a good point, "uncounted" isn't quite right, it's "terminatedAndUncounted". Yes, that logic needs to look at 3 sources of information, a) the counts, b) the existing pods which are not in c) the terminatedAndUncounted list.
changed
410f451 to 4164b63
the controller adds finished Pods to a list in the Job status. After removing
the Pod finalizers, the controller clears the list and updates the counters.

### Risks and Mitigations
is increasing object size a risk that we should mention?
yes, added.
What level of scale do you want to support? Do you have estimates of the sizes that these lists could be, in both ordinary operation and in different failure modes?
There are several problems with very large Job statuses:
- The default request size limit on etcd is 1.5MiB, not sure if that's what most folks use. But at that size and 36 characters for a UID that's ~ 40,000 UIDs.
- In addition to the strict size limit, every time any field is updated on the status, it requires the transmission of the entire list. So large objects increase load on the network and API server more than lots of small objects, which can be updated independently.
- Using kubectl will be very painful if there are 40,000 UIDs in the list.
I think instead it would be better to use a design that doesn't have an inherent limitation on job scale. Something like the way Endpoints were broken into EndpointSlices. In this case, you would have a new resource, JobSlice or JobTracker or something, that is used to manage these states. Those would be fixed size, say, corresponding to 500 pods or so each. They would have a selector for the job, and pods would get selectors for the JobSlice instead of or in addition to the Job.
This allows arbitrarily high numbers of Pods per Job. It can also position us for sharding the reaping of Pod changes across slices by having different controller instances/replicas manage different JobSlices.
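As a quick back-of-the-envelope check of that ~40,000 figure (assuming each UID serializes to its 36-character form plus roughly 3 bytes of JSON quoting and separators):

\frac{1.5\ \mathrm{MiB}}{\approx 39\ \mathrm{bytes/UID}} \approx \frac{1.57 \times 10^{6}\ \mathrm{bytes}}{39\ \mathrm{bytes/UID}} \approx 4 \times 10^{4}\ \mathrm{UIDs}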
We've been discussing that too.
The difference with Endpoints here is that Endpoints have gazillions of watches (e.g. every kube-proxy). That's not the case for Jobs - which changes a lot.
Just a single update is generally not that expensive, even if it is for a huge object.
There are ways of mitigating stuff:
- for storing UIDs, we can have a limit, e.g. say 100 UIDs - we fully control that on the controller side and can do whatever we want
- in general, we control which pods and how fast we "finalize", so if we say we don't want the Job object to exceed e.g. 10kB, we can achieve it - there is a tradeoff of "batch size vs number of batches"
So I actually think this design isn't as bad as it looks :).
TBH - I'm actually supportive of it from a scalability POV if we implement it in a good way.
FYI, the KEP already mentions limiting the UIDs in the controller
Which are you doing:
Option 1:
- Say 100 pods recently finished.
- make 100 calls to the apiserver, one for each Pod resource.
- make 100 calls to the apiserver for the Job resource.
Option 2:
- Say 100 pods recently finished.
- make 100 calls to the apiserver, one for each Pod resource.
- make 1 call to the apiserver for the Job resource.
option 2, but it's actually 2 api calls to the job resource
Added expected number of processed Pods per minute and a target of 5000 for beta graduation.
Thanks, and I see GA grad criteria has 10^5 for completions (for a single job I presume). So the job tracking takes about 20 minutes for a job of that size, in this plan, in the degenerate scenario where every pod just exits. Is that 5000 per job, or across all jobs? If there are 10 jobs of 100k pods running, accounting takes 200 minutes? Is that OK for our users? (really asking, I have no idea. it sounds reasonable to me).
Clarified 5000 to be across any number of Jobs.
As for 10^5 criteria, clarified to be per Job and in the order of minutes. I think we can aim to be 10min, but we can tune that after alpha is implemented.
controller repeats steps 2 to 4, capping the number of Pods each time, until
the controller processes all the Pods. The number of Pods is capped by:
- time: in each iteration, the job controller removes all the Pods' finalizers
it can in 1 minute. This allows to throttle the number of status updates.
Similarly to previous example, I'd suggest saying in a small unit of time, w/o actually specifying how small that is. This might change over time so it's better not to put it explicitly here.
But 👍 for the idea itself
changed to "order of minutes". It's actually not a small unit of time.
controller tracks Pods using the legacy algorithm.

The kube-apiserver sets `.status.uncountedTerminatedPodUIDs.enabled = true` when
the feature gate `JobTrackingWithoutLingeringPods` is enabled, at Job creation.
I'm not sure if this is actually needed, the controller should be capable of reacting to the current state of the job and update its status accordingly.
We need this for any new API field to be able to handle partial upgrades in a HA control plane.
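Loosely, the apiserver side follows the usual pattern for gating new fields; a hypothetical sketch (names like prepareStatusForCreate are illustrative, not the actual registry code):

package jobregistry

type uncountedTerminatedPods struct {
	Succeeded []string
	Failed    []string
}

type jobStatus struct {
	UncountedTerminatedPods *uncountedTerminatedPods
}

// prepareStatusForCreate stands in for the Job registry strategy: with the
// gate off (or on an older apiserver that doesn't know the field), the field
// is never persisted; with the gate on, it is initialized at creation so the
// controller knows this Job uses finalizer-based tracking.
func prepareStatusForCreate(status *jobStatus, featureEnabled bool) {
	if !featureEnabled {
		status.UncountedTerminatedPods = nil
		return
	}
	if status.UncountedTerminatedPods == nil {
		status.UncountedTerminatedPods = &uncountedTerminatedPods{}
	}
}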
However, the ability to handle higher number of Pods and Jobs at a given
point justifies the added cost of the new API calls.
I don't understand the above two lines, can you clarify?
Updated
### Risks and Mitigations

1. The new implementation produces new API calls: one per Pod to remove
Can we have an estimate of the number of extra qps updates that will be used to implement this approach?
Workloads that run many short running single-pod jobs may not benefit too much from this new approach, at the same time it will negatively impact throughput since we are using some of the controller qps limit towards adding and removing the finalizer from the pod and perhaps extra updates to the job status.
/cc @wojtek-t
Not sure what else I can say: there is one more API call for each Pod created by the controller. And one for each Job, each time a group of Pods finish.
Other than that, there is no other option for single-pod jobs. We still need to guarantee that the Pod is accounted, so the finalizer is also needed.
I considered just adding the Pod to the uncounted lists at Pod creation. But that doesn't work, because we don't know if the Pod will be successful or not.
We can estimate the impact on throughput assuming an x qps limit for the following two extremes: 1) for one pod per job and 2) many pods per job. The former is probably the more important one especially for short running jobs.
How many updates per job do we currently send for a single pod job? two (one for creating the job and one for updating the status when the job completes)?
added more details and justification
And @lavalamp gave me the idea of one alleviation: to skip updates sometimes. See latest commit.
So for one pod per job we will see a ~50% drop in throughput? It used to be 1 query to create the pod and 1 to update job status; now we add two more: another job status update and 1 to remove the finalizer?
2bbb8e5 to cf9a010
account for deleted Pods. This is a limitation that this KEP also wants to
solve.

However, if it's the Job controller what deletes the Pod (when parallelism
Suggested change:
- However, if it's the Job controller what deletes the Pod (when parallelism
+ However, if the Job controller deletes the Pod (when parallelism
Done
/lgtm
e1bcc6e to baf6463
squashed
* **Will enabling / using this feature result in any new API calls?**

- PATCH Pods, to remove finalizers.
- estimated throughput: one per created Pod, when Pod finishes or is removed.
one per Pod created by the Job controller, correct? Not for every Pod created.
correct, added clarification.
- Pod
- Estimated increase: new finalizer of 33 bytes.
- Job status
- Estimated increase: new array temporarily containing terminated Pod UIDs.
See discussion above. Would like to see estimates on how big these arrays can get in both normal operation and failure modes.
I added comments below - I think we can control this (and we should) - and i agree I want to see a real number here.
But I think this is fine if we do that the right way.
estimating ~10 kB for about 500 UIDs
- time: in each iteration, the job controller removes all the Pods' finalizers
it can in a unit of time in the order of minutes. This allows to throttle the
number of Job status updates.
- count: Preventing big writes to etcd.
We may want to provide some estimate, e.g. no more than 500 UIDs can be stored in UncountedPods at any point in time.
saying "order of hundreds"
// haven't been accounted in the counters.
// If nil, Job tracking doesn't make use of this structure.
// +optional
UncountedTerminatedPods *UncountedTerminatedPods
@lavalamp @erictune - IIUC, in the past we were using a convention that status contains information that can be computed based on the cluster state.
This is more or less why things like "AcquireTime" or "RenewTime" are in LeaseSpec:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/coordination/v1/types.go#L40
[that was Brian's suggestion back then].
Did we give up on this rule? If not, then all of `Succeeded`, `Failed`, and `UncountedTerminatedPods` technically no longer belong to status...
[I don't have a strong opinion here - just want to flag this for discussion.]
@liggitt - for his API reviewer/approver opinion
Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.
> Actually this makes the Success / Finished numbers un-reproducible from the environment. UncountedTerminatedPods is (at least partially) reproducible! I think we are just giving up on this criteria to make this work. It's worth a discussion but I don't see other options. The reproducibility criterion is (IMO) mostly aesthetic (we've never made the system do anything that relied on it and it seems quite unlikely that we will in the future) and it's not worth it to keep it for this object.

agreed.
AIUI, the intent of this API Convention was that status could be stored in some less reliable storage for performance. 5+ years on from this intention, we haven't made use of that possibility. And I really doubt that all clients are prepared for the event where status disappears and then is later reconstructed. I am thinking of e.g. kubectl waiting for a Ready condition, and then all the status disappears. Is the client code going to do the right thing? What is the right thing? We never designed for that. So we should remove that API convention.
Thanks - for me it was also kind of aesthetic thing, but I wanted to ensure that there is a general agreement on that.
It seems there is, which is what I wanted to achieve.
I think the intent is also that this rule forces requirements on the cluster introspection mechanism to be able to surface anything important enough to track in status. If we have the ability to reconstruct status, then we know resources are not leaked out; our reconciliation is fully closed, without drift. If we put something in status that we cannot discover from the cluster itself we no longer have that guarantee - status could diverge from reality.
That said, in the case of job tracking, there are no resources leaked or lost track of. We lose track of some job stats potentially, but not things that are consuming underlying resources. So, only issue is the slippery slope.
* **Are there any tests for feature enablement/disablement?**

Yes, there are integration tests.
Are there? I'm assuming these are planned, but let's not mention they already exist....
reworded
- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. Extra calls are throttled
at 1 per minute.
1 per minute might be too aggressive
will this throttling delay reflecting job completion in the API object? I didn't see tests explicitly for that.
Added to test plan as integration.
I'm fine with this modulo one small comment.
Please gather SIG approval and I will approve.
- originating component: kube-controller-manager
- PUT Job status, to keep track of uncounted Pods.
- estimated throughput: at least one per Job sync. The job controller
throttles more calls at 1 per a few minutes (precise throughput TBD from
nit: per few seconds, not few minutes;
[or to be more specific, I think what we should do is:
- as soon as we gather N pods (to feed the 20kB limit you mentioned below)
- but at least every X(=10?) seconds, to make it somewhat responsive]
Changed to seconds and added more detail in the Algorithm section.
Some reviewers requested that the KEP leaves room for experimentation before settling on specific numbers.
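For concreteness, one possible shape of that flush rule (N and X are placeholders precisely because the KEP leaves them open):

package jobcontroller

import "time"

const (
	maxUncountedUIDs = 500              // placeholder count cap (order of hundreds)
	maxFlushInterval = 10 * time.Second // placeholder time cap
)

// shouldFlush triggers a Job status update either when enough uncounted UIDs
// have accumulated or when too much time has passed since the last update,
// whichever comes first.
func shouldFlush(pendingUIDs int, lastFlush, now time.Time) bool {
	return pendingUIDs >= maxUncountedUIDs || now.Sub(lastFlush) >= maxFlushInterval
}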
289dc2a to 7d98878
@janetkuo this is ready for SIG approval
3462c95 to 43dd18f
Signed-off-by: Aldo Culquicondor <[email protected]>
43dd18f to 0fe1d55
/lgtm
type UncountedTerminatedPods struct {
// Succeeded holds UIDs of succeeded Pods.
Succeeded []types.UID
// Succeeded holds UIDs of failed Pods.
nit: Failed
done
cycle.

1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
So the behavior of `.status.succeeded` is slightly different now. Does this break API compatibility?
@janetkuo I don't think we are changing this that much; there are slight variances in how this will get calculated during execution, but the final result will be the same. Nobody should ever rely on how this number is being updated, nor on partial results. I'm not personally worried about that breaking API compatibility.
SGTM
1. The Job controller calculates the number of succeeded Pods as the sum of:
- `.status.succeeded`,
- the size of `job.status.uncountedTerminatedPods.succeeded` and
- the number of finished Pods that are not in `job.status.uncountedTerminatedPods.succeeded`.
Don't these overlap with the pods counted in `.status.succeeded`? Would you rephrase?
right, they should have a finalizer too, fixed.
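A small sketch of the corrected calculation (trimmed stand-in types; the real code works on v1.Pod objects):

package jobcontroller

type podView struct {
	UID          string
	Succeeded    bool // phase == Succeeded
	HasFinalizer bool // still carries the tracking finalizer
}

// successfulCount adds up the three sources discussed above: the already
// counted total, the UIDs parked in uncountedTerminatedPods.succeeded, and
// finished Pods that still hold the finalizer but aren't in that list yet.
func successfulCount(counted int32, uncountedSucceeded []string, pods []podView) int32 {
	inList := make(map[string]bool, len(uncountedSucceeded))
	for _, uid := range uncountedSucceeded {
		inList[uid] = true
	}
	total := counted + int32(len(uncountedSucceeded))
	for _, p := range pods {
		if p.Succeeded && p.HasFinalizer && !inList[p.UID] {
			total++
		}
	}
	return total
}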
However, if the Job controller deletes the Pod (when parallelism is decreased,
for example), the controller removes the finalizer before deleting it. Thus,
these deletions don't count towards the failures.
From the description here, it'd apply to suspended Jobs (KEP 2232) as well, I assume?
yes, but applies to any other future feature that might delete pods.
during API review.

### Algorithm
If the user deletes the Job, how are the finalizers of its Pods handled?
Added a section.
#### Beta -> GA Graduation

- E2E test graduates to conformance.
- Job tracking scales to 10^5 completions per Job processed within an order of
If we assume that "order of minutes" is 5 minutes - that would mean O(300) QPS to delete finalizers.
I don't think we should be that aggressive - it's much more than we support now.
I was thinking more like 10min. Is that still too much? This would be assuming that this is the only Job though.
IMO in the situation where you have a job that's that big, 300 QPS is perfectly reasonable-- that's probably a large fraction of what you're running on the cluster anyway. Compare it to the writes it took to actually get all those pods through their lifecycle. It's not adding much.
If you're saying how it should be in an ideal world - sure, I believe even more should be possible.
If you're asking where we are now - I doubt we will be able to do that (I mean, we can handle 300 qps of course, but we can't blindly bump qps limits until we have P&F supporting things like LISTs, and also some rate-limiting that would rate-limit watches - whatever that means).
Also note - it highly depends on how long a particular pod takes to run - if those are very short-running, then it matters; if they are long-running, then it's not that
This is still in the Beta -> GA graduation, so I think it should be fine to have this as tentative goal
Signed-off-by: Aldo Culquicondor <[email protected]>
/approve PRR
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, erictune, janetkuo, soltysh, wojtek-t. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Adding lgtm, as Janet's approval was the last thing in this PR. /lgtm
Ref #2307 kubernetes/kubernetes#8486
/sig apps