Backoff policy and failed pod limit #583

soltysh · 2017-04-26T10:12:41Z

This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964.

@erictune I've mentioned this to you during last sig-apps, ptal
@kubernetes/sig-apps-feature-requests ptal
@lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this

soltysh · 2017-04-26T10:14:05Z

contributors/design-proposals/job.md

+limits, described above, are set.
+
+All of the above fields will be optional and will apply no matter which `restartPolicy`
+is set on a `PodTemplate`.


I was struggling with this idea, but I think we have necessary information when failure happens in both cases. When Never we count failures ourselves, with OnFailure we can rely on no of restarts. Alternatively, I was thinking to address that only for Never, but I want to hear from others.

For our use case it would be useful to rely on OnFailure policy and then counting number of restarts (this is in fact what we are doing from outside, watching status over stream connections). Since we aim to submit large amounts of jobs, it would be better to have only one container retrying rather that having a new pod per failure spawned by the job controller. Unless you think this is not going to end up creating performance issues.

I interpret the Job restart limit to be predicated on the number of pod failures. If a pod has a restart policy of OnFailure, by my reading of the documentation, a pod in which its sole container terminates in failure means the whole pod transitions to the "Failed" state, perhaps only momentarily, even if the container gets restarted and the pod transitions back to the "Running" state. However, the "Example states" section at the end of the same document never mentions such a transition.

That means that to count the number of attempts for a pod template with restart policy of Never, we'd have to count the number of pods created for the Job. For a policy of OnFailure, that count is the number of pods created (assuming it's possible for a Job to lose a pod and have to schedule a replacement) plus the number of container restarts. (Each pod creation implies each container starting once.)

However, for a pod with multiple containers, we're then treating each failure of a container to imply one failure of the containing pod. That's confusing to reason about. The alternative is to say that a pod with a restart policy of OnFailure only participates in this accounting when the entire pod fails, such as due to the disk failing or the hosting node is removed from the cluster.

The reason I went with both actually is that I don't want to connect pod restart policy with job backoff policy, in the first place. Additionally, if pod's restart policy is never the kubelet will keep on creating new pods along the old ones (that's the job contract), which in turns the FailedPodsLimit can address.
Currently Never is the only one that returns the pod failure back to job controller which creates a new pod to satisfy completions. With OnFailure we're passing that responsibility to kubelet. After talking to @Kargakis I'm starting to think we should've restricted jobs to only use Never, but it's too late now because it would break to many folks. That's why I'd go with the backoff policy not being connected with pod's restart policy to limit the amount of confusion we already caused allowing both restart policies. Does that make sense?

It does make sense to me. It's important that users understand, though, that with OnFailure a container failure counts against the Job retry accounting.

If a Job uses pods with three containers, and two of the containers fail "at the same time" and get restarted by the kubelet, does that count as two retries of the Job?

I'll add that explanation, in that case.

nickschuch

All this makes perfect sense to me. Would you like me to start demonstrating this in my PR #41451?

nickschuch · 2017-04-26T10:38:10Z

contributors/design-proposals/job.md

+    // Optional number of retries, before marking this job failed.
+    BackoffLimit *int
+
+    // Optional time (in seconds), how log a job should be retried before marking it failed.


seh · 2017-04-26T12:58:34Z

contributors/design-proposals/job.md

+By design, Jobs do not have any notion of failure, other than Pod's `restartPolicy`
+which is mistakenly taken as Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]).  There are
+situation where one wants to fail a Job after some amount of retries over certain


Did you mean "over a certain period" or "over multiple periods?"

a certain period, will fix.

seh · 2017-04-26T12:59:21Z

contributors/design-proposals/job.md

+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]).  There are
+situation where one wants to fail a Job after some amount of retries over certain
+period of time, due to a logical error in configuration etc.  To do so we are going
+following fields will be introduced, which will control the exponential backoff


Is there a sentence fragment missing in between "we are going" and "following fields?"
Maybe, "we are going to introduce the following fields?"

seh · 2017-04-26T13:00:18Z

contributors/design-proposals/job.md

+situation where one wants to fail a Job after some amount of retries over certain
+period of time, due to a logical error in configuration etc.  To do so we are going
+following fields will be introduced, which will control the exponential backoff
+when retrying Job: number of retries and time to retry.  The two fields will allow


s/retrying Job/retrying a Job/
or maybe s/retying Job/retrying Jobs/

seh · 2017-04-26T13:00:39Z

contributors/design-proposals/job.md

+period of time, due to a logical error in configuration etc.  To do so we are going
+following fields will be introduced, which will control the exponential backoff
+when retrying Job: number of retries and time to retry.  The two fields will allow
+creating a fine grain control over the backoff policy, limiting the number of retries


s/a fine grain/fine-grained/

seh · 2017-04-26T13:01:04Z

contributors/design-proposals/job.md

+following fields will be introduced, which will control the exponential backoff
+when retrying Job: number of retries and time to retry.  The two fields will allow
+creating a fine grain control over the backoff policy, limiting the number of retries
+over specified period of time.  In the case when only one of them is specified


s/over specified/over a specified/

s/is specified/is specified,/

seh · 2017-04-26T13:08:09Z

contributors/design-proposals/job.md

+    // before the system tries to terminate it; value must be positive integer
+    ActiveDeadlineSeconds *int
+
+    // Optional number of retries, before marking this job failed.


s /retries,/retries/

seh · 2017-04-26T13:08:51Z

contributors/design-proposals/job.md

+    // Optional number of retries, before marking this job failed.
+    BackoffLimit *int
+
+    // Optional time (in seconds), how log a job should be retried before marking it failed.


s/), how/) specifying how/

Did you consider using type *uint instead?
That could encourage unintentional overflow when used with time.

seh · 2017-04-26T13:11:14Z

contributors/design-proposals/job.md

@@ -26,6 +27,31 @@ Jobs are needed for executing multi-pod computation to completion; a good exampl
 here would be the ability to implement any type of batch oriented tasks.


+## Backoff policy and failed pod limit
+
+By design, Jobs do not have any notion of failure, other than Pod's `restartPolicy`


s/than Pod's/than a pod's/

seh · 2017-04-26T13:11:39Z

contributors/design-proposals/job.md

@@ -18,6 +18,7 @@ Several existing issues and PRs were already created regarding that particular s
 1. Be able to get the job status.
 1. Be able to specify the number of instances performing a job at any one time.
 1. Be able to specify the number of successfully finished instances required to finish a job.
+1. Be able to specify backoff policy, when job is continuously failing.


s/specify backoff/specify a backoff/

seh · 2017-04-26T13:29:14Z

contributors/design-proposals/job.md

+limits, described above, are set.
+
+All of the above fields will be optional and will apply no matter which `restartPolicy`
+is set on a `PodTemplate`.


I interpret the Job restart limit to be predicated on the number of pod failures. If a pod has a restart policy of OnFailure, by my reading of the documentation, a pod in which its sole container terminates in failure means the whole pod transitions to the "Failed" state, perhaps only momentarily, even if the container gets restarted and the pod transitions back to the "Running" state. However, the "Example states" section at the end of the same document never mentions such a transition.

That means that to count the number of attempts for a pod template with restart policy of Never, we'd have to count the number of pods created for the Job. For a policy of OnFailure, that count is the number of pods created (assuming it's possible for a Job to lose a pod and have to schedule a replacement) plus the number of container restarts. (Each pod creation implies each container starting once.)

However, for a pod with multiple containers, we're then treating each failure of a container to imply one failure of the containing pod. That's confusing to reason about. The alternative is to say that a pod with a restart policy of OnFailure only participates in this accounting when the entire pod fails, such as due to the disk failing or the hosting node is removed from the cluster.

soltysh · 2017-04-26T14:14:39Z

All this makes perfect sense to me. Would you like me to start demonstrating this in my PR #41451?

@nickschuch let's wait for this to get to a stable shape (more or less) and then you can go ahead an open a fresh one for it. I'll close the old one.

soltysh · 2017-04-27T15:25:00Z

I've addressed first batch of comments.

seh

I cited one nit that we could fix and asked for clarification, but I won't hold up a merge for them.

seh · 2017-04-27T23:51:13Z

contributors/design-proposals/job.md

+is set on a `PodTemplate`.  The only difference applies to how failures are counted.
+For restart policy `Never` we count actual pod failures (reflected in `.status.failed`
+field). With restart policy `OnFailure` we look at pod restarts (calculated from
+`.status.containerStatuses[*].restartCount`).


seh · 2017-04-27T23:52:00Z

contributors/design-proposals/job.md

-too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243))
+to introduce following fields will be introduced, which will control the exponential
+backoff when retrying a Job: number of retries and time to retry.  The two fields
+will allow creating a fine-grained control over the backoff policy, limiting the


Reading this again, I think this would read better as

will allow fine-grained control over

seh · 2017-04-27T23:54:33Z

contributors/design-proposals/job.md

    BackoffLimit *int

-    // Optional time (in seconds), how log a job should be retried before marking it failed.
+    // Optional time (in seconds) specifying how long a job should be retried before marking it failed.


Obviously this duration shouldn't be negative (in which case I assume we'd ignore it, or reject it), but what about zero?
Since it's a pointer, the pointer being nil means that it's not set, but what if it's present but zero? Is that the same as it not being set, or is that a valid duration meaning that we won't wait any amount of time for a job complete (that it must complete immediately)?

We usually require those being >= 0, it'll be checked in validation.

agonzalezro

Nice approach, I think those changes are going to be really useful!

agonzalezro · 2017-04-28T09:15:47Z

contributors/design-proposals/job.md

+which is mistakenly taken as Job's restart policy ([#30243](https://github.com/kubernetes/kubernetes/issues/30243),
+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]).  There are
+situation where one wants to fail a Job after some amount of retries over a certain
+period of time, due to a logical error in configuration etc.  To do so we are going


I think you are repeating "introduce{,d}". It should be something like: "To do so we are going to introduce the following fields"

agonzalezro · 2017-04-28T09:19:48Z

contributors/design-proposals/job.md

@@ -79,8 +109,21 @@ type JobSpec struct {
    // job should be run with. Defaults to 1.
    Completions *int

+    // Optional duration in seconds relative to the startTime that the job may be active
+    // before the system tries to terminate it; value must be a positive integer.
+    ActiveDeadlineSeconds *int


Can uint be used, also in the other fields?

We used ints, for consistency and then do appropriate validation so this is non-negative and bigger than 0.

*int32

Actually *int64

Can you also update the rest of the fields to their canonical types?

I think this is the deadline for the entire job to complete, regardless of whether there are restarts, right? Document that.

It clearly states it's relative to job's startTime.

agonzalezro · 2017-04-28T09:21:17Z

contributors/design-proposals/job.md

@@ -79,8 +109,21 @@ type JobSpec struct {
    // job should be run with. Defaults to 1.
    Completions *int

+    // Optional duration in seconds relative to the startTime that the job may be active


I think the comments for exported fields should start with the name of the field: "ActiveDeadlineSeconds is an optional..."

Not anymore, the def will have those in place.

soltysh · 2017-05-02T21:35:23Z

@erictune / @janetkuo any objections for merging this in?

0xmichalis · 2017-05-05T12:30:01Z

contributors/design-proposals/job.md

+    BackoffLimit *int
+
+    // Optional time (in seconds) specifying how long a job should be retried before marking it failed.
+    BackoffDeadlineSeconds *int


ProgressDeadlineSeconds to match the field in Deployments? Seems like it's the same concept although in Deployment we merely add a condition in the DeploymentStatus.

How would a user typically use ActiveDeadlineSeconds with BackoffDeadlineSeconds?

@erictune @soltysh if I understand correctly the only check system may run is: ActiveDeadlineSeconds must be strictly bigger than BackoffDeadlineSeconds, missing something?

@Kargakis this is different than ProgressDeadlineSecods, in deployments world PDS is how long you are waiting for the job to make progress and than fail. With Job BackoffDeadlineSeconds applies at any point in time. This means that if a Job expects to reach 5 completions, 4 of them can succeed and only the 5th Pod might be having troubles and BDS should still apply. That's why I'd rather not to use the same name, it's intentional.

@erictune @sdminonne ADS is the overall time for a Job (max life-time). Whereas BDS is counted from the moment the first failure happens. But yes, usually BDS will be smaller than ADS, otherwise you'll never hit BDS in case of a failure.

Well, PDS is not that different. We start counting from the last time we saw progress and once the deadline is exceeded we merely switch the Progressing condition to False. "Failing the deployment" should be something that a higher-level orchestrator decides based on the condition.

0xmichalis · 2017-05-05T12:32:39Z

contributors/design-proposals/job.md

+    BackoffDeadlineSeconds *int
+
+    // Optional number of failed pods to retain.
+    FailedPodsLimit *int


When are pods left around? I guess when pods have restartPolicy=Never? It feels weird to add a feature that is not related directly to the failurePolicy and also is useful for a specific case in Jobs. Can we leave it out of this proposal?

This is for debugging purposes. Leaving pods should allow you to figure out what went wrong. Without it and with restart policy Never you might end up with many failed pods, if you set the limits high. That was one of the original requests wrt to the backoff policy, so I think it's crucial to address it here.

0xmichalis · 2017-05-05T12:35:48Z

contributors/design-proposals/job.md

+to introduce following fields, which will control the exponential backoff when
+retrying a Job: number of retries and time to retry.  The two fields will allow
+fine-grained control over the backoff policy, limiting the number of retries over
+a specified period of time.  If only one of the two fields is supplied, an exponential


It's not clear what happens if both fields are specified. Is only the number of retries limited or both of the fields limit each other?

Ok, it's clearer after rereading the bullet points below.

erictune · 2017-05-05T17:51:32Z

contributors/design-proposals/job.md

+retrying a Job: number of retries and time to retry.  The two fields will allow
+fine-grained control over the backoff policy, limiting the number of retries over
+a specified period of time.  If only one of the two fields is supplied, an exponential
+backoff with an intervening duration of ten seconds and a factor of two will be


I don't think we should promise our users exponential backoff -- we should give ourselves the freedom to do other things in the future. Just say that we will retry at a time that the system choses, and that in any case the number of retries won't exceed the number specified by the user.

erictune · 2017-05-05T17:56:49Z

contributors/design-proposals/job.md

+too many failed pods left around (as mentioned in [#30243](https://github.com/kubernetes/kubernetes/issues/30243)),
+we are going to introduce a field which will allow specifying the maximum number
+of failed pods to keep around.  This number will also take effect if none of the
+limits described above are set.


Is the intent that users will typically set this to 0 unless they expect to need to debug? Or do we typically want to keep around at least 1 failed pod so the user can get the logs?

not save all pods without

I vote to have this defaulted to 1

1 seems reasonable default.

erictune · 2017-05-05T17:59:11Z

contributors/design-proposals/job.md

+is set on a `PodTemplate`.  The only difference applies to how failures are counted.
+For restart policy `Never` we count actual pod failures (reflected in `.status.failed`
+field). With restart policy `OnFailure` we look at pod restarts (calculated from
+`.status.containerStatuses[*].restartCount`).


This cannot be synchronous, though, so the limit has to be approximate. Document that it is precise with Never, and approximate with OnFailure?

Yup, makes sense.

erictune · 2017-05-05T18:00:02Z

contributors/design-proposals/job.md

@@ -79,8 +109,21 @@ type JobSpec struct {
    // job should be run with. Defaults to 1.
    Completions *int

+    // Optional duration in seconds relative to the startTime that the job may be active
+    // before the system tries to terminate it; value must be a positive integer.
+    ActiveDeadlineSeconds *int


I think this is the deadline for the entire job to complete, regardless of whether there are restarts, right? Document that.

erictune · 2017-05-05T18:00:30Z

contributors/design-proposals/job.md

+    ActiveDeadlineSeconds *int
+
+    // Optional number of retries before marking this job failed.
+    BackoffLimit *int


Why BackoffLimit instead of RetryLimit or TriesLimit?

I went with Backoff as a prefix for every field related to that, since we're talking about Backoff policy.

erictune · 2017-05-05T18:01:46Z

contributors/design-proposals/job.md

+    BackoffLimit *int
+
+    // Optional time (in seconds) specifying how long a job should be retried before marking it failed.
+    BackoffDeadlineSeconds *int


How would a user typically use ActiveDeadlineSeconds with BackoffDeadlineSeconds?

sdminonne · 2017-06-12T14:01:01Z

contributors/design-proposals/job.md

+[#[43964](https://github.com/kubernetes/kubernetes/issues/43964)]).  There are
+situation where one wants to fail a Job after some amount of retries over a certain
+period of time, due to a logical error in configuration etc.  To do so we are going
+to introduce following fields will be introduced, which will control the exponential


~~will be introduced~~

soltysh · 2017-06-22T11:45:29Z

@erictune @Kargakis I've addressed your comments, ptal once more. @sdminonne mind pointing your guy who will be implementing this to have a look at it, as well.

sdminonne · 2017-06-22T11:48:33Z

@soltysh Cedric's GH account is @clamoriniere1A

sdminonne · 2017-06-22T11:56:14Z

@clamoriniere1A mind have a look at this and eventually comment?

clamoriniere1A · 2017-06-23T14:53:05Z

@sdminonne sure, I have started to look at it.

clamoriniere1A · 2017-06-27T06:12:02Z

I started to work on the Job Backoff policy and failed pod limit: kubernetes/kubernetes#48075

k8s-github-robot · 2017-08-28T18:46:52Z

Automatic merge from submit-queue

Fix: kubernetes/community#583 Job failure policy integration in JobController. From the JobSpec.BackoffLimit and JobSpec.BackoffSeconds the JobController will define the backoff duration between Job retry. Currently the number of retry for each job is store in the DynamicBackoff struct that is local to the JobController. It means that if the JobController restarts the number of retries will be lost and the backoff duration will be reset to 0.

For the "Backoff policy and failed pod limit" implementation (kubernetes/community#583), we need to retrieve the number of backoff retry and also be able to set a different initial backoff duration for each entry. This Pull request adds two new methods to ```Backoff```: - ```NextWithInitDuration```: as ```Next``` the purpose of this method is to increment the backoff delay exponentially. In addition, you need to provide the initial delay duration if the entry is not already present. - ```GetWithRetryNumber```: returns the backoff delay associated to an entry and also the retry number corresponding to the number of time that the ```Next|GetWithRetryNumber``` has been call.

Fix: kubernetes/community#583 Job failure policy integration in JobController. From the JobSpec.BackoffLimit and JobSpec.BackoffSeconds the JobController will define the backoff duration between Job retry. Currently the number of retry for each job is store in the DynamicBackoff struct that is local to the JobController. It means that if the JobController restarts the number of retries will be lost and the backoff duration will be reset to 0.

Add new fields in api v1.JobSpec object for backoff policy - BackoffLimit - FailedPodsLimit fixes: kubernetes/community#583

Automatic merge from submit-queue (batch tested with PRs 51335, 51364, 51130, 48075, 50920) [API] Feature/job failure policy **What this PR does / why we need it**: Implements the Backoff policy and failed pod limit defined in kubernetes/community#583 **Which issue this PR fixes**: fixes #27997, fixes #30243 **Special notes for your reviewer**: This is a WIP PR, I updated the api batchv1.JobSpec in order to prepare the backoff policy implementation in the JobController. **Release note**: ```release-note Add backoff policy and failed pod limit for a job ```

…icy_controller Automatic merge from submit-queue Job failure policy controller support **What this PR does / why we need it**: Start implementing the support of the "Backoff policy and failed pod limit" in the ```JobController``` defined in kubernetes/community#583. This PR depends on a previous PR #48075 that updates the K8s API types. TODO: * [X] Implement ```JobSpec.BackoffLimit``` support * [x] Rebase when #48075 has been merged. * [X] Implement end2end tests implements kubernetes/community#583 **Special notes for your reviewer**: **Release note**: ```release-note Add backoff policy and failed pod limit for a job ```

paulwalker · 2017-09-07T00:17:45Z

This recent development looks good, but If I may ask, what is the current best config/pattern for a try-once job in 1.7.3? TIA

soltysh · 2017-09-08T11:01:20Z

@paulwalker there's nothing unfortunately. You'd need to implement similar mechanism inside of your job manually, I'm afraid 😞

0xmichalis · 2017-09-08T11:20:11Z

Or run a pod with RestartPolicy=Never.

chirag04 · 2017-09-20T14:24:39Z

@erictune can you elaborate on Job terminated after (approximately) first Pod Failure -- must use with Never to get close to Run-Once behavior?

why is that a close to Run-once behavior?

erictune · 2017-09-20T15:17:59Z

With these changes, a job with "RestartPolicy=Never" and BackoffLimit=0 with run once under almost all circumstances. However, the communication between the job controller, apiserver, kubelet and docker is not atomic. So, it is possible that under some very unlikely combination of crashes/restarts in the kubelet, docker and/or job controller, that the program in the Job's pod template might be started twice. Basically, I would use it for any task that has built-in idempotence, and for most types of tasks that don't have built-in idempotence. I would not use it for say, transferring money, without some additional application-level check to avoid duplicates.

…

On Wed, Sep 20, 2017 at 7:24 AM, Chirag ***@***.***> wrote: @erictune <https://github.com/erictune> can you elaborate on Job terminated after (approximately) first Pod Failure -- must use with Never to get close to Run-Once behavior? why is that a close to Run-once behavior? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#583 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHuudr9LQEwfmHXGZAexkK4AKRbwQybpks5skSAzgaJpZM4NIpmw> .

chirag04 · 2017-09-20T15:28:55Z

@erictune thanks for the explanation and that makes perfect sense. 👍

chenchun · 2017-12-12T02:15:58Z

contributors/design-proposals/job.md

+This will result in the following retry sequence: 10s, 20s, 40s, 1m20s, 2m40s,
+5m20s.  After which the job will be considered failed.
+
+Additionally, to help debug the issue with a Job, and limit the impact of having


Why not also limit the number of successful pods which may takes a lot of resource either ?

@erictune

Automatic merge from submit-queue Backoff policy and failed pod limit This update addresses problems raised in kubernetes/kubernetes#30243 and kubernetes/kubernetes#43964. @erictune I've mentioned this to you during last sig-apps, ptal @kubernetes/sig-apps-feature-requests ptal @lukasheinrich @nickschuch @Yancey1989 @sstarcher @maximz fyi since you all were interested in this

…netes#583) * Update WORKING-GROUPS.md * Update WORKING-GROUPS.md * Update WORKING-GROUPS.md

soltysh added 2 commits April 26, 2017 12:09

Update proposal to match current job spec

4be538d

Add backoff policy and failed pod limit

bd5d3d9

soltysh assigned erictune Apr 26, 2017

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 26, 2017

soltysh commented Apr 26, 2017

View reviewed changes

soltysh mentioned this pull request Apr 26, 2017

Maximum number of failures or failure backoff policy for Jobs kubernetes/kubernetes#30243

Closed

nickschuch reviewed Apr 26, 2017

View reviewed changes

seh suggested changes Apr 26, 2017

View reviewed changes

soltysh mentioned this pull request Apr 26, 2017

Job: Failure Threshold kubernetes/kubernetes#41451

Closed

Address the first batch of comments

caee494

seh approved these changes Apr 27, 2017

View reviewed changes

agonzalezro reviewed Apr 28, 2017

View reviewed changes

Address comments

00def1f

0xmichalis mentioned this pull request May 5, 2017

Failure policy for Jobs kubernetes/enhancements#298

Closed

0xmichalis reviewed May 5, 2017

View reviewed changes

erictune reviewed May 5, 2017

View reviewed changes

sdminonne reviewed Jun 12, 2017

View reviewed changes

Yancey1989 mentioned this pull request Jun 14, 2017

Kubernetes create so much Pods when a trainer failed PaddlePaddle/PaddleCloud#149

Closed

Address comments

5b6d2ce

clamoriniere1A mentioned this pull request Jun 26, 2017

[API] Feature/job failure policy kubernetes/kubernetes#48075

Merged

k8s-github-robot merged commit 0672a5f into kubernetes:master Aug 28, 2017

soltysh deleted the job_failure_policy branch August 29, 2017 07:08

This was referenced Aug 29, 2017

Case report: Job somehow made 84k pods kubernetes/kubernetes#27997

Closed

Notification of failed CronJobs kubernetes/kubernetes#22207

Closed

clamoriniere1A mentioned this pull request Aug 30, 2017

Feature/flowcontrol.backoff add retry number + init duration kubernetes/kubernetes#51626

Closed

clamoriniere1A added a commit to clamoriniere1A/kubernetes that referenced this pull request Sep 1, 2017

update API v1 Job object

07ef9eb

Add new fields in api v1.JobSpec object for backoff policy - BackoffLimit - FailedPodsLimit fixes: kubernetes/community#583

clamoriniere1A added a commit to clamoriniere1A/kubernetes that referenced this pull request Sep 1, 2017

update API v1 Job object

3989b18

Add new fields in api v1.JobSpec object for backoff policy - BackoffLimit - FailedPodsLimit fixes: kubernetes/community#583

clamoriniere1A mentioned this pull request Sep 6, 2017

Update Job workload documentation with backoff failure policy kubernetes/website#5319

Merged

advdv mentioned this pull request Nov 28, 2017

Investigate how a Job can be stopped nerdalize/nerd#197

Closed

chenchun reviewed Dec 12, 2017

View reviewed changes

danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023

Update WORKING-GROUPS.md to accurately reflect workgroup leads (kuber…

201ae03

…netes#583) * Update WORKING-GROUPS.md * Update WORKING-GROUPS.md * Update WORKING-GROUPS.md

Backoff policy and failed pod limit #583

Backoff policy and failed pod limit #583

Conversation

soltysh commented Apr 26, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nickschuch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

soltysh commented Apr 26, 2017

soltysh commented Apr 27, 2017

seh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agonzalezro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

soltysh commented May 2, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

soltysh commented Jun 22, 2017

sdminonne commented Jun 22, 2017

sdminonne commented Jun 22, 2017

clamoriniere1A commented Jun 23, 2017

clamoriniere1A commented Jun 27, 2017 • edited Loading

k8s-github-robot commented Aug 28, 2017

paulwalker commented Sep 7, 2017

soltysh commented Sep 8, 2017

0xmichalis commented Sep 8, 2017

chirag04 commented Sep 20, 2017

erictune commented Sep 20, 2017 via email

clamoriniere1A commented Jun 27, 2017 •

edited

Loading