Autoscaler removing kubernetes Job Pods leading to JobBackOff #7095

Open
pythonking6 opened this issue Jul 27, 2024 · 3 comments
Labels
area/cluster-autoscaler, kind/bug, lifecycle/stale

Comments

pythonking6 commented Jul 27, 2024

Which component are you using?:

cluster-autoscaler, running on AWS EKS

What version of the component are you using?:
https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

[ec2-user@ip-xx-xx-xx-xx ~]$ helm list -n kube-system
NAME                             NAMESPACE    CHART                                   APP VERSION
cluster-autoscaler               kube-system  cluster-autoscaler-9.21.0               1.23.0
cluster-proportional-autoscaler  kube-system  cluster-proportional-autoscaler-1.0.1   1.8.6

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
v1.25.0

What environment is this in?:

EKS running in AWS. The deployed cluster is using Kubernetes version 1.28.

What did you expect to happen?:

I have 40 Kubernetes Jobs that are scheduled simultaneously. If I manually scale up to 40 GPUs and disable downscaling, I have no issues: all 40 jobs run to completion. However, when I let the autoscaler scale up based on the nvidia.com/gpu: 1 request in the job manifest, two things happen:

  1. The autoscaler scales up twice as many GPUs as needed (so 80 instead of 40).
  2. The autoscaler then realizes that’s too many GPUs and starts to downscale after coolDownPeriod.

I expect the autoscaler to allocate 40 GPUs. I also expect the autoscaler to leave long-running pods untouched until they complete.
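
For concreteness, a minimal sketch of the kind of Job manifest described here; the name, label, and image are placeholders, not taken from the actual cluster:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01                      # placeholder; 40 Jobs like this are created
spec:
  backoffLimit: 0                       # deliberately zero, as noted below
  template:
    metadata:
      labels:
        app: gpu-job                    # placeholder label
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1         # one GPU per Job, so 40 Jobs should need 40 GPUs
```
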
What happened instead?:

Some of the pods get a SIGTERM signal and terminate. This leads to the Job's backoffLimit (which I deliberately set to zero) being reached. Moreover, I have configured the autoscaler with a utilization threshold of zero via the flag --scale-down-utilization-threshold=0, so that even if a pod isn’t using any of its GPU, the node under it should not be destroyed until the job is completed.
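
For reference, with the cluster-autoscaler Helm chart this flag is typically passed through extraArgs in the values file; a sketch under that assumption (not the reporter's actual values file):

```yaml
# values.yaml fragment for the cluster-autoscaler chart (sketch, assumed extraArgs usage)
extraArgs:
  scale-down-utilization-threshold: 0   # rendered as --scale-down-utilization-threshold=0
  # cloud provider, node group auto-discovery and other settings omitted
```
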
How to reproduce it (as minimally and precisely as possible):

Run 40 Kubernetes Jobs in the same namespace and let the autoscaler scale up and down as it sees fit.
Anything else we need to know?:

If I freeze the number of GPUs at 40 and let the jobs run to completion, there are no issues. I have created a PodDisruptionBudget of 1000 in the namespace for the jobs with a specific label. Moreover, I have added the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation to the job manifest.
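
A sketch of what such a PodDisruptionBudget could look like; the name, namespace, and label selector are placeholders, and reading the "1000" above as minAvailable is an assumption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-jobs-pdb            # placeholder name
  namespace: gpu-jobs           # placeholder namespace
spec:
  minAvailable: 1000            # assumed interpretation of the "1000" above
  selector:
    matchLabels:
      app: gpu-job              # placeholder label carried by the Job pods
```
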

pythonking6 added the kind/bug label Jul 27, 2024
@adrianmoisey
Member

/area cluster-autoscaler

davejab commented Aug 1, 2024

I think I am experiencing a similar issue with chart version 9.37.0 on EKS 1.28.

The FAQ states:

What types of pods can prevent CA from removing a node?
...

  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

This would imply Jobs should be safe from eviction; however, I can see the cluster autoscaler evicting pods of running Jobs.

Did adding "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" successfully mitigate this for you?
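
For anyone comparing setups: the cluster autoscaler reads this annotation from the Pod itself, so for a Job it needs to end up on the pod template; a placement sketch with placeholder names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01                                               # placeholder
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # must land on the Pod
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest                        # placeholder
```
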

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Oct 30, 2024