Autoscaler removing Kubernetes Job Pods leading to JobBackOff #7095
Labels: area/cluster-autoscaler, kind/bug, lifecycle/stale
Which component are you using?:
cluster-autoscaler, running on AWS EKS
What version of the component are you using?:
Component version: cluster-autoscaler Helm chart 9.21.0 (app version 1.23.0), from https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

```
[ec2-user@ip-xx-xx-xx-xx ~]$ helm list -n kube-system
NAME                             NAMESPACE    CHART                                   APP VERSION
cluster-autoscaler               kube-system  cluster-autoscaler-9.21.0               1.23.0
cluster-proportional-autoscaler  kube-system  cluster-proportional-autoscaler-1.0.1   1.8.6
```
What k8s version are you using (kubectl version)?:
Kubernetes 1.28.
What environment is this in?:
EKS running in AWS. The deployed cluster is using Kubernetes version 1.28.
What did you expect to happen?:
I have 40 Kubernetes Jobs that are scheduled simultaneously. If I manually scale up to 40 GPUs and disable scale-down, I have no issues: all 40 Jobs run to completion. However, things go wrong when I let the autoscaler scale up based on the nvidia.com/gpu: 1 request in the Job manifest. I expect the autoscaler to allocate 40 GPUs, and I also expect it to leave long-running Pods untouched until they complete.
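For reference, a minimal sketch of what one of these Jobs looks like; the names, labels, and image below are placeholders rather than the actual manifests, and only the pieces relevant to this report are shown:

```yaml
# Sketch of one of the 40 GPU Jobs (names, labels, and image are placeholders).
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01
spec:
  backoffLimit: 0                     # deliberately zero: a single evicted Pod fails the Job
  template:
    metadata:
      labels:
        app: gpu-batch                # label also used by the PodDisruptionBudget mentioned below
      annotations:
        # the autoscaler reads this annotation from the Pod, so it sits on the Pod template
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1             # the request the autoscaler scales up on
```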
What happened instead?:
Some of the Pods receive a SIGTERM and terminate, which causes the Job's backoffLimit (which I deliberately set to zero) to be reached. Moreover, I have configured the autoscaler with a utilization threshold of zero via the flag --scale-down-utilization-threshold=0, so that even if a Pod isn't using any of its GPU, the node underneath it should not be destroyed until the Job is completed.
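That threshold is passed to the autoscaler through the Helm chart; assuming the chart's extraArgs values are used, the relevant values.yaml fragment would look roughly like this (a sketch, not the exact values file from this cluster):

```yaml
# values.yaml fragment for the cluster-autoscaler Helm chart (sketch).
# Keys under extraArgs become container flags, without the leading "--".
extraArgs:
  scale-down-utilization-threshold: 0
```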
How to reproduce it (as minimally and precisely as possible):
Run 40 Kubernetes Jobs in the same namespace and let the autoscaler scale up and down as it sees fit.
Anything else we need to know?:
If I freeze the number of GPUs at 40 and let the Jobs run to completion, there are no issues. I have created a PodDisruptionBudget of 1000 in the namespace for the Jobs with a specific label. Moreover, I have added the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation to the Job manifest.