Autoscaler removing kubernetes Job Pods leading to JobBackOff #7095

Open
pythonking6 opened this issue Jul 27, 2024 · 3 comments
Labels
area/cluster-autoscaler, kind/bug, lifecycle/stale

Comments

pythonking6 commented Jul 27, 2024

Which component are you using?:

cluster-autoscaler, running on AWS EKS

What version of the component are you using?:
https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.21.0

[ec2-user@ip-xx-xx-xx-xx ~]$ helm list -n kube-system
NAME                             NAMESPACE    CHART                                   APP VERSION
cluster-autoscaler               kube-system  cluster-autoscaler-9.21.0               1.23.0
cluster-proportional-autoscaler  kube-system  cluster-proportional-autoscaler-1.0.1   1.8.6

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
v1.25.0

What environment is this in?:

EKS running in AWS. The deployed cluster is using Kubernetes version 1.28.

What did you expect to happen?:

I have 40 Kubernetes Jobs that are scheduled simultaneously. If I manually scale up to 40 GPUs and disable downscaling, I have no issues: all 40 jobs run to completion. However, when I let the autoscaler scale up based on the nvidia.com/gpu: 1 request in the job manifest, two things happen:

  1. The autoscaler scales up twice as many GPUs as needed (so 80 instead of 40).
  2. The autoscaler then realizes that’s too many GPUs and starts to downscale after coolDownPeriod.

I expect the autoscaler to allocate 40 GPUs. I also expect the autoscaler to leave long-running pods untouched until they complete.
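
For concreteness, a minimal sketch of the kind of Job manifest described here; the name, label, and image are placeholders, not taken from the actual cluster:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01                      # placeholder; 40 Jobs like this are created
spec:
  backoffLimit: 0                       # deliberately zero, as noted below
  template:
    metadata:
      labels:
        app: gpu-job                    # placeholder label
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1         # one GPU per Job, so 40 Jobs should need 40 GPUs
```
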
What happened instead?:

Some of the pods get a SIGTERM signal and terminate. This leads to the Job's backoffLimit (which I deliberately set to zero) being reached. Moreover, I have configured the autoscaler with a utilization threshold of zero via the flag --scale-down-utilization-threshold=0, so that even if a pod isn’t using any of its GPU, the node under it should not be destroyed until the job is completed.
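
For reference, with the cluster-autoscaler Helm chart this flag is typically passed through extraArgs in the values file; a sketch under that assumption (not the reporter's actual values file):

```yaml
# values.yaml fragment for the cluster-autoscaler chart (sketch, assumed extraArgs usage)
extraArgs:
  scale-down-utilization-threshold: 0   # rendered as --scale-down-utilization-threshold=0
  # cloud provider, node group auto-discovery and other settings omitted
```
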
How to reproduce it (as minimally and precisely as possible):

Run 40 Kubernetes Jobs in the same namespace and let the autoscaler scale up and down as it sees fit.
Anything else we need to know?:

If I freeze the number of GPUs at 40 and let the jobs run to completion, there are no issues. I have created a PodDisruptionBudget of 1000 in the namespace for the jobs with a specific label. Moreover, I have added the "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation to the job manifest.
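
A sketch of what such a PodDisruptionBudget could look like; the name, namespace, and label selector are placeholders, and reading the "1000" above as minAvailable is an assumption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gpu-jobs-pdb            # placeholder name
  namespace: gpu-jobs           # placeholder namespace
spec:
  minAvailable: 1000            # assumed interpretation of the "1000" above
  selector:
    matchLabels:
      app: gpu-job              # placeholder label carried by the Job pods
```
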

pythonking6 added the kind/bug label Jul 27, 2024
@adrianmoisey
Member

/area cluster-autoscaler

davejab commented Aug 1, 2024

I think I am experiencing a similar issue with chart version 9.37.0 on EKS 1.28.

The FAQ states:

What types of pods can prevent CA from removing a node?
...

  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

This would imply Jobs should be safe from eviction; however, I can see the cluster autoscaler evicting pods of running Jobs.

Did adding "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" successfully mitigate this for you?
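
For anyone comparing setups: the cluster autoscaler reads this annotation from the Pod itself, so for a Job it needs to end up on the pod template; a placement sketch with placeholder names:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-01                                               # placeholder
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # must land on the Pod
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: my-training-image:latest                        # placeholder
```
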

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Oct 30, 2024