MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true? #1617
training-operator version (v1.4.0): public.ecr.aws/j1r0q0g6/training/training-operator:174e8813666951ded505daf334a37f60fd50c18d
A rough translation: the MPIJob shows its status as 'Running' even though not all of its Pods are running. It seems two issues occur at once:
I will try to reproduce the issue and fix it.
@zw0610 Yes, there are indeed two issues.
Adding some logs: MPIJob, Pod, MPIJob status, and PodGroup status.
/assign @hackerboy01
A question: with gang scheduling enabled, the MPIJob shows a Running status even though there are not enough resources. Is this a bug?
== MPIJOB
NAME            AGE   STATE
hvd-tf1-mnist   16m   Running
=== POD
NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE          NOMINATED NODE   READINESS GATES
hvd-tf1-mnist-launcher   0/1     Pending   0          14m
hvd-tf1-mnist-worker-0   1/1     Running   0          14m   10.42.1.172   openpai-212
hvd-tf1-mnist-worker-1   0/1     Pending   0          14m
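The mismatch in the listing above can be made explicit with a small script. This is only a minimal sketch, not the operator's logic: `count_pod_phases` and `gang_is_satisfied` are hypothetical helper names, the listing is reduced to its first three columns, and `min_available=3` is assumed from the PodGroup message quoted in this report.

```python
from collections import Counter

# Pod listing copied from the report (NAME, READY, STATUS columns only).
POD_LISTING = """\
hvd-tf1-mnist-launcher 0/1 Pending
hvd-tf1-mnist-worker-0 1/1 Running
hvd-tf1-mnist-worker-1 0/1 Pending
"""


def count_pod_phases(listing: str) -> Counter:
    """Count pods by STATUS, the third whitespace-separated column."""
    phases = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            phases[fields[2]] += 1
    return phases


def gang_is_satisfied(phases: Counter, min_available: int) -> bool:
    """Hypothetical check: the gang is satisfied only if enough pods run."""
    return phases["Running"] >= min_available


phases = count_pod_phases(POD_LISTING)
print(dict(phases))
# min_available=3 is an assumption taken from the PodGroup message.
print(gang_is_satisfied(phases, min_available=3))
```

With only one of the three pods running, the gang condition is not met, which is why a job-level Running state looks wrong here.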
==== PodGroup Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:21:33Z"
    message: '2/1 tasks in gang unschedulable: pod group is not ready, 1 Running,
      3 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: cd024380-e518-43f0-9c44-3664ebb10429
    type: Unschedulable
  phase: Unknown
  running: 1
==== MPIJOB Status
status:
  conditions:
  - lastTransitionTime: "2022-06-17T10:06:10Z"
    lastUpdateTime: "2022-06-17T10:06:10Z"
    message: MPIJob aios/hvd-tf1-mnist is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-06-17T10:06:11Z"
    lastUpdateTime: "2022-06-17T10:06:11Z"
    message: MPIJob hvd-tf1-mnist is running.
    reason: JobRunning
    status: "True"
    type: Running
  replicaStatuses:
    Launcher: {}
    Worker:
      active: 1
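To show why the two status blocks contradict each other, here is a minimal sketch that compares them. It is not the training-operator's implementation: `condition_is_true` and `find_inconsistencies` are hypothetical helpers, the dictionaries are trimmed copies of the statuses in this report, and `min_available=3` is assumed from the PodGroup message.

```python
def condition_is_true(status: dict, cond_type: str) -> bool:
    """Return True if a condition of the given type has status "True"."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in status.get("conditions", [])
    )


def find_inconsistencies(job_status: dict, pod_group_status: dict,
                         min_available: int) -> list:
    """Hypothetical check for a Running job whose gang is not actually ready."""
    problems = []
    if not condition_is_true(job_status, "Running"):
        return problems
    if condition_is_true(pod_group_status, "Unschedulable"):
        problems.append("job is Running but its PodGroup is Unschedulable")
    # Sum the "active" counters over all replica types (Launcher, Worker).
    active = sum(r.get("active", 0)
                 for r in job_status.get("replicaStatuses", {}).values())
    if active < min_available:
        problems.append(
            f"job is Running with {active} active replica(s), "
            f"fewer than minAvailable={min_available}"
        )
    return problems


# Trimmed copies of the statuses reported in this issue.
mpijob_status = {
    "conditions": [
        {"type": "Created", "status": "True", "reason": "MPIJobCreated"},
        {"type": "Running", "status": "True", "reason": "JobRunning"},
    ],
    "replicaStatuses": {"Launcher": {}, "Worker": {"active": 1}},
}
pod_group_status = {
    "conditions": [
        {"type": "Unschedulable", "status": "True",
         "reason": "NotEnoughResources"},
    ],
    "phase": "Unknown",
    "running": 1,
}

for problem in find_inconsistencies(mpijob_status, pod_group_status, 3):
    print(problem)
```

Both checks fire for the reported statuses: the job carries a true Running condition while its PodGroup is Unschedulable and only one replica is active.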