-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: MPIJob worker still running when NotEnoughResources #1621
Conversation
Pull Request Test Coverage Report for Build 2569596073
💛 - Coveralls |
522923e
to
e7c866a
Compare
/assign @zw0610 Thanks for your contribution! 🎉 👍 |
LGTM |
I am not sure if it breaks other cases. For example, there is a launcher that succeeded, but one of the workers is running. What state will be updated after the PR? |
e7c866a
to
769656e
Compare
The state of Mpijob is very special and depends only on the state of the launcher. If there is a launcher that Succeeded, the state of Mpijob must be Succeeded. Only when all the workers of MPIJob are ready(running),the launcher of MPIJob can change from Created to Running. |
/retest |
769656e
to
df46ee8
Compare
df46ee8
to
8ffd13f
Compare
@zw0610 Do you plan to merge this in the current release? If yes, can you lgtm. Related: #1622 |
Great! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hackerboy01, terrytangyuan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Before migrate v1 MPI operator to training-operator,only when all the workers of MPIJob are ready and the launcher of MPIJob is running,the state of MPIJob can change from Created to Running. So It is more reasonable to keep the STATE of MPIJob as created when NotEnoughResources.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #1617
Checklist: