fix: MPIJob worker still running when NotEnoughResources #1621

Merged on Jun 28, 2022 (1 commit)

Conversation

hackerboy01

What this PR does / why we need it:
Before the v1 MPI operator was migrated to training-operator, the state of an MPIJob could change from Created to Running only when all of its workers were ready and its launcher was running. So it is more reasonable to keep the state of the MPIJob as Created when NotEnoughResources occurs.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #1617

Checklist:

  • Docs included if any changes are user facing

@coveralls

coveralls commented Jun 26, 2022

Pull Request Test Coverage Report for Build 2569596073

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.08%) to 39.839%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob_controller.go | 2 | 77.65% |
| pkg/controller.v1/mpi/mpijob.go | 2 | 92.98% |

Totals:
Change from base Build 2564570860: -0.08%
Covered Lines: 2323
Relevant Lines: 5831

💛 - Coveralls

@gaocegege

/assign @zw0610

Thanks for your contribution! 🎉 👍

@zw0610

zw0610 commented Jun 27, 2022

LGTM
@gaocegege @Jeffwan Could you double-check the Job Condition logic here? It's kind of mind-twisting.

@gaocegege

I am not sure if it breaks other cases. For example, there is a launcher that succeeded, but one of the workers is running.

What state will be updated after the PR?

@hackerboy01

> I am not sure if it breaks other cases. For example, there is a launcher that succeeded, but one of the workers is running.
>
> What state will be updated after the PR?

The state of an MPIJob is special: it depends only on the state of the launcher. If the launcher has Succeeded, the state of the MPIJob must be Succeeded. The launcher of the MPIJob can change from Created to Running only when all the workers of the MPIJob are ready (running).
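
To make the rule above concrete, here is a minimal sketch of it in Go. It is not the training-operator controller code: the names `LauncherPhase`, `JobState`, and `jobState` are hypothetical, and only the states mentioned in this conversation (Created, Running, Succeeded) are modeled.

```go
package main

import "fmt"

// LauncherPhase is a hypothetical, simplified view of the launcher's phase.
type LauncherPhase string

// JobState is a hypothetical, simplified view of the MPIJob state.
type JobState string

const (
	LauncherPending   LauncherPhase = "Pending"
	LauncherRunning   LauncherPhase = "Running"
	LauncherSucceeded LauncherPhase = "Succeeded"

	Created   JobState = "Created"
	Running   JobState = "Running"
	Succeeded JobState = "Succeeded"
)

// jobState derives the MPIJob state from the launcher phase.
// Workers only gate the Created -> Running transition; a succeeded
// launcher means a succeeded job regardless of worker status.
func jobState(launcher LauncherPhase, readyWorkers, wantWorkers int) JobState {
	switch launcher {
	case LauncherSucceeded:
		return Succeeded
	case LauncherRunning:
		if readyWorkers == wantWorkers {
			return Running
		}
	}
	// Covers NotEnoughResources: some workers are not ready,
	// so the job stays in Created.
	return Created
}

func main() {
	// Launcher succeeded while one worker is still running.
	fmt.Println(jobState(LauncherSucceeded, 3, 4))
	// Not enough resources: only 1 of 4 workers is ready.
	fmt.Println(jobState(LauncherRunning, 1, 4))
	// All workers ready and launcher running.
	fmt.Println(jobState(LauncherRunning, 4, 4))
}
```

Under these assumptions, the case asked about (launcher succeeded, one worker still running) resolves to Succeeded, and the NotEnoughResources case stays in Created.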

@zw0610

zw0610 commented Jun 27, 2022

/retest

@johnugeorge

johnugeorge commented Jun 28, 2022

@zw0610 Do you plan to merge this in the current release? If yes, can you lgtm?
/cc @terrytangyuan

Related: #1622

@zw0610

zw0610 commented Jun 28, 2022

Great!
/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Jun 28, 2022

@terrytangyuan terrytangyuan left a comment

/lgtm

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hackerboy01, terrytangyuan


@google-oss-prow google-oss-prow bot merged commit 931eae1 into kubeflow:master Jun 28, 2022
Successfully merging this pull request may close these issues.

MPIJob worker still running when NotEnoughResources with enable-gang-scheduling==true?