[Feature] Reduce race condition between sequential job submission #592

Closed
1 of 2 tasks
Jeffwan opened this issue Sep 27, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@Jeffwan
Collaborator

Jeffwan commented Sep 27, 2022

Search before asking

  • I have searched the issues and found no similar feature request.

Description

When we submit jobs sequentially to an existing cluster, job1 might not be fully deleted from the cluster by the time job2 is created.

  1. T1 - job1 CR is submitted
  2. T2 - job1 CR is deleted
  3. T3 - job2 CR is created

As a result, two jobs may be running in the cluster at the same time. As a user, I want job2 to be submitted only after job1 has fully terminated.

/cc @Basasuya

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
Jeffwan added the enhancement label Sep 27, 2022
@asm582
Contributor

asm582 commented Sep 28, 2022

I think this PR could help with queuing and gang dispatching: #598

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Sep 28, 2022

This is more a Ray issue than a KubeRay issue, but we can definitely discuss here.

If I understand correctly, the concern is Ray-internal: it's hard to tell whether the first Ray job is completely done before sending the second one to the same cluster.
@architkulkarni are you the main owner for the Ray job API? Do you have thoughts on how to guarantee clean job termination?

@architkulkarni
Contributor

@DmitriGekhtman Yup, that's me. Ray Jobs supports running jobs concurrently, and internally there's no notion of waiting for one job to finish before scheduling the next. To do this with the Ray Jobs SDK, you'd need to check the status in a loop until the first job reaches a terminal status, as in the code sample here: https://docs.ray.io/en/latest/cluster/running-applications/job-submission/sdk.html#submitting-a-ray-job. If you use the Ray Jobs CLI, ray job submit by default blocks the terminal and prints logs until the job reaches a terminal state.
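
For illustration, the polling approach described above might look roughly like the sketch below. It is not the code sample from the Ray docs. The helper names (wait_until_terminal, submit_sequentially), the timeout, and the poll interval are all hypothetical. The client is assumed to expose submit_job(entrypoint=...) and get_job_status(job_id) as the Ray Jobs SDK's JobSubmissionClient does, and the terminal states are assumed to be SUCCEEDED, FAILED, and STOPPED. Note that this only observes the status Ray reports; it does not check whether a job's CR has been fully deleted from the Kubernetes cluster.

```python
import time

# Assumed terminal states of a Ray job (SUCCEEDED, FAILED, STOPPED).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_until_terminal(client, job_id, timeout_s=600.0, poll_interval_s=1.0):
    """Poll client.get_job_status(job_id) until the job reaches a terminal state.

    Hypothetical helper: `client` is anything with a get_job_status method,
    for example ray.job_submission.JobSubmissionClient.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # The SDK returns a str-based enum; normalize it to a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s}s")
        time.sleep(poll_interval_s)


def submit_sequentially(client, entrypoints, **wait_kwargs):
    """Submit each entrypoint only after the previous job is terminal."""
    results = []
    for entrypoint in entrypoints:
        job_id = client.submit_job(entrypoint=entrypoint)
        final_status = wait_until_terminal(client, job_id, **wait_kwargs)
        results.append((job_id, final_status))
    return results
```

With a real cluster, the client could be created with something like JobSubmissionClient("http://127.0.0.1:8265") from ray.job_submission (the address is an assumption about where the Ray dashboard is reachable) and passed to submit_sequentially together with a list of entrypoint commands.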

@DmitriGekhtman
Collaborator

@Jeffwan do you think polling for job completion is enough to enable sequential job submission?

@kevin85421
Member

This seems to be a Ray issue rather than a KubeRay issue. Closing this issue.
