
[Ray Client] Transfer dashboard_url over gRPC instead of ray.remote #30941

Merged: 4 commits merged into ray-project:master on Dec 8, 2022

Conversation

@ckw017 (Member) commented Dec 7, 2022

Why are these changes needed?

The ray.remote call used to fetch the dashboard_url creates worker tasks on the head node even if the client doesn't do anything else, spawning unexpected workers.

Note: dashboard_url behavior is already tested by test_client_builder
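
For context, a minimal before/after sketch of what the title describes; the symbols below (get_dashboard_url, cluster_info.dashboard_url) are illustrative assumptions, not necessarily the exact code touched by this PR:

```python
import ray

# Before (sketch): the client fetched the dashboard URL by scheduling a task.
# Scheduling any task, even a trivial one, starts a worker process on the
# head node, which is where the unexpected IDLE workers came from.
def dashboard_url_via_task() -> str:
    # ray._private.worker.get_dashboard_url is an assumed internal accessor.
    return ray.get(ray.remote(ray._private.worker.get_dashboard_url).remote())

# After (sketch): the Ray Client server already knows the URL, so it returns
# it as a field of an existing gRPC response and no task is ever scheduled.
def dashboard_url_via_grpc(cluster_info) -> str:
    # `cluster_info` stands in for the server's gRPC reply; field name assumed.
    return cluster_info.dashboard_url
```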

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ckw017 ckw017 requested review from rkooo567 and removed request for ericl December 7, 2022 19:01
@ijrsvt (Contributor) left a comment

LGTM!

@rkooo567 (Contributor) left a comment

Can you add a unit test to verify that connecting a new job won't create IDLE workers until it submits its first task?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 7, 2022
@ckw017 (Member, Author) commented Dec 7, 2022

Can you add a unit test to verify that connecting a new job won't create IDLE workers until it submits its first task?

Is there an internal API I can use to check this easily? Even with the current behavior the IDLE worker gets cleaned up almost immediately, so it's hard to tell when it happens without repeatedly listing processes.

@rkooo567 (Contributor) commented Dec 7, 2022

IDLE workers are not killed as long as their count is less than num_cpus.

You can probably use the ray.experimental.state.api.list_workers API. Basically we can (see the sketch after this list):

  1. connect
  2. wait for some time
  3. make sure no workers are started
  4. submit a task
  5. make sure the # of idle workers is the same as num_cpus
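
A rough sketch of that test shape, assuming list_workers returns records with a worker_type field and that a pytest fixture such as call_ray_start from Ray's test suite brings up a cluster with a client server; the actual test added in this PR may differ:

```python
import time

import ray
from ray.experimental.state.api import list_workers


def num_plain_workers():
    # Count ordinary worker processes only, excluding the driver.
    # ("worker_type" distinguishing WORKER from DRIVER entries is an assumption;
    # list_workers may also need an explicit address= for the cluster's API server.)
    return len([w for w in list_workers() if w.get("worker_type") == "WORKER"])


def test_client_connect_spawns_no_idle_workers(call_ray_start):
    # 1. connect via Ray Client (port assumed to match the fixture's client server)
    ray.init("ray://localhost:25001")

    # 2. wait for some time
    time.sleep(5)

    # 3. make sure no workers are started by the connection alone
    assert num_plain_workers() == 0

    # 4. submit a task
    @ray.remote
    def f():
        return 1

    assert ray.get(f.remote()) == 1

    # 5. idle workers should now exist; step 5 above expects the count to settle
    #    at num_cpus, and <= keeps the sketch tolerant of timing.
    assert 0 < num_plain_workers() <= int(ray.cluster_resources()["CPU"])
```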

@ckw017 (Member, Author) commented Dec 8, 2022

@rkooo567 added a test, and confirmed that it fails without these changes (without them, two worker processes spawn before we schedule any tasks).

@rkooo567 (Contributor) left a comment

This is awesome! Thanks for the quick fix @ckw017!

@rkooo567 (Contributor) commented Dec 8, 2022

Lots of dataset tests failed. Is it related?

@rkooo567 (Contributor) commented Dec 8, 2022

Seems like the dataset failures are unrelated.

@rkooo567 rkooo567 merged commit 5f5dd14 into ray-project:master Dec 8, 2022
@DmitriGekhtman (Contributor) commented Dec 8, 2022

How severe is the issue resolved here for users of Ray Client in Ray 2.2.0?

@ckw017 (Member, Author) commented Dec 8, 2022

How severe is the issue resolved here for users of Ray Client in Ray 2.2.0?

@rkooo567 is this worth picking into 2.2.0? IMO it's low risk to add on if we're already planning other cherry-picks anyway, but I also think it can wait until the next release.

@rkooo567 (Contributor) commented Dec 9, 2022

I think we're not allowed to cherry-pick non-critical fixes, so let's just do it in the next release.

@will-hang commented

This is actually really critical for us and is causing some of our core services to fail due to OOMs in the Ray head.

When memory leaks accumulate in the Ray head that seem related to these issues, another IDLE worker spun up in the Ray head upon a Client connection will cause the pod to topple over, killing our service. We're looking forward to upgrading to 2.2 to hopefully resolve this.

Given that the change is so important and appears small, is there a way we could get it in 2.2? Or will we have to wait until 2.3?

@DmitriGekhtman (Contributor) commented Dec 13, 2022

spawning unexpected workers

How many unexpected workers? What is the impact of the bug resolved here?

rkooo567 pushed a commit that referenced this pull request Dec 14, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023