[Ray Client] Transfer dashboard_url over gRPC instead of ray.remote #30941
Conversation
LGTM!
Can you add a unit test to verify that connecting a new job won't create IDLE workers until it submits its first task?
Is there an internal API I can use to check this easily? Even with the current behavior the IDLE worker gets cleaned up almost immediately, so it's hard to tell when it happens without listing processes very frequently.
IDLE workers are not killed if the number of them is less than num_cpus. You can probably use this API.
@rkooo567 added a test, and confirmed that it fails without these changes (without the changes, two worker processes spawn before we schedule any tasks).
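The check discussed above can be sketched in plain Python. This is a minimal illustrative helper, not Ray's internal API or the actual test added in this PR; it assumes IDLE workers are identifiable by the `ray::IDLE` process title that Ray sets on workers with no task assigned:

```python
def count_idle_workers(process_cmdlines):
    """Count worker processes whose command line marks them as IDLE.

    Ray renames a worker process to "ray::IDLE" while it has no task
    assigned, so scanning process command lines is one way to spot
    unexpected idle workers. (Illustrative sketch only.)
    """
    return sum(1 for cmd in process_cmdlines if "ray::IDLE" in cmd)


# Example: two IDLE workers among other cluster processes.
procs = ["ray::IDLE", "ray::train_step", "raylet", "ray::IDLE"]
print(count_idle_workers(procs))  # → 2
```

A real test would gather the command lines from the head node (e.g. via `psutil.process_iter`) right after a client connects, and assert the count stays at zero until the first task is submitted.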
This is awesome! Thanks for the quick fix @ckw017!
Lots of dataset tests failed. Is that related?
Seems like the dataset failure is unrelated.
How severe is the issue resolved here for users of Ray Client in Ray 2.2.0?
@rkooo567 is this worth picking into 2.2.0? IMO it's low-risk to add on if we're already planning other cherry-picks anyway, but I also think it can wait until the next release.
I think it is not allowed to cherry-pick non-critical fixes. So let's just do it in the next release.
This is actually really critical for us and is causing some of our core services to fail due to OOMs when memory leaks accumulate in the extra IDLE worker that gets spun up. Given that the change is so important and appears small, is there a way we could get it into 2.2? Or will we have to wait until 2.3?
How many unexpected workers? What is the impact of the bug resolved here? |
…30941) The ray.remote call is spawning worker tasks on the head node even if the client doesn't do anything, creating unexpected workers. Note: dashboard_url behavior is already tested by test_client_builder
Why are these changes needed?
The ray.remote call spawns worker processes on the head node even if the client never submits any work, creating unexpected IDLE workers.
Note: dashboard_url behavior is already tested by test_client_builder.
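To illustrate why the change helps, here is a toy model of the two approaches (all class and method names are hypothetical, not Ray's actual internals): fetching a value through a remote task forces the cluster to start a worker process, while returning the same value in the gRPC connection response requires no worker at all.

```python
class TinyCluster:
    """Toy stand-in for a head node; not Ray's real implementation."""

    def __init__(self, dashboard_url):
        self.dashboard_url = dashboard_url
        self.worker_processes = 0  # workers spawned so far

    def run_remote_task(self, fn):
        # Executing any remote task requires spinning up a worker process.
        self.worker_processes += 1
        return fn(self)

    def connection_response(self):
        # Returning metadata in the gRPC handshake needs no worker.
        return {"dashboard_url": self.dashboard_url}


# Old behavior: fetch dashboard_url via a remote task -> one worker spawned.
old = TinyCluster("127.0.0.1:8265")
url_old = old.run_remote_task(lambda c: c.dashboard_url)
assert old.worker_processes == 1

# New behavior: read it from the connection response -> no extra workers.
new = TinyCluster("127.0.0.1:8265")
url_new = new.connection_response()["dashboard_url"]
assert new.worker_processes == 0
```

The design point is simply that the dashboard URL is static cluster metadata, so piggybacking it on the connection handshake avoids scheduling work just to read it.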
Related issue number
Checks
- I've signed off every commit (with git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.