[Core] Fix GPU first scheduling that is not working with placement group #19141
Conversation
Looks reasonable to me.
The solution makes sense to me. How can we test this?
I see, can we flip on the flag and check that the test goes from failing to passing?
Looks like a reasonable fix.
@@ -715,6 +715,7 @@ def _live_node_ids(self):

    def _available_resources_per_node(self):
        """Returns a dictionary mapping node id to avaiable resources."""
        self._check_connected()
It was a bug. I can remove it if you don't want it to be included here.
@@ -550,8 +550,8 @@ def __init__(self):
    def get_location(self):
        return ray.worker.global_worker.node.unique_id

-@ray.remote
-def task_cpu(num_cpus=0.5):
+@ray.remote(num_cpus=0.5)
Another bug in the tests.
@@ -618,6 +618,60 @@ def g():
        time.sleep(1)


def test_gpu_scheduling_liveness(ray_start_cluster):
I verified this test fails when the change is not applied.
LGTM
Windows failure: I will skip the test on Windows, as every other test in test_scheduling has been disabled there.
  }
  // Could not schedule on CPU-only nodes, schedule on GPU nodes as a last resort.
  best_node_id = HybridPolicyWithFilter(resource_request, local_node_id, nodes,
Can we just remove NodeFilter::kGPU here? So there are only two stages: (1) kCpuOnly, require_avail, then (2) no filters.
Ah yeah good point. Will make a fix
Btw, renamed CpuOnly to NonGpu because CpuOnly is misleading.
Why are these changes needed?
This is one implementation suggestion. I discovered the root cause, so we can discuss whether my implementation makes sense (if there is a better option, we can choose that instead).
cc @sasha-s @ericl @scv119 please share your thoughts.
#19129 disables the GPU-first scheduling policy due to instability with placement groups.
Current scheduling policy
Hybrid Policy API
When we pick the best node with the hybrid policy with require_available = false, we have an invariant: as long as some node is feasible for the request, the policy returns a node, even if that node has no resources available right now.
Basically, the function behaves like this:
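The following is a minimal illustrative Python model of that behavior, not the real C++ API; the `hybrid_policy` name and the dict-based node representation are made up for this sketch:

```python
def hybrid_policy(resource_request, nodes, require_available):
    """Pick a node for a resource request (illustrative model, not Ray's C++ code).

    nodes: {node_id: {"available": {...}, "total": {...}}}
    A request is *feasible* on a node if it fits the node's total resources,
    and *available* if it also fits what is free right now.
    """
    def fits(request, resources):
        return all(resources.get(k, 0) >= v for k, v in request.items())

    available = [n for n, r in nodes.items() if fits(resource_request, r["available"])]
    feasible = [n for n, r in nodes.items() if fits(resource_request, r["total"])]

    if available:
        return sorted(available)[0]   # best node with resources available now
    if not require_available and feasible:
        return sorted(feasible)[0]    # feasible but busy: the task queues there
    return None                       # no node found; caller must decide what to do
```

With require_available=False this sketch never returns None as long as some node is feasible, which is the invariant described above.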
GPU scheduling Logic
This is how we choose a node when GPU-first scheduling is enabled.
Ref: https://github.com/ray-project/ray/pull/18615/files#diff-8a271b85b61542daedc5dd81fd8922b15634535871ecac7d37e99b6c80058536R61
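Under the same illustrative model as above (again, not the real C++ HybridPolicyWithFilter code), the pre-fix GPU-first logic behaves roughly like this:

```python
def gpu_first_schedule_buggy(resource_request, nodes):
    """Illustrative model of the pre-fix behavior: prefer non-GPU nodes,
    without requiring that resources be available right now."""
    non_gpu = {n: r for n, r in nodes.items() if r["total"].get("GPU", 0) == 0}
    gpu_nodes = {n: r for n, r in nodes.items() if r["total"].get("GPU", 0) > 0}

    # Stage 1: non-GPU nodes with require_available=False, so a feasible
    # but fully-busy non-GPU node is still returned here.
    node = hybrid_policy(resource_request, non_gpu, require_available=False)
    if node is not None:
        return node
    # Stage 2: GPU nodes as a last resort.
    return hybrid_policy(resource_request, gpu_nodes, require_available=False)
```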
Problem
Imagine this scenario:
- 2 nodes: a GPU node (3 CPUs) and a non-GPU node (3 CPUs)
- A placement group with bundles {1 GPU, 1 CPU} and {1 CPU} * 5

The placement group scheduling hangs because the {1 CPU} bundles keep getting assigned to the non-GPU node (it remains feasible even after its 3 CPUs are used up), so they never spill over to the GPU node, and the group can never be fully placed. A repro sketch follows.
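A hedged repro sketch of this scenario (it assumes one GPU on the GPU node, the PACK strategy, and a local test cluster via ray.cluster_utils.Cluster; none of these details are spelled out above):

```python
import ray
from ray.cluster_utils import Cluster
from ray.util.placement_group import placement_group

# Two nodes: a GPU node with 3 CPUs (assumed 1 GPU) and a non-GPU node with 3 CPUs.
cluster = Cluster()
cluster.add_node(num_cpus=3, num_gpus=1)
cluster.add_node(num_cpus=3)
ray.init(address=cluster.address)

# {1 GPU, 1 CPU} plus {1 CPU} * 5: fully placing this group needs both nodes.
bundles = [{"GPU": 1, "CPU": 1}] + [{"CPU": 1}] * 5
pg = placement_group(bundles, strategy="PACK")

# With the buggy GPU-first policy this wait can hang, because the CPU-only
# bundles never spill over to the GPU node.
ray.get(pg.ready(), timeout=60)
print("placement group placed")
```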
Solution
The solution is simple. We try scheduling on CPU-only (non-GPU) nodes first, but with require_available=true, and then do the same for GPU nodes. This makes sure tasks can spill back when a node is only "feasible" and not "available".
If neither pass finds a node, we fall back to the original policy, which guarantees liveness. A sketch of the fixed policy follows.
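Continuing the illustrative Python model from above (not the real C++ change):

```python
def gpu_first_schedule_fixed(resource_request, nodes):
    """Illustrative model of the fix: require availability in the first two
    passes, then fall back to the original unfiltered policy."""
    non_gpu = {n: r for n, r in nodes.items() if r["total"].get("GPU", 0) == 0}
    gpu_nodes = {n: r for n, r in nodes.items() if r["total"].get("GPU", 0) > 0}

    # Pass 1: non-GPU nodes, only if resources are available right now.
    node = hybrid_policy(resource_request, non_gpu, require_available=True)
    if node is not None:
        return node
    # Pass 2: GPU nodes, still requiring availability.
    node = hybrid_policy(resource_request, gpu_nodes, require_available=True)
    if node is not None:
        return node
    # Pass 3: original policy over all nodes with require_available=False;
    # it returns any feasible node, preserving the liveness guarantee.
    return hybrid_policy(resource_request, nodes, require_available=False)
```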
I verified it works with the repro.
Related issue number
#19130
Checks
I've run scripts/format.sh to lint the changes in this PR.