
[Core] Fix GPU first scheduling that is not working with placement group #19141

Merged
merged 11 commits into ray-project:master on Oct 11, 2021

Conversation

@rkooo567 (Contributor) commented Oct 6, 2021

Why are these changes needed?

This is one possible implementation. I found the root cause, so we can discuss whether my approach makes sense (if there is a better option, we can go with that instead).

cc @sasha-s @ericl @scv119, please share your thoughts.

#19129 disables the GPU scheduling policy due to instability with placement groups.

Current scheduling policy

Hybrid Policy API

When we pick the best node with the hybrid policy and require_available = false, we have an invariant:

  • The function will return a best node as long as any node is "feasible" for the request.

Basically, the function does the following:

  • First, find the best node that is available.
  • If no node is available, fall back to the best node that is merely feasible.
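Roughly, as a Python-style sketch (the actual implementation lives in src/ray/raylet/scheduling/scheduling_policy.cc; the node model and names below are simplified assumptions, not Ray's real data structures):

from typing import Dict, Optional

# Minimal sketch: a node is "available" if the request fits in its free
# resources, and "feasible" if it fits in its total resources, even when
# those resources are currently in use.
def hybrid_policy(request: Dict[str, float],
                  nodes: Dict[str, dict],
                  require_available: bool) -> Optional[str]:
    def fits(capacity: Dict[str, float]) -> bool:
        return all(capacity.get(r, 0.0) >= amt for r, amt in request.items())

    # Pass 1: prefer a node whose *free* resources can hold the request.
    for node_id, node in nodes.items():
        if fits(node["available"]):
            return node_id
    if require_available:
        return None
    # Pass 2: otherwise return any node whose *total* resources could ever
    # hold the request ("feasible"), even if it is currently full.
    for node_id, node in nodes.items():
        if fits(node["total"]):
            return node_id
    return None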

GPU scheduling logic

This is how we choose a node when GPU scheduling is enabled:

  1. Find the best node among the non-GPU nodes.
  2. If no such node is found, schedule on the GPU nodes.
    Ref: https://github.com/ray-project/ray/pull/18615/files#diff-8a271b85b61542daedc5dd81fd8922b15634535871ecac7d37e99b6c80058536R61
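In terms of the hybrid_policy sketch above, the pre-fix selection order looks roughly like this (an illustrative sketch only, not the actual C++ code):

def pick_node_before_fix(request, nodes):
    # Non-GPU nodes are the ones with no GPU in their total capacity.
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    gpu = {nid: n for nid, n in nodes.items() if nid not in non_gpu}
    # Step 1: try non-GPU nodes first. With require_available=False, a merely
    # "feasible" (but fully occupied) non-GPU node still wins here, which is
    # the root of the bug described below.
    best = hybrid_policy(request, non_gpu, require_available=False)
    if best is not None:
        return best
    # Step 2: only if no non-GPU node is even feasible, consider the GPU nodes.
    return hybrid_policy(request, gpu, require_available=False)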

Problem

Imagine this scenario:
2 nodes: a GPU node (3 CPUs) and a non-GPU node (3 CPUs)
A placement group with bundles {1 GPU, 1 CPU} and {1 CPU} * 5

The placement group scheduling hangs because of the following sequence of events.

Suppose the non-GPU node's CPU bundles are all occupied:
1. The non-GPU node has no more placement group resources.
2. The task should now be spilled back to the GPU node.
3. The scheduling policy starts.
4. It picks the best node among the non-GPU nodes first.
5. The policy returns the non-GPU node as the "feasible" best node.
6. The function returns the non-GPU node, which is already fully occupied, because it is feasible.
7. The task hangs forever because the non-GPU node's resources are held by actors that never release them.
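For reference, an illustrative repro along the lines of the scenario above (a sketch only, not the exact regression test added in this PR; it assumes a two-node cluster started elsewhere, the GPU scheduling policy enabled, and the Ray 1.x placement group API):

import ray
from ray.util.placement_group import placement_group

# Assumed cluster: a non-GPU node with 3 CPUs and a GPU node with 3 CPUs
# and 1 GPU, with the GPU scheduling policy turned on.
ray.init(address="auto")

# {1 GPU, 1 CPU} bundle plus five {1 CPU} bundles, as in the scenario above.
pg = placement_group([{"CPU": 1, "GPU": 1}] + [{"CPU": 1}] * 5)
ray.get(pg.ready())


@ray.remote(num_cpus=1)
class Worker:
    def ready(self):
        return True


# Long-lived actors fill the placement group's CPU bundles. Before the fix,
# once the non-GPU node is full, the remaining actors are never spilled back
# to the GPU node, so this ray.get() hangs (here it times out instead).
actors = [Worker.options(placement_group=pg).remote() for _ in range(5)]
ray.get([a.ready.remote() for a in actors], timeout=60)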

Solution

The solution is simple. We first try scheduling on CPU-only nodes with require_available=true, and then do the same for the GPU nodes. This makes sure tasks are spilled back when a node is only "feasible" but not "available".

If neither pass finds a node, we fall back to the original policy, which guarantees liveness.
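In terms of the earlier sketch, the fixed selection order is roughly as follows (illustrative only; the real change is in src/ray/raylet/scheduling/scheduling_policy.cc):

def pick_node_after_fix(request, nodes):
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    gpu = {nid: n for nid, n in nodes.items() if nid not in non_gpu}

    # Stage 1: non-GPU nodes, but only if the request fits in *free* resources.
    best = hybrid_policy(request, non_gpu, require_available=True)
    if best is not None:
        return best
    # Stage 2: GPU nodes, also requiring availability, so fully occupied nodes
    # are skipped and the task can spill back.
    best = hybrid_policy(request, gpu, require_available=True)
    if best is not None:
        return best
    # Stage 3: fall back to the original policy over all nodes ("feasible" is
    # enough), which preserves the liveness guarantee.
    return hybrid_policy(request, nodes, require_available=False)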

I verified it works with the repro.

Related issue number

#19130

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 assigned ericl, scv119 and sasha-s and unassigned ericl, scv119 and sasha-s Oct 6, 2021
@rkooo567 rkooo567 changed the title [Core] Fix GPU first scheduling that is not working with placement group [WIP][Core] Fix GPU first scheduling that is not working with placement group Oct 6, 2021
@sasha-s (Contributor) commented Oct 6, 2021

Looks reasonable to me.
I actually thought about doing it this way in the first place, but it looked like overkill at the time.

@ericl (Contributor) left a comment

The solution makes sense to me. How can we test this?

src/ray/raylet/scheduling/scheduling_policy.cc (outdated review thread, resolved)
@rkooo567 (Contributor, Author) commented Oct 6, 2021

@ericl This PR's regression test should fail without this fix when GPU scheduling is on. #19129

Also, I will write a C++ test as well (it should be easy to write one).

@ericl (Contributor) commented Oct 6, 2021

I see, can we flip on the flag and check the test goes from failing to passing?

@rkooo567 (Contributor, Author) commented Oct 6, 2021

@ericl that was done in this PR! (I verified multiple times.) #19129

@scv119 (Contributor) commented Oct 7, 2021

Looks like a reasonable fix.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 7, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 8, 2021
@rkooo567 rkooo567 changed the title [WIP][Core] Fix GPU first scheduling that is not working with placement group [Core] Fix GPU first scheduling that is not working with placement group Oct 8, 2021
@@ -715,6 +715,7 @@ def _live_node_ids(self):

def _available_resources_per_node(self):
"""Returns a dictionary mapping node id to avaiable resources."""
self._check_connected()
@rkooo567 (Contributor, Author): It was a bug. I can remove it if you'd prefer it not be included here.

@@ -550,8 +550,8 @@ def __init__(self):
def get_location(self):
return ray.worker.global_worker.node.unique_id

@ray.remote
def task_cpu(num_cpus=0.5):
@ray.remote(num_cpus=0.5)
@rkooo567 (Contributor, Author): another bug in tests.

@@ -618,6 +618,60 @@ def g():
time.sleep(1)


def test_gpu_scheduling_liveness(ray_start_cluster):
@rkooo567 (Contributor, Author): I verified this test fails when the change isn't applied.

@sasha-s (Contributor) left a comment

LGTM

@rkooo567 (Contributor, Author) commented Oct 8, 2021

Windows failure: I will skip the test on Windows, since every other test in the scheduling tests has already been disabled there.

}
// Could not schedule on CPU-only nodes, schedule on GPU nodes as a last resort.
best_node_id = HybridPolicyWithFilter(resource_request, local_node_id, nodes,
Contributor:

Can we just remove NodeFilter::kGPU here? So there are only two stages: (1) kCpuOnly, require_avail, then (2) no filters.
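(For illustration only, the suggested two-stage order in terms of the earlier Python sketch; this is a sketch of the suggestion, not the merged C++ code:)

def pick_node_two_stage(request, nodes):
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    # Stage 1: non-GPU nodes, requiring the request to fit in free resources.
    best = hybrid_policy(request, non_gpu, require_available=True)
    if best is not None:
        return best
    # Stage 2: no filter at all -- the original hybrid policy over every node.
    return hybrid_policy(request, nodes, require_available=False)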

@rkooo567 (Contributor, Author): Ah, yeah, good point. Will make a fix.

@rkooo567 (Contributor, Author): Btw, renamed CpuOnly to NonGpu because CpuOnly is misleading.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 8, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2021
@rkooo567 rkooo567 merged commit 3b865b4 into ray-project:master Oct 11, 2021