
[Core] Fix GPU first scheduling that is not working with placement group #19141

Merged
merged 11 commits into ray-project:master on Oct 11, 2021

Conversation

@rkooo567 (Contributor) commented Oct 6, 2021

Why are these changes needed?

This is one possible implementation. I found the root cause, so we can discuss whether my approach makes sense (if there is a better option, we can go with that instead).

cc @sasha-s @ericl @scv119, please share your thoughts.

#19129 disables the GPU scheduling policy due to instability with placement groups.

Current scheduling policy

Hybrid Policy API

When we pick the best node with the hybrid policy and require_available = false, we have an invariant:

  • The function will return a best node as long as any node is "feasible" for the request.

Basically, the function does the following:

  • First, find the best node that is available.
  • If no node is available, fall back to the best node that is merely feasible.
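Roughly, as a Python-style sketch (the actual implementation lives in src/ray/raylet/scheduling/scheduling_policy.cc; the node model and names below are simplified assumptions, not Ray's real data structures):

from typing import Dict, Optional

# Minimal sketch: a node is "available" if the request fits in its free
# resources, and "feasible" if it fits in its total resources, even when
# those resources are currently in use.
def hybrid_policy(request: Dict[str, float],
                  nodes: Dict[str, dict],
                  require_available: bool) -> Optional[str]:
    def fits(capacity: Dict[str, float]) -> bool:
        return all(capacity.get(r, 0.0) >= amt for r, amt in request.items())

    # Pass 1: prefer a node whose *free* resources can hold the request.
    for node_id, node in nodes.items():
        if fits(node["available"]):
            return node_id
    if require_available:
        return None
    # Pass 2: otherwise return any node whose *total* resources could ever
    # hold the request ("feasible"), even if it is currently full.
    for node_id, node in nodes.items():
        if fits(node["total"]):
            return node_id
    return None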

GPU scheduling logic

This is how we choose a node when GPU scheduling is enabled:

  1. Find the best node among the non-GPU nodes.
  2. If no such node is found, schedule on the GPU nodes.
    Ref: https://github.com/ray-project/ray/pull/18615/files#diff-8a271b85b61542daedc5dd81fd8922b15634535871ecac7d37e99b6c80058536R61
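In terms of the hybrid_policy sketch above, the pre-fix selection order looks roughly like this (an illustrative sketch only, not the actual C++ code):

def pick_node_before_fix(request, nodes):
    # Non-GPU nodes are the ones with no GPU in their total capacity.
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    gpu = {nid: n for nid, n in nodes.items() if nid not in non_gpu}
    # Step 1: try non-GPU nodes first. With require_available=False, a merely
    # "feasible" (but fully occupied) non-GPU node still wins here, which is
    # the root of the bug described below.
    best = hybrid_policy(request, non_gpu, require_available=False)
    if best is not None:
        return best
    # Step 2: only if no non-GPU node is even feasible, consider the GPU nodes.
    return hybrid_policy(request, gpu, require_available=False)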

Problem

Imagine this scenario:
2 nodes: a GPU node (3 CPUs) and a non-GPU node (3 CPUs)
A placement group with bundles {1 GPU, 1 CPU} and {1 CPU} * 5

The placement group scheduling hangs because of the following sequence of events.

Suppose the non-GPU node's CPU bundles are all occupied:
1. The non-GPU node has no more placement group resources.
2. The task should now be spilled back to the GPU node.
3. The scheduling policy starts.
4. It picks the best node among the non-GPU nodes first.
5. The policy returns the non-GPU node as the "feasible" best node.
6. The function returns the non-GPU node, which is already fully occupied, because it is feasible.
7. The task hangs forever because the non-GPU node's resources are held by actors that never release them.
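For reference, an illustrative repro along the lines of the scenario above (a sketch only, not the exact regression test added in this PR; it assumes a two-node cluster started elsewhere, the GPU scheduling policy enabled, and the Ray 1.x placement group API):

import ray
from ray.util.placement_group import placement_group

# Assumed cluster: a non-GPU node with 3 CPUs and a GPU node with 3 CPUs
# and 1 GPU, with the GPU scheduling policy turned on.
ray.init(address="auto")

# {1 GPU, 1 CPU} bundle plus five {1 CPU} bundles, as in the scenario above.
pg = placement_group([{"CPU": 1, "GPU": 1}] + [{"CPU": 1}] * 5)
ray.get(pg.ready())


@ray.remote(num_cpus=1)
class Worker:
    def ready(self):
        return True


# Long-lived actors fill the placement group's CPU bundles. Before the fix,
# once the non-GPU node is full, the remaining actors are never spilled back
# to the GPU node, so this ray.get() hangs (here it times out instead).
actors = [Worker.options(placement_group=pg).remote() for _ in range(5)]
ray.get([a.ready.remote() for a in actors], timeout=60)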

Solution

The solution is simple. We first try scheduling on CPU-only nodes with require_available=true, and then do the same for the GPU nodes. This makes sure tasks are spilled back when a node is only "feasible" but not "available".

If neither pass finds a node, we fall back to the original policy, which guarantees liveness.
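In terms of the earlier sketch, the fixed selection order is roughly as follows (illustrative only; the real change is in src/ray/raylet/scheduling/scheduling_policy.cc):

def pick_node_after_fix(request, nodes):
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    gpu = {nid: n for nid, n in nodes.items() if nid not in non_gpu}

    # Stage 1: non-GPU nodes, but only if the request fits in *free* resources.
    best = hybrid_policy(request, non_gpu, require_available=True)
    if best is not None:
        return best
    # Stage 2: GPU nodes, also requiring availability, so fully occupied nodes
    # are skipped and the task can spill back.
    best = hybrid_policy(request, gpu, require_available=True)
    if best is not None:
        return best
    # Stage 3: fall back to the original policy over all nodes ("feasible" is
    # enough), which preserves the liveness guarantee.
    return hybrid_policy(request, nodes, require_available=False)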

I verified it works with the repro.

Related issue number

#19130

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 rkooo567 assigned ericl, scv119 and sasha-s and unassigned ericl, scv119 and sasha-s Oct 6, 2021
@rkooo567 rkooo567 changed the title [Core] Fix GPU first scheduling that is not working with placement group [WIP][Core] Fix GPU first scheduling that is not working with placement group Oct 6, 2021
@sasha-s (Contributor) commented Oct 6, 2021

Looks reasonable to me.
I actually thought about doing it this way in the first place, but it looked like overkill at the time.

@ericl (Contributor) left a comment

The solution makes sense to me. How can we test this?

src/ray/raylet/scheduling/scheduling_policy.cc (outdated review thread, resolved)
@rkooo567 (Contributor, Author) commented Oct 6, 2021

@ericl This PR's regression test should fail without this fix when GPU scheduling is on. #19129

Also, I will write a C++ test as well (it should be easy to write one).

@ericl (Contributor) commented Oct 6, 2021

I see, can we flip on the flag and check the test goes from failing to passing?

@rkooo567 (Contributor, Author) commented Oct 6, 2021

@ericl that was done in this PR! (I verified multiple times.) #19129

@scv119 (Contributor) commented Oct 7, 2021

Looks like a reasonable fix.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 7, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 8, 2021
@rkooo567 rkooo567 changed the title [WIP][Core] Fix GPU first scheduling that is not working with placement group [Core] Fix GPU first scheduling that is not working with placement group Oct 8, 2021
@@ -715,6 +715,7 @@ def _live_node_ids(self):

def _available_resources_per_node(self):
"""Returns a dictionary mapping node id to avaiable resources."""
self._check_connected()
@rkooo567 (Contributor, Author): It was a bug. I can remove it if you'd prefer it not be included here.

@@ -550,8 +550,8 @@ def __init__(self):
def get_location(self):
return ray.worker.global_worker.node.unique_id

@ray.remote
def task_cpu(num_cpus=0.5):
@ray.remote(num_cpus=0.5)
@rkooo567 (Contributor, Author): another bug in tests.

@@ -618,6 +618,60 @@ def g():
time.sleep(1)


def test_gpu_scheduling_liveness(ray_start_cluster):
@rkooo567 (Contributor, Author): I verified this test fails when the change isn't applied.

@sasha-s (Contributor) left a comment

LGTM

@rkooo567 (Contributor, Author) commented Oct 8, 2021

Windows failure: I will skip the test on Windows, since every other test in the scheduling tests has already been disabled there.

}
// Could not schedule on CPU-only nodes, schedule on GPU nodes as a last resort.
best_node_id = HybridPolicyWithFilter(resource_request, local_node_id, nodes,
Contributor:

Can we just remove NodeFilter::kGPU here? So there are only two stages: (1) kCpuOnly, require_avail, then (2) no filters.
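(For illustration only, the suggested two-stage order in terms of the earlier Python sketch; this is a sketch of the suggestion, not the merged C++ code:)

def pick_node_two_stage(request, nodes):
    non_gpu = {nid: n for nid, n in nodes.items() if n["total"].get("GPU", 0) == 0}
    # Stage 1: non-GPU nodes, requiring the request to fit in free resources.
    best = hybrid_policy(request, non_gpu, require_available=True)
    if best is not None:
        return best
    # Stage 2: no filter at all -- the original hybrid policy over every node.
    return hybrid_policy(request, nodes, require_available=False)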

@rkooo567 (Contributor, Author): Ah, yeah, good point. Will make a fix.

@rkooo567 (Contributor, Author): Btw, renamed CpuOnly to NonGpu because CpuOnly is misleading.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 8, 2021
@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2021
@rkooo567 rkooo567 merged commit 3b865b4 into ray-project:master Oct 11, 2021