[Enable gcs actor scheduler 1/n] Raylet and GCS schedulers share cluster_task_manager #23829

Chong-Li · 2022-04-11T08:41:57Z

Why are these changes needed?

As discussed in #23460, the long-term plan to enable the GCS scheduler is as follows:

1. Remove RayletBasedActorScheduler and GcsBasedActorScheduler. Instead, handle both kinds of actor requests (those to be scheduled at the raylet and those to be scheduled at GCS) in GcsActorScheduler.
2. GcsActorScheduler calls ClusterTaskManager, which in turn calls ClusterResourceScheduler to select a node. Because ClusterTaskManager maintains the infeasible and ready queues, the pending queue in GcsActorManager should be removed. After this change, GcsActorScheduler only manages the leasing lifecycle (sending/cancelling leases and pushing the creation task), while queueing and scheduling (this PR actually does the same thing in a quicker way) are handled by ClusterTaskManager. If the leftover responsibility of GcsActorScheduler is simple enough, we may even remove it and reimplement its functions in GcsActorManager.
3. At raylets, lease requests that have already been handled by the GCS's ClusterTaskManager can skip the raylet's ClusterTaskManager and go directly to the LocalTaskManager.
4. Turn on the GCS scheduling feature flag and resolve any leftover failures.
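
To illustrate the split of responsibilities described in the plan, here is a minimal sketch of a task manager that owns the queueing: requests that no node could ever satisfy go to an infeasible queue, and requests that are feasible but cannot be placed right now stay in a waiting queue. All names (ToyClusterTaskManager, ToyNode, ToyRequest) and the CPU-only resource model are assumptions made for this example; this is not Ray's actual ClusterTaskManager.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are not Ray's real classes.
struct ToyNode {
  double total_cpu;
  double available_cpu;
};

struct ToyRequest {
  std::string actor_id;
  double cpu;
};

class ToyClusterTaskManager {
 public:
  void AddNode(const std::string &node_id, double cpu) { nodes_[node_id] = {cpu, cpu}; }

  // Queue a request, then try to place everything that is currently waiting.
  void QueueAndSchedule(const ToyRequest &request) {
    waiting_.push_back(request);
    ScheduleAndDispatch();
  }

  void ScheduleAndDispatch() {
    std::deque<ToyRequest> still_waiting;
    while (!waiting_.empty()) {
      ToyRequest request = waiting_.front();
      waiting_.pop_front();
      bool feasible = false;  // some node is large enough in total
      bool placed = false;    // some node has enough free capacity now
      for (auto &entry : nodes_) {
        ToyNode &node = entry.second;
        if (node.total_cpu >= request.cpu) feasible = true;
        if (node.available_cpu >= request.cpu) {
          node.available_cpu -= request.cpu;
          placements_[request.actor_id] = entry.first;
          placed = true;
          break;
        }
      }
      if (placed) continue;
      if (feasible) {
        still_waiting.push_back(request);  // retry when resources free up
      } else {
        infeasible_.push_back(request);  // no node can ever fit this request
      }
    }
    waiting_.swap(still_waiting);
  }

  const std::map<std::string, std::string> &placements() const { return placements_; }
  size_t waiting_size() const { return waiting_.size(); }
  size_t infeasible_size() const { return infeasible_.size(); }

 private:
  std::map<std::string, ToyNode> nodes_;
  std::deque<ToyRequest> waiting_;
  std::deque<ToyRequest> infeasible_;
  std::map<std::string, std::string> placements_;  // actor_id -> node_id
};
```

With the queues kept in one place like this, a caller such as the actor scheduler only needs to react to placements (send or cancel a lease) and does not need a pending queue of its own.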

This PR unifies RayletBasedActorScheduler and GcsBasedActorScheduler into GcsActorScheduler, which relies on ClusterTaskManager for queueing and scheduling.

The key function is GcsActorScheduler::Schedule(), in which we decide whether to schedule at GCS or to forward the request to a remote raylet (the owner node in most cases) for scheduling.
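
The decision made in Schedule() can be sketched roughly as follows. This is a toy model, not the code in this PR: the types ToyActorRequest and ToyDecision, the function ToySchedule, the single boolean flag, and the fallback to another node when the owner node is not alive are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration; these are not Ray's real types.
enum class ToyTarget { kGcs, kRaylet };

struct ToyActorRequest {
  std::string actor_id;
  std::string owner_node_id;  // node of the worker that created the actor
};

struct ToyDecision {
  ToyTarget target;
  std::string forward_to_node_id;  // empty when scheduling at GCS
};

// Assumption: one boolean flag selects GCS scheduling, and when the owner
// node is not alive some other live node is used as the forwarding target.
ToyDecision ToySchedule(const ToyActorRequest &request,
                        bool gcs_scheduling_enabled,
                        bool owner_node_alive,
                        const std::string &fallback_node_id) {
  if (gcs_scheduling_enabled) {
    // GCS picks the node itself, so there is nothing to forward.
    return {ToyTarget::kGcs, ""};
  }
  if (owner_node_alive) {
    // The common case: let the owner's raylet do the scheduling.
    return {ToyTarget::kRaylet, request.owner_node_id};
  }
  return {ToyTarget::kRaylet, fallback_node_id};
}
```

In either branch the leasing steps that follow (sending the lease request, pushing the creation task) would be the same; only the place where the node is selected differs.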

Related issue number

#23460

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

src/ray/raylet/scheduling/cluster_resource_manager.cc

src/ray/raylet/scheduling/cluster_resource_scheduler.cc

src/ray/raylet/scheduling/cluster_resource_scheduler.h

src/ray/raylet/local_task_manager.h

src/ray/raylet/local_task_manager.cc

src/ray/raylet/scheduling/internal.h

src/ray/raylet/scheduling/cluster_task_manager.h

src/ray/raylet/scheduling/cluster_task_manager.cc

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

src/ray/raylet/scheduling/cluster_resource_scheduler.cc

Chong-Li · 2022-04-14T09:54:25Z

@iycheng @scv119 Do you have any comments?

src/ray/raylet/scheduling/cluster_resource_scheduler.cc

scv119 · 2022-04-26T02:35:56Z

@Chong-Li thanks, the PR looks very good. I'll take a detailed review soon. Meanwhile the test failure looks related to this PR.

src/ray/raylet/scheduling/cluster_resource_scheduler.cc

scv119 · 2022-04-26T02:39:59Z

src/ray/raylet/scheduling/cluster_resource_scheduler.cc

@@ -115,7 +107,7 @@ scheduling::NodeID ClusterResourceScheduler::GetBestSchedulableNode(
    bool *is_infeasible) {
  // The zero cpu actor is a special case that must be handled the same way by all
  // scheduling policies.
-  if (actor_creation && resource_request.IsEmpty()) {


cc @jjyao for this change.

I feel we can decouple this change from this PR, seems unrelated?

Sure, I'm going to handle this together with the issue mentioned above (fixing exclude_local_node parameter).

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

scv119 · 2022-04-26T02:49:33Z

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

+        false);
+    cluster_resource_manager.AddNodeAvailableResources(
+        scheduling::NodeID(actor->GetNodeID().Binary()), acquired_resources);
+    cluster_task_manager_->ScheduleAndDispatchTasks();


Should we call normal_task_resources_changed_callback_ here? Also, should we create a helper function if this block is the same as the one in HandleWorkerLeaseRejectedReply?

> should we call normal_task_resources_changed_callback_ here?

The actor's resources have nothing to do with normal_task_resources, so we don't need to call the callback.

> also should we create a helper function if this block is same as HandleWorkerLeaseRejectedReply ?

Sure, just did.
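
For readers following this thread, the shared helper could look roughly like the sketch below: return the actor's acquired resources to its node, then trigger another scheduling round. The helper name ReturnAcquiredResources and the Toy* types are hypothetical and the resource model is CPU-only; only the two calls AddNodeAvailableResources and ScheduleAndDispatchTasks mirror names that appear in the diff above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are not Ray's real classes.
struct ToyResourceManager {
  std::map<std::string, double> available_cpu;  // node_id -> free CPUs
  void AddNodeAvailableResources(const std::string &node_id, double cpu) {
    available_cpu[node_id] += cpu;
  }
};

struct ToyTaskManager {
  int dispatch_rounds = 0;
  // Stand-in for re-running scheduling over the queued requests.
  void ScheduleAndDispatchTasks() { ++dispatch_rounds; }
};

struct ToyLeasedActor {
  std::string node_id;
  double acquired_cpu;
};

// Hypothetical helper shared by the lease-rejected and cancel paths.
void ReturnAcquiredResources(ToyLeasedActor &actor,
                             ToyResourceManager &resource_manager,
                             ToyTaskManager &task_manager) {
  resource_manager.AddNodeAvailableResources(actor.node_id, actor.acquired_cpu);
  actor.acquired_cpu = 0;  // make a second call a no-op for the resource view
  // Freed resources may unblock queued requests, so schedule again.
  task_manager.ScheduleAndDispatchTasks();
}
```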

src/ray/gcs/gcs_server/gcs_actor_manager.cc

src/ray/gcs/gcs_server/gcs_actor_manager.h

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

wumuzi520

LGTM!

scv119 · 2022-04-27T06:02:38Z

let's wait for @jjyao's review as well!

jjyao

Looks good! A few more comments.

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

src/ray/gcs/gcs_server/gcs_resource_manager.cc

src/ray/raylet/scheduling/cluster_resource_data.cc

jjyao · 2022-04-27T16:36:35Z

src/ray/raylet/scheduling/cluster_task_manager.cc

jjyao

Could we run release tests to make sure everything is ok?

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

src/ray/raylet/scheduling/cluster_resource_data.cc

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

rkooo567

Really excited!

grant_or_reject for raylet based actor scheduling is implemented as part of #23829, so spread scheduling now works for actors just like tasks.

Raylet and GCS schedulers share cluster_task_manager

3b674d1

Chong-Li marked this pull request as draft April 11, 2022 08:42

Fix gcs_actor_manager_test

9128857

rkooo567 assigned scv119, rkooo567 and fishbone Apr 11, 2022

rkooo567 added the @author-action-required label Apr 11, 2022

jjyao self-assigned this Apr 11, 2022

Chong-Li added 2 commits April 13, 2022 11:40

Add some explain

66e02cc

Merge remote-tracking branch 'upstream/master' into refactoring_gcs_s…

1d7f22c

…cheduler

Chong-Li marked this pull request as ready for review April 13, 2022 03:46

scv119 assigned wumuzi520 Apr 13, 2022