
[Enable gcs actor scheduler 1/n] Raylet and GCS schedulers share cluster_task_manager #23829

Merged
merged 26 commits into ray-project:master on May 2, 2022

Conversation

Chong-Li (Contributor) commented Apr 11, 2022

Why are these changes needed?

As discussed in #23460, the long-term plan to enable gcs scheduler is as follows:

  • Remove RayletBasedActorScheduler and GcsBasedActorScheduler. Instead, handle all actor requests (whether they are to be scheduled at a raylet or at the GCS) in a single GcsActorScheduler.

  • GcsActorScheduler calls ClusterTaskManager, which in turn calls ClusterResourceScheduler to select a node. Moreover, because ClusterTaskManager maintains the infeasible and ready queues, the pending queue in GcsActorManager should be removed. After this change, GcsActorScheduler only manages the lifecycle of leasing (sending/canceling leases and pushing creation tasks), while queueing and scheduling (which this PR already does, in a faster way) are handled by ClusterTaskManager. If the leftover responsibility of GcsActorScheduler is simple enough, we may even remove it and reimplement its functions in GcsActorManager.

  • At raylets, lease requests that have already been handled by the GCS's ClusterTaskManager can skip the raylet's ClusterTaskManager and go directly to the LocalTaskManager.

  • Turn on the gcs scheduling feature flag and resolve any leftover failures.
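
The raylet-side step described above can be sketched roughly as follows. This is an illustrative model only, with hypothetical names (the enum, function, and flag below are not Ray's actual handler signatures): a lease request already placed by the GCS's ClusterTaskManager only needs local admission at the raylet.

```cpp
#include <cassert>

// Hypothetical routing model: a lease request whose cluster-level placement
// was already decided by the GCS bypasses the raylet's ClusterTaskManager
// and goes straight to the LocalTaskManager, which only handles worker
// leasing and local resource bookkeeping.
enum class Handler { kClusterTaskManager, kLocalTaskManager };

Handler RouteLeaseRequest(bool scheduled_by_gcs) {
  if (scheduled_by_gcs) {
    // Placement is done; only local admission remains.
    return Handler::kLocalTaskManager;
  }
  // Legacy path: the raylet performs cluster-level scheduling itself.
  return Handler::kClusterTaskManager;
}
```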

This PR unifies RayletBasedActorScheduler and GcsBasedActorScheduler into GcsActorScheduler, which relies on ClusterTaskManager for queueing and scheduling.

The key function is GcsActorScheduler::Schedule(), in which we decide whether to schedule the actor at the GCS or to forward the request to a remote raylet (the owner node in most cases) for scheduling.
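
The decision can be sketched as below. This is only an illustrative model of the two paths described above; the struct, function, and parameter names are hypothetical and do not match the actual Ray signatures.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of GcsActorScheduler::Schedule(): either the GCS
// queues/schedules the actor itself via the shared ClusterTaskManager, or
// the lease request is forwarded to a remote raylet (the owner's node in
// most cases) for scheduling.
struct ScheduleDecision {
  bool schedule_at_gcs;        // true: GCS schedules via ClusterTaskManager
  std::string target_node_id;  // empty when scheduling at the GCS
};

ScheduleDecision Schedule(bool gcs_scheduling_enabled,
                          const std::string &owner_node_id) {
  if (gcs_scheduling_enabled) {
    // The GCS selects the node itself: ClusterTaskManager consults
    // ClusterResourceScheduler and manages the infeasible/ready queues.
    return {true, ""};
  }
  // Legacy raylet-based path: forward the request to the owner's raylet,
  // which performs cluster-level scheduling.
  return {false, owner_node_id};
}
```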

Related issue number

#23460

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Chong-Li Chong-Li marked this pull request as draft April 11, 2022 08:42
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 11, 2022
@jjyao jjyao self-assigned this Apr 11, 2022
@Chong-Li Chong-Li marked this pull request as ready for review April 13, 2022 03:46
Chong-Li (Contributor, Author)

@iycheng @scv119 Do you have any comments?

@Chong-Li Chong-Li requested a review from wumuzi520 April 25, 2022 15:22
scv119 (Contributor) commented Apr 26, 2022

@Chong-Li thanks, the PR looks very good. I'll take a detailed review soon. Meanwhile the test failure looks related to this PR.

src/ray/raylet/scheduling/cluster_resource_scheduler.cc (Outdated)
@@ -115,7 +107,7 @@ scheduling::NodeID ClusterResourceScheduler::GetBestSchedulableNode(
bool *is_infeasible) {
// The zero cpu actor is a special case that must be handled the same way by all
// scheduling policies.
if (actor_creation && resource_request.IsEmpty()) {
Contributor commented:
cc @jjyao for this change.

jjyao (Collaborator) commented Apr 27, 2022:
I feel we can decouple this change from this PR; it seems unrelated.

Sure, I'm going to handle this together with the issue mentioned above (fixing exclude_local_node parameter).

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc (Outdated)
false);
cluster_resource_manager.AddNodeAvailableResources(
scheduling::NodeID(actor->GetNodeID().Binary()), acquired_resources);
cluster_task_manager_->ScheduleAndDispatchTasks();
Contributor commented:
Should we call normal_task_resources_changed_callback_ here? Also, should we create a helper function, since this block is the same as in HandleWorkerLeaseRejectedReply?

Chong-Li (Contributor, Author) replied:
should we call normal_task_resources_changed_callback_ here?

The actor's resources have nothing to do with normal_task_resources, so we don't need to call the callback.

also should we create a helper function if this block is same as HandleWorkerLeaseRejectedReply ?

Sure, just did.

wumuzi520 (Contributor) left a comment:
LGTM!

scv119 (Contributor) commented Apr 27, 2022

let's wait for @jjyao's review as well!

jjyao (Collaborator) left a comment:
Looks good! A few more comments.

src/ray/gcs/gcs_server/gcs_actor_scheduler.cc (Outdated)
src/ray/gcs/gcs_server/gcs_actor_scheduler.cc

jjyao (Collaborator) left a comment:
Could we run release tests to make sure everything is ok?

rkooo567 (Contributor) left a comment:
Really excited!

@raulchen raulchen merged commit f376713 into ray-project:master May 2, 2022
@raulchen raulchen deleted the refactoring_gcs_scheduler branch May 2, 2022 13:45
jjyao added a commit that referenced this pull request May 7, 2022
grant_or_reject for raylet-based actor scheduling is implemented as part of #23829, so spread scheduling now works for actors just like tasks.
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
7 participants