[refactor cluster-task-manage 4/n] refactor cluster_task_manager into distributed and local part #21660
Conversation
Any plan to make the PR smaller?
Yeah, I think once I've fixed all the tests I'll try to see how to make it easier to review!
I'm fine with this PR since most of the changes just move things around.
```cpp
    info_by_sched_cls_.erase(scheduling_class);
  }
  if (is_infeasible) {
    // TODO(scv119): fail the request.
```
@iycheng @rkooo567
This is one major change. As I discussed with Sang, this can only happen (a request that was previously feasible later becomes infeasible) when a placement group's resources are deallocated, so we should just go ahead and fail the request.
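As a rough illustration of what "fail the request" could mean at this point, here is a minimal sketch; all types and names below are placeholders, not the real Ray classes:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical sketch only: when a queued task goes from feasible to infeasible
// (e.g. its placement group was removed), answer the original request with a
// failure instead of leaving it queued forever.
struct Work {
  std::string task_id;
  // Callback used to answer the original scheduling request (placeholder).
  std::function<void(bool success, const std::string &reason)> reply;
};

void FailInfeasibleWork(std::vector<Work> &queue) {
  for (auto &work : queue) {
    // Tell the requester the task can no longer be scheduled on this cluster.
    work.reply(/*success=*/false, "placement group resources were removed");
  }
  queue.clear();
}

int main() {
  std::vector<Work> queue;
  queue.push_back({"task-1", [](bool ok, const std::string &reason) {
                     std::cout << "task-1 scheduled=" << ok
                               << " reason=" << reason << "\n";
                   }});
  FailInfeasibleWork(queue);  // queue becomes empty; the requester is notified.
}
```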
```cpp
    Spillback(node_id, work);
  }
  NodeID node_id = NodeID::FromBinary(node_id_string);
  ScheduleOnNode(node_id, work);
```
This is another major change: we hide scheduling on the local node vs. a remote node behind this ScheduleOnNode function.
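A minimal sketch of that idea, assuming a hypothetical Scheduler class; the real implementation lives in the cluster task manager and uses different types:

```cpp
#include <iostream>
#include <string>
#include <utility>

// Placeholder types; the real code uses NodeID/Work classes from Ray.
using NodeID = std::string;
struct Work { std::string task_id; };

class Scheduler {
 public:
  explicit Scheduler(NodeID self) : self_node_id_(std::move(self)) {}

  // Single entry point: callers no longer care whether the chosen node is
  // local or remote; the branching is hidden here.
  void ScheduleOnNode(const NodeID &node_id, const Work &work) {
    if (node_id == self_node_id_) {
      DispatchLocally(work);       // hand off to the local task manager
    } else {
      Spillback(node_id, work);    // spill the task to the chosen remote node
    }
  }

 private:
  void DispatchLocally(const Work &work) {
    std::cout << work.task_id << " queued for local dispatch\n";
  }
  void Spillback(const NodeID &node_id, const Work &work) {
    std::cout << work.task_id << " spilled to " << node_id << "\n";
  }

  NodeID self_node_id_;
};

int main() {
  Scheduler scheduler("node-A");
  scheduler.ScheduleOnNode("node-A", {"t1"});  // local path
  scheduler.ScheduleOnNode("node-B", {"t2"});  // remote spillback
}
```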
Are you going to split this into more PRs, or is this the PR for review?
@jjyao I think there is no more low-hanging fruit for breaking down this refactor at the moment.
```cpp
    info_by_sched_cls_.erase(scheduling_class);
  }
  if (is_infeasible) {
    // TODO(scv119): fail the request.
```
Synced with Chen. It seems that this should never happen. Maybe let's just RAY_CHECK here to catch any failure.
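A sketch of that suggestion, using a stand-in macro so the snippet compiles on its own; in the real code this would be RAY_CHECK from ray/util/logging.h inside the scheduling path quoted above:

```cpp
#include <cstdlib>
#include <iostream>

// Stand-in for RAY_CHECK: crash loudly with a message if the condition fails.
#define CHECK_OR_DIE(cond, msg)                              \
  do {                                                       \
    if (!(cond)) {                                           \
      std::cerr << "Check failed: " << (msg) << std::endl;   \
      std::abort();                                          \
    }                                                        \
  } while (0)

void HandleSchedulingResult(bool is_infeasible) {
  // A queued task should never flip from feasible to infeasible here.
  CHECK_OR_DIE(!is_infeasible,
               "task became infeasible after being queued; unreachable path");
}

int main() {
  HandleSchedulingResult(/*is_infeasible=*/false);  // passes silently
}
```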
Just curious, but why is it a bug? Is this code path not reachable when pg is removed?
LGTM and thanks for the work. The architecture is much better than before.
Also went over it with Chen. I think it is good to go. The current form seems to be the best for a single PR.
Kicked off benchmarks here: https://buildkite.com/ray-project/periodic-ci/builds/2808#0ab42b71-4ceb-483d-bcd6-ac558243ea59
```cpp
      [this](const std::vector<ObjectID> &object_ids,
             std::vector<std::unique_ptr<RayObject>> *results) {
        return GetObjectsFromPlasma(object_ids, results);
      },
      max_task_args_memory);
  cluster_task_manager_ = std::make_shared<ClusterTaskManager>(
      self_node_id_,
      std::dynamic_pointer_cast<ClusterResourceScheduler>(cluster_resource_scheduler_),
```
The std::dynamic_pointer_cast could be removed.
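A minimal sketch of why the cast is unnecessary, using stand-in types rather than the real declarations: if the member already holds (or the constructor already accepts) the matching pointer type, the shared_ptr can be passed directly.

```cpp
#include <memory>
#include <utility>

// Placeholder types, not the real Ray classes.
struct ClusterResourceScheduler {};

struct ClusterTaskManager {
  explicit ClusterTaskManager(std::shared_ptr<ClusterResourceScheduler> scheduler)
      : scheduler_(std::move(scheduler)) {}
  std::shared_ptr<ClusterResourceScheduler> scheduler_;
};

int main() {
  auto cluster_resource_scheduler = std::make_shared<ClusterResourceScheduler>();
  // No dynamic_pointer_cast needed: the static type already matches the
  // constructor parameter.
  auto cluster_task_manager =
      std::make_shared<ClusterTaskManager>(cluster_resource_scheduler);
}
```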
```cpp
  /// \param readyIds: The tasks which are now ready to be dispatched.
  virtual void TasksUnblocked(const std::vector<TaskID> &ready_ids) = 0;

  ;
```
Is it a redundant `;`?
Oops, will address those comments in follow-up PRs!
Why are these changes needed?
This is a work-in-progress PR that splits cluster_task_manager into local and distributed parts.
For the distributed scheduler (cluster_task_manager_):
/// Schedules a task onto one node of the cluster. The logic is as follows:
/// 1. Queue tasks for scheduling.
/// 2. Pick a node on the cluster which has the available resources to run a
/// task.
/// * Step 2 should occur any time the state of the cluster is
/// changed, or a new task is queued.
/// 3. For tasks that are infeasible, put them into the infeasible queue and report
/// them to the GCS, where the autoscaler will be notified and start new nodes
/// to accommodate the requirement.
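A rough interface sketch of this distributed part; the class, member, and node names below are simplified placeholders, not the actual Ray declarations:

```cpp
#include <deque>
#include <string>

// Simplified placeholder types; the real code uses RayTask, NodeID, etc.
struct Task { std::string id; };
using NodeID = std::string;

class ClusterTaskManagerSketch {
 public:
  // Step 1: queue tasks for scheduling.
  void QueueAndScheduleTask(const Task &task) {
    tasks_to_schedule_.push_back(task);
    ScheduleAndDispatchTasks();
  }

  // Step 2: runs whenever cluster state changes or a new task is queued.
  void ScheduleAndDispatchTasks() {
    while (!tasks_to_schedule_.empty()) {
      Task task = tasks_to_schedule_.front();
      tasks_to_schedule_.pop_front();
      NodeID node = PickBestNode(task);
      if (node.empty()) {
        // Step 3: infeasible -> park it and notify the autoscaler via the GCS.
        infeasible_tasks_.push_back(task);
        ReportInfeasibleToGcs(task);
      } else {
        ScheduleOnNode(node, task);  // local dispatch or spillback
      }
    }
  }

 private:
  NodeID PickBestNode(const Task &) { return "node-A"; }  // placeholder
  void ReportInfeasibleToGcs(const Task &) {}             // placeholder
  void ScheduleOnNode(const NodeID &, const Task &) {}    // placeholder

  std::deque<Task> tasks_to_schedule_;
  std::deque<Task> infeasible_tasks_;
};

int main() {
  ClusterTaskManagerSketch manager;
  manager.QueueAndScheduleTask({"task-1"});
}
```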
For the local task manager:
/// It manages the lifetime of a task on the local node. It receives requests from
/// cluster_task_manager (the distributed scheduler) and performs the following
/// steps:
/// 1. Pull task dependencies and add the task to the to_dispatch queue.
/// 2. Once a task's dependencies are all pulled locally, the task becomes ready
/// to dispatch.
/// 3. For all tasks that are dispatch-ready, we schedule them by acquiring
/// local resources (including pinning the objects in memory and deducting
/// CPU/GPU and other resources from the local resource manager) and starting
/// a worker.
/// 4. If a task fails to acquire resources in step 3, we will try to
/// spill it to a different remote node.
/// 5. When a worker finishes executing its task(s), the requester will return
/// it and we should release the resources in our view of the node's state.
/// 6. If a task has been waiting for arguments for too long, it will also be
/// spilled back to a different node.
///
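For reference, a similarly simplified sketch of the local task manager lifecycle described above; all names are illustrative placeholders rather than the real Ray API:

```cpp
#include <deque>
#include <string>

// Placeholder type mirroring the lifecycle described above.
struct Task { std::string id; };

class LocalTaskManagerSketch {
 public:
  // Step 1: start pulling dependencies and remember the task.
  void QueueAndScheduleTask(const Task &task) {
    PullDependencies(task);
    waiting_.push_back(task);
  }

  // Step 2: called when all of a task's dependencies are local.
  void OnDependenciesReady(const Task &task) {
    to_dispatch_.push_back(task);
    DispatchTasks();
  }

  // Steps 3-4: acquire local resources and a worker, or spill to another node.
  void DispatchTasks() {
    while (!to_dispatch_.empty()) {
      Task task = to_dispatch_.front();
      to_dispatch_.pop_front();
      if (TryAcquireLocalResources(task)) {
        StartWorker(task);
      } else {
        SpillToRemoteNode(task);
      }
    }
  }

  // Step 5: when a worker returns, release its resources in our local view.
  void OnWorkerReturned(const Task &task) { ReleaseLocalResources(task); }

 private:
  bool TryAcquireLocalResources(const Task &) { return true; }  // placeholder
  void PullDependencies(const Task &) {}                        // placeholder
  void StartWorker(const Task &) {}                             // placeholder
  void SpillToRemoteNode(const Task &) {}                       // placeholder
  void ReleaseLocalResources(const Task &) {}                   // placeholder

  std::deque<Task> waiting_;
  std::deque<Task> to_dispatch_;
};

int main() {
  LocalTaskManagerSketch manager;
  manager.QueueAndScheduleTask({"task-1"});
  manager.OnDependenciesReady({"task-1"});
  manager.OnWorkerReturned({"task-1"});
}
```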
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.