
Handle starting worker throttling inside worker pool #28551

Merged (13 commits) on Sep 23, 2022

Conversation

@jjyao (Collaborator) commented Sep 15, 2022:

Signed-off-by: Jiajun Yao [email protected]

Why are these changes needed?

Currently, the worker pool throttles how many workers can be started simultaneously (i.e. maximum_startup_concurrency_). If a PopWorker call cannot be fulfilled due to this throttling, it fails and the caller (i.e. the local task manager) handles the retry. The issue is that when PopWorker fails, the local task manager releases the resources claimed by the task. As a result, even though the node already has enough tasks to use up all of its resources, it still reports available resources and attracts more tasks than it can handle. Instead of letting the local task manager handle the throttling, the worker pool should handle it internally, since throttling is a transient condition rather than a real error: it is effectively the same as a longer worker startup time.
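
As a rough illustration of the new flow (a simplified sketch, not the actual Ray code; the names mirror pending_pop_worker_requests and TryPendingPopWorkerRequests from the diff below, but the types and signatures here are assumptions):

#include <deque>
#include <functional>
#include <utility>

// Stand-ins for the real Ray types; everything below is trimmed down.
struct TaskSpec {};
using PopWorkerCallback = std::function<void(bool /*worker_delivered*/)>;

class ThrottledWorkerPool {
 public:
  explicit ThrottledWorkerPool(int maximum_startup_concurrency)
      : maximum_startup_concurrency_(maximum_startup_concurrency) {}

  void PopWorker(const TaskSpec &task_spec, PopWorkerCallback callback) {
    if (num_starting_workers_ >= maximum_startup_concurrency_) {
      // Old behavior: fail the callback so the local task manager retries,
      // releasing the task's resources in the meantime. New behavior: park
      // the request inside the pool until a startup slot or idle worker
      // becomes available.
      pending_pop_worker_requests_.push_back({task_spec, std::move(callback)});
      return;
    }
    ++num_starting_workers_;
    // ... start a worker process here; the callback fires once it registers.
  }

  // Greatly simplified PushWorker: a worker was returned (idle) or finished
  // starting, so free a startup slot and retry any parked requests. The real
  // PushWorker also manages the idle-worker pool.
  void PushWorker() {
    if (num_starting_workers_ > 0) {
      --num_starting_workers_;
    }
    TryPendingPopWorkerRequests();
  }

 private:
  struct PendingRequest {
    TaskSpec task_spec;  // stored by value so it outlives the caller's reference
    PopWorkerCallback callback;
  };

  // Re-issue queued PopWorker calls while capacity is available, mirroring
  // the TryPendingPopWorkerRequests call added in this PR's diff.
  void TryPendingPopWorkerRequests() {
    while (!pending_pop_worker_requests_.empty() &&
           num_starting_workers_ < maximum_startup_concurrency_) {
      PendingRequest request = std::move(pending_pop_worker_requests_.front());
      pending_pop_worker_requests_.pop_front();
      PopWorker(request.task_spec, std::move(request.callback));
    }
  }

  const int maximum_startup_concurrency_;
  int num_starting_workers_ = 0;
  std::deque<PendingRequest> pending_pop_worker_requests_;
};

Because a throttled request never surfaces as a failure, the local task manager keeps the task's resources allocated while the worker starts, which is the behavior change described above.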

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jjyao jjyao marked this pull request as ready for review September 20, 2022 18:19

@stephanie-wang (Contributor) left a comment:

LGTM!

I'm curious to see if scheduling throughput improves on test_many_tasks with this change.

@stephanie-wang (Contributor):

It may be too difficult to write one that isn't flaky, but you could also consider adding a Python test to check that the resource availability accounting is correct while workers are starting.

@@ -968,6 +970,10 @@ void WorkerPool::PushWorker(const std::shared_ptr<WorkerInterface> &worker) {
// TODO(SongGuyang): This worker will not be used forever. We should kill it.
state.idle_dedicated_workers[task_id] = worker;
}
// We either have an idle worker or a slot to start a new worker.
if (worker->GetWorkerType() == rpc::WorkerType::WORKER) {
TryPendingPopWorkerRequests(worker->GetLanguage());

Contributor:

Would it be better to invoke this function in OnWorkerStarted? I suggest this because the flag is_pending_registration is set to false in that function, and it is related to the count of concurrently starting worker processes.

Collaborator (author):

We don't only retry pending pop worker requests when is_pending_registration is set to false, but also when an idle worker is returned. PushWorker() handles both cases.

If we did it in OnWorkerStarted(), we might start more workers than necessary. Imagine a pre-started worker registers: OnWorkerStarted() is called first, so if we retry in that function we will start a new worker (since the just-registered worker hasn't been pushed yet). If we retry in PushWorker() instead, we can simply use the pre-started worker without starting a new one.
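
To make the ordering concrete, here is a rough sketch with hypothetical, trimmed-down types (not Ray's actual Worker/WorkerPool interfaces) of why retrying from PushWorker lets a queued request pick up the pre-started worker:

#include <deque>
#include <functional>

struct Worker {
  bool is_pending_registration = true;
};

struct IdlePool {
  std::deque<Worker *> idle_workers;
};

// OnWorkerStarted runs first and only clears the pending-registration flag;
// the worker is not yet visible as idle, so retrying queued PopWorker
// requests at this point could launch an unnecessary new worker process.
void OnWorkerStarted(Worker &worker) {
  worker.is_pending_registration = false;
}

// PushWorker runs afterwards, once the worker actually sits in the idle
// pool, so a queued request retried from here can reuse the pre-started
// worker instead of starting a fresh one.
void PushWorker(Worker &worker, IdlePool &pool,
                const std::function<void()> &try_pending_pop_worker_requests) {
  pool.idle_workers.push_back(&worker);
  try_pending_pop_worker_requests();
}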

} else if (status == PopWorkerStatus::TooManyStartingWorkerProcesses) {
DeleteRuntimeEnvIfPossible(task_spec.SerializedRuntimeEnv());
state.pending_pop_worker_requests.emplace_back(
PopWorkerRequest{task_spec, callback, allocated_instances_serialized_json});

Contributor:

Do we need to avoid the memory copy of task_spec? Maybe we can optimize this in the future and add a TODO here.

Collaborator (author):

I think we need to make a copy here. The caller passes this in as a reference; if we don't make a copy, the caller may destroy the object later on.
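
A minimal illustration of the lifetime issue (hypothetical types, not the actual Ray code): the queued request has to own a copy of the spec, because a stored reference would dangle once the caller destroys its object.

#include <deque>
#include <string>

struct TaskSpec {
  std::string serialized_runtime_env;
};

struct QueuedRequest {
  TaskSpec task_spec;  // owns a copy, so it stays valid after the caller returns
};

std::deque<QueuedRequest> pending_requests;

void Enqueue(const TaskSpec &task_spec) {
  // The caller may destroy its TaskSpec right after this call returns, so we
  // copy it into the queue. Storing `const TaskSpec &` here would dangle.
  pending_requests.push_back(QueuedRequest{task_spec});
}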

@@ -1177,13 +1186,14 @@ void WorkerPool::PopWorker(const TaskSpecification &task_spec,
} else {
state.starting_workers_to_tasks[startup_token] = std::move(task_info);
}
} else if (status == PopWorkerStatus::TooManyStartingWorkerProcesses) {
DeleteRuntimeEnvIfPossible(task_spec.SerializedRuntimeEnv());

Contributor:

Actually, wouldn't the ideal approach be to not call DeleteRuntimeEnvIfPossible here and instead reuse the runtime env the next time we pop a worker?

Collaborator (author):

Yea, I agree. I'll add a TODO here and address it in the follow-up PR since it's not a regression. Does that sound good to you?

Contributor:

Good for me.

@SongGuyang (Contributor) left a comment:

Overall looks good. We can merge this first, once you add the TODO comments.

@jjyao (Collaborator, author) commented Sep 23, 2022:

Release tests look good: https://buildkite.com/ray-project/release-tests-pr/builds/16008#_. I didn't see any improvement or regression.
