
Handle starting worker throttling inside worker pool #28551

Merged (13 commits) on Sep 23, 2022

Conversation

@jjyao (Collaborator) commented Sep 15, 2022:

Signed-off-by: Jiajun Yao [email protected]

Why are these changes needed?

Currently, the worker pool throttles how many workers can be started simultaneously (i.e. maximum_startup_concurrency_). If a PopWorker call cannot be fulfilled due to this throttling, it fails and the caller (i.e. the local task manager) handles the retry. The issue is that when PopWorker fails, the local task manager releases the resources claimed by the task. As a result, even though the node already has enough tasks to use up all of its resources, it still reports available resources and attracts more tasks than it can handle. Instead of letting the local task manager handle the throttling, the worker pool should handle it internally, since throttling is a transient condition rather than a real error: it is effectively the same as a longer worker startup time.
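
As a rough illustration of the new flow (a simplified sketch, not the actual Ray code; the names mirror pending_pop_worker_requests and TryPendingPopWorkerRequests from the diff below, but the types and signatures here are assumptions):

#include <deque>
#include <functional>
#include <utility>

// Stand-ins for the real Ray types; everything below is trimmed down.
struct TaskSpec {};
using PopWorkerCallback = std::function<void(bool /*worker_delivered*/)>;

class ThrottledWorkerPool {
 public:
  explicit ThrottledWorkerPool(int maximum_startup_concurrency)
      : maximum_startup_concurrency_(maximum_startup_concurrency) {}

  void PopWorker(const TaskSpec &task_spec, PopWorkerCallback callback) {
    if (num_starting_workers_ >= maximum_startup_concurrency_) {
      // Old behavior: fail the callback so the local task manager retries,
      // releasing the task's resources in the meantime. New behavior: park
      // the request inside the pool until a startup slot or idle worker
      // becomes available.
      pending_pop_worker_requests_.push_back({task_spec, std::move(callback)});
      return;
    }
    ++num_starting_workers_;
    // ... start a worker process here; the callback fires once it registers.
  }

  // Greatly simplified PushWorker: a worker was returned (idle) or finished
  // starting, so free a startup slot and retry any parked requests. The real
  // PushWorker also manages the idle-worker pool.
  void PushWorker() {
    if (num_starting_workers_ > 0) {
      --num_starting_workers_;
    }
    TryPendingPopWorkerRequests();
  }

 private:
  struct PendingRequest {
    TaskSpec task_spec;  // stored by value so it outlives the caller's reference
    PopWorkerCallback callback;
  };

  // Re-issue queued PopWorker calls while capacity is available, mirroring
  // the TryPendingPopWorkerRequests call added in this PR's diff.
  void TryPendingPopWorkerRequests() {
    while (!pending_pop_worker_requests_.empty() &&
           num_starting_workers_ < maximum_startup_concurrency_) {
      PendingRequest request = std::move(pending_pop_worker_requests_.front());
      pending_pop_worker_requests_.pop_front();
      PopWorker(request.task_spec, std::move(request.callback));
    }
  }

  const int maximum_startup_concurrency_;
  int num_starting_workers_ = 0;
  std::deque<PendingRequest> pending_pop_worker_requests_;
};

Because a throttled request never surfaces as a failure, the local task manager keeps the task's resources allocated while the worker starts, which is the behavior change described above.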

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jjyao jjyao marked this pull request as ready for review September 20, 2022 18:19

@stephanie-wang (Contributor) left a comment:

LGTM!

I'm curious to see if scheduling throughput improves on test_many_tasks with this change.

@stephanie-wang (Contributor):

It may be too difficult to write one that isn't flaky, but you could also consider adding a Python test to check that the resource availability accounting is correct while workers are starting.

@@ -968,6 +970,10 @@ void WorkerPool::PushWorker(const std::shared_ptr<WorkerInterface> &worker) {
// TODO(SongGuyang): This worker will not be used forever. We should kill it.
state.idle_dedicated_workers[task_id] = worker;
}
// We either have an idle worker or a slot to start a new worker.
if (worker->GetWorkerType() == rpc::WorkerType::WORKER) {
TryPendingPopWorkerRequests(worker->GetLanguage());

Contributor:

Would it be better to invoke this function in OnWorkerStarted? I suggest this because the flag is_pending_registration is set to false in that function, and it is related to the count of concurrently starting worker processes.

Collaborator (author):

We don't only retry pending pop worker requests when is_pending_registration is set to false, but also when an idle worker is returned. PushWorker() handles both cases.

If we did it in OnWorkerStarted(), we might start more workers than necessary. Imagine a pre-started worker registers: OnWorkerStarted() is called first, so if we retry in that function we will start a new worker (since the just-registered worker hasn't been pushed yet). If we retry in PushWorker() instead, we can simply use the pre-started worker without starting a new one.
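
To make the ordering concrete, here is a rough sketch with hypothetical, trimmed-down types (not Ray's actual Worker/WorkerPool interfaces) of why retrying from PushWorker lets a queued request pick up the pre-started worker:

#include <deque>
#include <functional>

struct Worker {
  bool is_pending_registration = true;
};

struct IdlePool {
  std::deque<Worker *> idle_workers;
};

// OnWorkerStarted runs first and only clears the pending-registration flag;
// the worker is not yet visible as idle, so retrying queued PopWorker
// requests at this point could launch an unnecessary new worker process.
void OnWorkerStarted(Worker &worker) {
  worker.is_pending_registration = false;
}

// PushWorker runs afterwards, once the worker actually sits in the idle
// pool, so a queued request retried from here can reuse the pre-started
// worker instead of starting a fresh one.
void PushWorker(Worker &worker, IdlePool &pool,
                const std::function<void()> &try_pending_pop_worker_requests) {
  pool.idle_workers.push_back(&worker);
  try_pending_pop_worker_requests();
}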

} else if (status == PopWorkerStatus::TooManyStartingWorkerProcesses) {
DeleteRuntimeEnvIfPossible(task_spec.SerializedRuntimeEnv());
state.pending_pop_worker_requests.emplace_back(
PopWorkerRequest{task_spec, callback, allocated_instances_serialized_json});

Contributor:

Do we need to avoid the memory copy of task_spec? Maybe we can optimize this in the future and add a TODO here.

Collaborator (author):

I think we need to make a copy here. The caller passes this in as a reference; if we don't make a copy, the caller may destroy the object later on.
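
A minimal illustration of the lifetime issue (hypothetical types, not the actual Ray code): the queued request has to own a copy of the spec, because a stored reference would dangle once the caller destroys its object.

#include <deque>
#include <string>

struct TaskSpec {
  std::string serialized_runtime_env;
};

struct QueuedRequest {
  TaskSpec task_spec;  // owns a copy, so it stays valid after the caller returns
};

std::deque<QueuedRequest> pending_requests;

void Enqueue(const TaskSpec &task_spec) {
  // The caller may destroy its TaskSpec right after this call returns, so we
  // copy it into the queue. Storing `const TaskSpec &` here would dangle.
  pending_requests.push_back(QueuedRequest{task_spec});
}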

@@ -1177,13 +1186,14 @@ void WorkerPool::PopWorker(const TaskSpecification &task_spec,
} else {
state.starting_workers_to_tasks[startup_token] = std::move(task_info);
}
} else if (status == PopWorkerStatus::TooManyStartingWorkerProcesses) {
DeleteRuntimeEnvIfPossible(task_spec.SerializedRuntimeEnv());

Contributor:

Actually, wouldn't the ideal approach be to not call DeleteRuntimeEnvIfPossible here and instead reuse the runtime env the next time we pop a worker?

Collaborator (author):

Yea, I agree. I'll add a TODO here and address it in the follow-up PR since it's not a regression. Does that sound good to you?

Contributor:

Good for me.

@SongGuyang (Contributor) left a comment:

Overall looks good. We can merge this first, once you add the TODO comments.

@jjyao (Collaborator, author) commented Sep 23, 2022:

Release tests look good: https://buildkite.com/ray-project/release-tests-pr/builds/16008#_. I didn't see any improvement or regression.
