[Core][WorkerPool Reuse 1/n] Consolidate worker reuse code path #30349

scv119 · 2022-11-16T18:50:43Z

Why are these changes needed?

Today we track workers for dynamic_options-enabled actors and tasks separately in worker_pool. This results in boilerplate code and also makes it harder for workers to be cached. This PR simplifies that by keeping track of each worker's dynamic_options and comparing them against the new task's when the worker is being reused.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

src/ray/raylet/worker_pool.cc

architkulkarni

Overall the direction makes sense, a little confusing though so I just wanted to clarify.

Previously we had three types of task
(1) Actor creation with dynamic options
(2) Actor creation
(3) Task
And two types of worker:
(A) dedicated worker
(B) worker

(1) used its own map (A), while (2) and (3) drew from the list (B).

Prior to this PR, we could actually reuse a (3) worker for (2). But in the current PR, we disallow this. Is this intended / is my understanding correct?
Prior to this PR, we had a map for (A), which allowed (1) to look up the specific worker it needed. (The worker process needed to have been started with the correct command line options dynamic_options). But in the current PR, we got rid of (A) and don't seem to check for the specific worker anywhere (I would expect this logic to appear near where we scan through the list of workers and "skip if the runtime env doesn't match".) So it seems like (1) could pop an arbitrary worker that didn't have the correct dynamic options. Is this an issue?
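For readers following along, the pre-PR layout described above can be sketched roughly as follows. This is a minimal illustration, not the actual worker_pool code: `OldStylePool`, `WorkerId`, the string key, and both method names are hypothetical stand-ins.

```cpp
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a worker handle; not a type from the PR.
using WorkerId = int;

struct OldStylePool {
  // (A) dedicated workers, looked up by a key identifying what they were started for.
  std::unordered_map<std::string, WorkerId> dedicated;
  // (B) shared idle workers, used for plain actor creation (2) and normal tasks (3).
  std::deque<WorkerId> idle;

  // Case (1): only the worker registered under `key` is eligible.
  bool PopDedicated(const std::string &key, WorkerId *out) {
    auto it = dedicated.find(key);
    if (it == dedicated.end()) {
      return false;
    }
    *out = it->second;
    dedicated.erase(it);
    return true;
  }

  // Cases (2) and (3): any idle worker will do.
  bool PopShared(WorkerId *out) {
    if (idle.empty()) {
      return false;
    }
    *out = idle.front();
    idle.pop_front();
    return true;
  }
};
```

The concern in the second question is that once (A) is removed, case (1) goes through something like `PopShared` and can receive a worker started with the wrong options unless an explicit check is added.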

architkulkarni · 2022-11-17T17:05:43Z

src/ray/raylet/worker.h

-  void MarkBlocked();
-  void MarkUnblocked();
-  bool IsBlocked() const;
+  rpc::WorkerType GetWorkerType() const override;


For my education, why do we mark everything override here? Is it a good practice to always do it?

yeah it's a good practice; and linter won't allow me to compile if i don't do so.
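As a small illustration of why `override` helps, here is a simplified sketch. The classes below are hypothetical and far smaller than the real `WorkerInterface`, and the return type is a plain string rather than `rpc::WorkerType`. How the build enforces it is not stated in the thread; one plausible mechanism is a compiler warning such as Clang's `-Winconsistent-missing-override` being treated as an error.

```cpp
#include <string>

// Hypothetical, simplified interface for illustration only.
class WorkerInterface {
 public:
  virtual ~WorkerInterface() = default;
  virtual std::string GetWorkerType() const = 0;
};

class Worker : public WorkerInterface {
 public:
  // `override` asks the compiler to verify that this function really overrides
  // a virtual function in the base class. If the signature drifted (for example,
  // the `const` were dropped), this would fail to compile instead of silently
  // declaring an unrelated function.
  std::string GetWorkerType() const override { return "WORKER"; }
};
```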

architkulkarni · 2022-11-17T17:09:48Z

src/ray/raylet/worker.h

@@ -111,6 +111,12 @@ class WorkerInterface {
  /// Time when the last task was assigned to this worker.
  virtual const std::chrono::steady_clock::time_point GetAssignedTaskTime() const = 0;

+  /// Number of successful reuse of this worker.


The first time the worker is used, is the reuse count 0 or 1? (Can we add it to this comment?) If 0, should we call it UseCount?
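To make the naming question concrete, here is one possible convention, sketched under the assumption that the first assignment does not count as a reuse. The thread does not say which convention the PR adopts, and `WorkerUsage` and its methods are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical sketch; member and method names are illustrative only.
class WorkerUsage {
 public:
  // Called each time a task is assigned to the worker.
  void OnTaskAssigned() {
    if (used_once_) {
      ++reuse_count_;  // Only assignments after the first one count as reuse.
    }
    used_once_ = true;
  }

  // 0 after the first assignment, 1 after the second, and so on.
  std::int64_t ReuseCount() const { return reuse_count_; }

  // Total number of assignments, which is what a "UseCount" would report.
  std::int64_t UseCount() const { return used_once_ ? reuse_count_ + 1 : 0; }

 private:
  bool used_once_ = false;
  std::int64_t reuse_count_ = 0;
};
```

Under this convention the two counters differ by one once the worker has been used, which is why the doc comment should say which of them is meant.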

architkulkarni · 2022-11-17T17:27:39Z

src/ray/raylet/worker_pool.cc

+    // Start a new worker process.
+    if (task_spec.HasRuntimeEnv()) {
+      // create runtime env.
+      RAY_LOG(DEBUG) << "Creating runtime env for task/ " << task_spec.TaskId();


Is adding the / here intended?

architkulkarni · 2022-11-17T17:46:37Z

src/ray/raylet/worker_pool.cc

-      idle_of_all_languages_.erase(lit);
-      idle_of_all_languages_map_.erase(worker);
-      break;
+    // Actor worker can't be reused.


The comment sounds like "a worker that has been used for an actor cannot be used again", but do we actually mean "a worker that has been used at least once cannot be reused for an actor"? (That's what the code suggests)
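The second reading (the one the reviewer says the code suggests) could be sketched like this. The struct, field, and function names are hypothetical and only illustrate the rule being asked about, not the PR's implementation.

```cpp
// Hypothetical types; not the PR's actual code.
struct IdleWorker {
  int tasks_run = 0;  // How many tasks this worker has already executed.
};

// Reading suggested by the code: actor creation requires a fresh worker,
// while a normal task may take any idle worker.
bool CanReuseFor(const IdleWorker &worker, bool is_actor_creation_task) {
  if (is_actor_creation_task && worker.tasks_run > 0) {
    return false;
  }
  return true;
}
```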

scv119 · 2022-11-17T19:53:12Z

Prior to this PR, we could actually reuse a (3) worker for (2). But in the current PR, we disallow this. Is this intended / is my understanding correct?

ah for some reason I thought dynamic_options are always on for actors...

Prior to this PR, we had a map for (A), which allowed (1) to look up the specific worker it needed. (The worker process needed to have been started with the correct command line options dynamic_options). But in the current PR, we got rid of (A) and don't seem to check for the specific worker anywhere (I would expect this logic to appear near where we scan through the list of workers and "skip if the runtime env doesn't match".) So it seems like (1) could pop an arbitrary worker that didn't have the correct dynamic options. Is this an issue?

good catch. this looks like a regression.

architkulkarni · 2022-11-17T19:54:21Z

From my memory dynamic_options is for Java.

scv119 · 2022-11-18T01:14:19Z

updated by checking the dynamic_options when trying to reuse workers.

rkooo567

LGTM. One nit (regarding iterating all workers for checking dynamic options..). And I think it is the right direction to move dynamic_options to runtime env!

src/ray/raylet/worker_pool.cc

rkooo567 · 2022-11-21T13:56:37Z

src/ray/raylet/worker_pool.cc

-      idle_of_all_languages_map_.erase(worker);
-      break;
+    // Skip if the dynamic_options doesn't match.
+    if (LookupWorkerDynamicOptions(it->first->GetStartupToken()) != dynamic_options) {


Should we index this? It could be a bit expensive to iterate over all workers on each iteration? (Is it N^2?)

ah actually most of the time it's O(N). The LookupWorkerDynamicOptions is only O(1).
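The complexity argument can be illustrated with a rough sketch: a single pass over the idle list, with each worker's options fetched from a hash map keyed by startup token. `PoolSketch`, `IdleEntry`, `AddIdle`, `PopMatching`, and `Lookup` are all hypothetical names, and the real `LookupWorkerDynamicOptions` may be implemented differently.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real types.
using StartupToken = long long;
using DynamicOptions = std::vector<std::string>;

struct IdleEntry {
  StartupToken token;
};

class PoolSketch {
 public:
  void AddIdle(StartupToken token, DynamicOptions options) {
    options_by_token_[token] = std::move(options);
    idle_.push_back(IdleEntry{token});
  }

  // One pass over the idle list: O(N) iterations, each doing an
  // average O(1) hash lookup followed by an options comparison.
  bool PopMatching(const DynamicOptions &wanted, StartupToken *out) {
    for (auto it = idle_.begin(); it != idle_.end(); ++it) {
      // Skip if the dynamic options do not match.
      if (Lookup(it->token) != wanted) {
        continue;
      }
      *out = it->token;
      idle_.erase(it);
      return true;
    }
    return false;
  }

 private:
  // Returns the options recorded for `token`, or an empty list if unknown.
  const DynamicOptions &Lookup(StartupToken token) const {
    static const DynamicOptions kEmpty;
    auto it = options_by_token_.find(token);
    return it == options_by_token_.end() ? kEmpty : it->second;
  }

  std::list<IdleEntry> idle_;
  std::unordered_map<StartupToken, DynamicOptions> options_by_token_;
};
```

Note that the comparison itself is linear in the number of option strings, so the per-worker cost is constant only when the option lists are short.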

…project#30349) Today we track dynamic_options enabled actor and tasks separately in worker_pool. This results boilerplate code and also make it harder for worker to be cached. This PR simplifies that by keeping track of work's dynamic_options, and compare it against new task when being reused. Signed-off-by: Weichen Xu <[email protected]>

scv119 force-pushed the cache-worker branch from bbe7c45 to 25f793c Compare November 16, 2022 18:53

scv119 marked this pull request as ready for review November 16, 2022 18:54

scv119 assigned SongGuyang, fishbone, rkooo567 and architkulkarni Nov 16, 2022

scv119 commented Nov 16, 2022

scv119 assigned jjyao Nov 16, 2022

architkulkarni reviewed Nov 17, 2022

scv119 added the do-not-merge Do not merge this PR! label Nov 17, 2022

scv119 force-pushed the cache-worker branch from 4e8de98 to 16913f8 Compare November 18, 2022 01:11

scv119 changed the title ~~[Core][WorkerPool Reuse 1/n] Consolidate actor/task worker reuse code path~~ [Core][WorkerPool Reuse 1/n] Consolidate worker reuse code path Nov 18, 2022

scv119 removed the do-not-merge Do not merge this PR! label Nov 18, 2022

scv119 force-pushed the cache-worker branch from 16913f8 to a4c27e2 Compare November 20, 2022 06:27

scv119 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 20, 2022

rkooo567 approved these changes Nov 21, 2022

scv119 added 11 commits November 23, 2022 14:06

add

01cb002

add

8f1ba2b

add

bfef84b

add

15f45da

add

500bbe1

add

479d8d4

fix

e120dab

add

2feabf9

add

4b15e8e

add

809f7fa

add

855741e

scv119 added 4 commits November 23, 2022 14:06

add

8fb357e

add

555d898

add

91004df

address comments

e9c7509

scv119 force-pushed the cache-worker branch from 4dcc692 to e9c7509 Compare November 23, 2022 22:09

scv119 merged commit 7b81d31 into ray-project:master Nov 24, 2022
