-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][WorkerPool Reuse 1/n] Consolidate worker reuse code path #30349
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the direction makes sense, a little confusing though so I just wanted to clarify.
Previously we had three types of task
(1) Actor creation with dynamic options
(2) Actor creation
(3) Task
And two types of worker:
(A) dedicated worker
(B) worker
(1) used its own map (A), while (2) and (3) drew from the list (B).
- Prior to this PR, we could actually reuse a (3) worker for (2). But in the current PR, we disallow this. Is this intended / is my understanding correct?
- Prior to this PR, we had a map for (A), which allowed (1) to look up the specific worker it needed. (The worker process needed to have been started with the correct command line options
dynamic_options
). But in the current PR, we got rid of (A) and don't seem to check for the specific worker anywhere (I would expect this logic to appear near where we scan through the list of workers and "skip if the runtime env doesn't match".) So it seems like (1) could pop an arbitrary worker that didn't have the correct dynamic options. Is this an issue?
src/ray/raylet/worker.h
Outdated
void MarkBlocked(); | ||
void MarkUnblocked(); | ||
bool IsBlocked() const; | ||
rpc::WorkerType GetWorkerType() const override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my education, why do we mark everything override
here? Is it a good practice to always do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah it's a good practice; and linter won't allow me to compile if i don't do so.
src/ray/raylet/worker.h
Outdated
@@ -111,6 +111,12 @@ class WorkerInterface { | |||
/// Time when the last task was assigned to this worker. | |||
virtual const std::chrono::steady_clock::time_point GetAssignedTaskTime() const = 0; | |||
|
|||
/// Number of successful reuse of this worker. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first time the worker is used, is the reuse count 0 or 1? (Can we add it to this comment?) If 0, should we call it UseCount?
src/ray/raylet/worker_pool.cc
Outdated
// Start a new worker process. | ||
if (task_spec.HasRuntimeEnv()) { | ||
// create runtime env. | ||
RAY_LOG(DEBUG) << "Creating runtime env for task/ " << task_spec.TaskId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is adding the /
here intended?
src/ray/raylet/worker_pool.cc
Outdated
idle_of_all_languages_.erase(lit); | ||
idle_of_all_languages_map_.erase(worker); | ||
break; | ||
// Actor worker can't be reused. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment sounds like "a worker that has been used for an actor cannot be used again", but do we actually mean "a worker that has been used at least once cannot be reused for an actor"? (That's what the code suggests)
ah for some reason I thought dynamic_options are always on for actors...
good catch. this looks like a regression. |
From my memory |
updated by checking the dynamic_options when trying to reuse workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One nit (regarding iterating all workers for checking dynamic options..). And I think it is the right direction to move dynamic_options to runtime env!
idle_of_all_languages_map_.erase(worker); | ||
break; | ||
// Skip if the dynamic_options doesn't match. | ||
if (LookupWorkerDynamicOptions(it->first->GetStartupToken()) != dynamic_options) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we index this? This could be a bit expensive to iterate all workers for each iteration? (It's N^2?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah actually most of time it's O(N). The LookupWorkerDynamicOptions
is only O(1).
…project#30349) Today we track dynamic_options enabled actor and tasks separately in worker_pool. This results boilerplate code and also make it harder for worker to be cached. This PR simplifies that by keeping track of work's dynamic_options, and compare it against new task when being reused. Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
Today we track dynamic_options enabled actor and tasks separately in worker_pool. This results boilerplate code and also make it harder for worker to be cached. This PR simplifies that by keeping track of work's dynamic_options, and compare it against new task when being reused.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.