Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][WorkerPool Reuse 1/n] Consolidate worker reuse code path #30349

Merged
merged 15 commits into from
Nov 24, 2022

Conversation

scv119
Copy link
Contributor

@scv119 scv119 commented Nov 16, 2022

Why are these changes needed?

Today we track dynamic_options enabled actor and tasks separately in worker_pool. This results boilerplate code and also make it harder for worker to be cached. This PR simplifies that by keeping track of work's dynamic_options, and compare it against new task when being reused.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the direction makes sense, a little confusing though so I just wanted to clarify.

Previously we had three types of task
(1) Actor creation with dynamic options
(2) Actor creation
(3) Task
And two types of worker:
(A) dedicated worker
(B) worker

(1) used its own map (A), while (2) and (3) drew from the list (B).

  • Prior to this PR, we could actually reuse a (3) worker for (2). But in the current PR, we disallow this. Is this intended / is my understanding correct?
  • Prior to this PR, we had a map for (A), which allowed (1) to look up the specific worker it needed. (The worker process needed to have been started with the correct command line options dynamic_options). But in the current PR, we got rid of (A) and don't seem to check for the specific worker anywhere (I would expect this logic to appear near where we scan through the list of workers and "skip if the runtime env doesn't match".) So it seems like (1) could pop an arbitrary worker that didn't have the correct dynamic options. Is this an issue?

void MarkBlocked();
void MarkUnblocked();
bool IsBlocked() const;
rpc::WorkerType GetWorkerType() const override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my education, why do we mark everything override here? Is it a good practice to always do it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it's a good practice; and linter won't allow me to compile if i don't do so.

@@ -111,6 +111,12 @@ class WorkerInterface {
/// Time when the last task was assigned to this worker.
virtual const std::chrono::steady_clock::time_point GetAssignedTaskTime() const = 0;

/// Number of successful reuse of this worker.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first time the worker is used, is the reuse count 0 or 1? (Can we add it to this comment?) If 0, should we call it UseCount?

// Start a new worker process.
if (task_spec.HasRuntimeEnv()) {
// create runtime env.
RAY_LOG(DEBUG) << "Creating runtime env for task/ " << task_spec.TaskId();
Copy link
Contributor

@architkulkarni architkulkarni Nov 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is adding the / here intended?

idle_of_all_languages_.erase(lit);
idle_of_all_languages_map_.erase(worker);
break;
// Actor worker can't be reused.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment sounds like "a worker that has been used for an actor cannot be used again", but do we actually mean "a worker that has been used at least once cannot be reused for an actor"? (That's what the code suggests)

@scv119
Copy link
Contributor Author

scv119 commented Nov 17, 2022

Prior to this PR, we could actually reuse a (3) worker for (2). But in the current PR, we disallow this. Is this intended / is my understanding correct?

ah for some reason I thought dynamic_options are always on for actors...

Prior to this PR, we had a map for (A), which allowed (1) to look up the specific worker it needed. (The worker process needed to have been started with the correct command line options dynamic_options). But in the current PR, we got rid of (A) and don't seem to check for the specific worker anywhere (I would expect this logic to appear near where we scan through the list of workers and "skip if the runtime env doesn't match".) So it seems like (1) could pop an arbitrary worker that didn't have the correct dynamic options. Is this an issue?

good catch. this looks like a regression.

@architkulkarni
Copy link
Contributor

From my memory dynamic_options is for Java.

@scv119 scv119 added the do-not-merge Do not merge this PR! label Nov 17, 2022
@scv119 scv119 changed the title [Core][WorkerPool Reuse 1/n] Consolidate actor/task worker reuse code path [Core][WorkerPool Reuse 1/n] Consolidate worker reuse code path Nov 18, 2022
@scv119 scv119 removed the do-not-merge Do not merge this PR! label Nov 18, 2022
@scv119
Copy link
Contributor Author

scv119 commented Nov 18, 2022

updated by checking the dynamic_options when trying to reuse workers.

@scv119 scv119 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Nov 20, 2022
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One nit (regarding iterating all workers for checking dynamic options..). And I think it is the right direction to move dynamic_options to runtime env!

src/ray/raylet/worker_pool.cc Outdated Show resolved Hide resolved
idle_of_all_languages_map_.erase(worker);
break;
// Skip if the dynamic_options doesn't match.
if (LookupWorkerDynamicOptions(it->first->GetStartupToken()) != dynamic_options) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we index this? This could be a bit expensive to iterate all workers for each iteration? (It's N^2?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah actually most of time it's O(N). The LookupWorkerDynamicOptions is only O(1).

@scv119 scv119 merged commit 7b81d31 into ray-project:master Nov 24, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…project#30349)

Today we track dynamic_options enabled actor and tasks separately in worker_pool. This results boilerplate code and also make it harder for worker to be cached. This PR simplifies that by keeping track of work's dynamic_options, and compare it against new task when being reused.

Signed-off-by: Weichen Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants