[core] oom killer policy: group by owner id #31272

Merged
scv119 merged 25 commits into ray-project:master on Jan 20, 2023

@clarng clarng commented Dec 21, 2022

Why are these changes needed?

This PR adds a group-by-owner worker killing policy. Compared to the current OOM killer policy (simple LIFO), it produces less thrashing: it favors letting tasks of the same owner (the task caller) run over tasks from a different owner, so a group can make progress and free up its resources when its tasks complete.

This helps in cases such as Tune with multiple trials, where multiple groups of datasets are processed in parallel. With this policy, the datasets tend to be processed so that they complete one by one, which reduces resource contention.
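
As a rough illustration of the grouping idea described above, the sketch below buckets running tasks by owner id. The `Task` struct, its fields, and `GroupByOwner` are hypothetical names chosen for this example; they are not the types used in this PR.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified task record; not Ray's actual types.
struct Task {
  std::string owner_id;   // id of the caller that submitted the task
  int64_t start_time_ms;  // when the task started running
};

// Buckets tasks by owner id so that a policy can reason about whole
// groups of tasks instead of individual tasks.
std::map<std::string, std::vector<Task>> GroupByOwner(
    const std::vector<Task> &tasks) {
  std::map<std::string, std::vector<Task>> groups;
  for (const Task &task : tasks) {
    groups[task.owner_id].push_back(task);
  }
  return groups;
}
```

In the Tune scenario, tasks submitted by one trial would land in one bucket and tasks from another trial in a different bucket, so a policy could keep killing from a single bucket while the other runs to completion.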

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarng clarng changed the title from "Gpolicy" to "[core] oom killer policy: group by owner id" Dec 22, 2022
@clarng clarng mentioned this pull request Jan 7, 2023
7 tasks
scv119 pushed a commit that referenced this pull request Jan 9, 2023
Why are these changes needed?
[No op change PR] Allow infinite oom retry when value is set to -1.

Some minor simplification to exponential backoff class

To be used by group-by-owner worker killing policy (draft PR #31272)

Signed-off-by: Clarence Ng <[email protected]>
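
The "-1 means infinite retry" convention mentioned in this commit message could look roughly like the sketch below. `ShouldRetryOnOom` and its parameters are hypothetical names for illustration only, not the function changed in that commit.

```cpp
#include <cstdint>

// Hypothetical helper; the name and signature are illustrative only.
// A max_retries value of -1 means "retry without limit"; any other
// value caps the number of retries.
bool ShouldRetryOnOom(int64_t max_retries, int64_t attempts_so_far) {
  if (max_retries == -1) {
    return true;
  }
  return attempts_so_far < max_retries;
}
```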
@clarng clarng marked this pull request as ready for review January 11, 2023 19:04
@scv119 scv119 commented Jan 11, 2023

Will take a look tonight! By the way, can you add a description of which scenarios this new policy helps with?

@ericl ericl left a comment

LGTM at a high level. I'll leave the detailed review to you, @scv119.

@ericl ericl removed their assignment Jan 11, 2023
@ericl ericl self-assigned this Jan 11, 2023
@ericl ericl commented Jan 11, 2023

Actually, there is one piece I think I misunderstand: shouldn't we kill from the "largest group" always to ensure liveness? My understanding is the algorithm should proceed as follows:

When a raylet needs to select a task to kill due to memory pressure:

  1. Sort all retryable task groups by descending order of size.
  2. Select the current largest group, and kill the newest task in the group. If the task was the last task in the group, do not allow OOM retries for the task. Otherwise, the task can be retried.
  3. If there are no retryable task groups, kill a non-retryable task.

This seems to be different from what is implemented in the policy.
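
The three steps above could be sketched roughly as follows. This illustrates the proposed algorithm only and is not the code in this PR; `Task`, `Group`, `KillDecision`, and `SelectTaskToKill` are hypothetical names. Two assumptions are made: a linear scan for the largest group stands in for the sort in step 1, and step 3 picks the newest task of the first non-retriable group, since the comment does not say which non-retriable task to kill. "The last task in the group" is read here as the only task left in the group.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified types; not Ray's actual classes.
struct Task {
  std::string id;
  int64_t start_time_ms;
};

struct Group {
  std::string owner_id;
  bool retriable;
  std::vector<Task> tasks;
};

struct KillDecision {
  std::string task_id;
  bool should_retry;
};

// Returns the most recently started task. The group must be non-empty.
static const Task &NewestTask(const Group &group) {
  return *std::max_element(
      group.tasks.begin(), group.tasks.end(),
      [](const Task &a, const Task &b) {
        return a.start_time_ms < b.start_time_ms;
      });
}

std::optional<KillDecision> SelectTaskToKill(const std::vector<Group> &groups) {
  const Group *largest_retriable = nullptr;
  const Group *first_non_retriable = nullptr;
  for (const Group &group : groups) {
    if (group.tasks.empty()) {
      continue;  // nothing to kill in an empty group
    }
    if (group.retriable) {
      // Step 1: track the largest retriable group (ties keep the first one).
      if (largest_retriable == nullptr ||
          group.tasks.size() > largest_retriable->tasks.size()) {
        largest_retriable = &group;
      }
    } else if (first_non_retriable == nullptr) {
      first_non_retriable = &group;
    }
  }
  if (largest_retriable != nullptr) {
    // Step 2: kill the newest task; only allow a retry if other tasks
    // remain in the group.
    const bool should_retry = largest_retriable->tasks.size() > 1;
    return KillDecision{NewestTask(*largest_retriable).id, should_retry};
  }
  if (first_non_retriable != nullptr) {
    // Step 3: no retriable groups, so fall back to a non-retriable task.
    return KillDecision{NewestTask(*first_non_retriable).id, false};
  }
  return std::nullopt;  // no running tasks at all
}
```

Because the largest group always loses its newest task first, smaller groups are left alone until the largest one shrinks to their size, which is the liveness property the comment asks about.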

@stephanie-wang stephanie-wang left a comment

Mostly cosmetic, ping me for review again once the sort policy is updated!

AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
@ericl ericl added the @author-action-required label Jan 13, 2023
@clarng clarng removed the @author-action-required label Jan 17, 2023
@ericl ericl removed their assignment Jan 17, 2023
@clarng clarng added the tests-ok label Jan 18, 2023
@clarng clarng commented Jan 18, 2023

gentle ping @stephanie-wang @scv119

@stephanie-wang stephanie-wang left a comment

LGTM! Just some suggestions for cleanup.

src/ray/common/ray_config_def.h (resolved)
src/ray/raylet/worker_killing_policy_group_by_owner.cc (outdated, resolved)
src/ray/raylet/worker_killing_policy_group_by_owner.cc (outdated, resolved)
TaskID owner_id_;

/// Whether the tasks are retriable.
bool retriable_;

Contributor:

I think it's cleaner to make this and the owner_id_ const, since these are also the group key.

Contributor Author:

It fails when I try to make either one const; it seems we need to implement swap / move functions for that to work. Added a TODO.

Contributor:

I would be surprised if we need swap / move, probably somewhere we are passing some mutable ref when it should be a const &.
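
The thread does not identify the actual cause of the failure. One possibility, offered here only as an assumption, is that the groups are assigned or sorted by value somewhere: `const` data members delete a class's implicit copy and move assignment operators, and algorithms such as `std::sort` require move-assignable elements. The sketch below uses hypothetical struct names to show the effect.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-ins for the group class discussed above.
struct GroupMutableKey {
  std::string owner_id;
  bool retriable;
};

struct GroupConstKey {
  const std::string owner_id;
  const bool retriable;
};

// With mutable members, the implicit assignment operators are available.
static_assert(std::is_move_assignable<GroupMutableKey>::value,
              "mutable members keep assignment available");

// With const members, both implicit assignment operators are deleted,
// so the type cannot be swapped or sorted by value.
static_assert(!std::is_move_assignable<GroupConstKey>::value,
              "const members delete move assignment");
static_assert(!std::is_copy_assignable<GroupConstKey>::value,
              "const members delete copy assignment");

// Construction is unaffected: the type can still be copied into a container.
static_assert(std::is_copy_constructible<GroupConstKey>::value,
              "const members do not prevent copy construction");
```

If this were the cause, sorting pointers or indices to the groups rather than the groups themselves would avoid the need for assignment. A mutable reference passed where a `const &` was intended, as suggested above, would be a separate cause that this sketch does not cover.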

src/ray/raylet/worker_killing_policy_group_by_owner.cc (outdated, resolved)
src/ray/raylet/worker_killing_policy_group_by_owner.h (outdated, resolved)
@stephanie-wang stephanie-wang commented

@clarng, can you look into the above comments briefly before we merge? I'm a bit surprised by them but if it turns out to take >1 day to solve we can just merge this now.

@scv119 scv119 merged commit bb2c58d into ray-project:master Jan 20, 2023
andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Andrea Pisoni <[email protected]>