[core] oom killer policy: group by owner id #31272

clarng · 2022-12-21T17:47:15Z

Why are these changes needed?

Adds a group-by-owner worker killing policy. Compared to the current OOM killer policy (simple LIFO), it produces less thrashing: it favors letting tasks that share an owner (the task's caller) run over tasks from a different owner, so a group of tasks can make progress and free its resources when it completes.

This helps in cases such as Tune with multiple trials, where multiple groups of datasets are processed in parallel. With this policy, the datasets tend to be processed so that they complete one by one, which reduces resource contention.

Related issue number

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Clarence Ng <[email protected]>

Why are these changes needed?

[No op change PR]

Allow infinite oom retry when value is set to -1.
Some minor simplification to exponential backoff class.
To be used by group-by-owner worker killing policy (draft PR #31272).

Signed-off-by: Clarence Ng <[email protected]>
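The "-1 means infinite retry" convention from the commit message above can be illustrated with a small sketch. `OomRetryBudget` and `TryConsume` are hypothetical names chosen for this example and are not taken from Ray's code; the sketch only assumes that -1 is the sentinel and that non-negative values are finite budgets.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not Ray's actual retry bookkeeping.
class OomRetryBudget {
 public:
  // max_retries == -1 means "retry forever"; a value >= 0 is a finite budget.
  explicit OomRetryBudget(int64_t max_retries) : remaining_(max_retries) {}

  // Returns true if another retry is allowed.
  // A finite budget is decremented; an infinite budget is left untouched.
  bool TryConsume() {
    if (remaining_ == kInfinite) return true;
    if (remaining_ <= 0) return false;
    --remaining_;
    return true;
  }

 private:
  static constexpr int64_t kInfinite = -1;
  int64_t remaining_;
};
```

Keeping the sentinel check first means the counter is never decremented in the infinite case, so it cannot drift into the finite range.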

Signed-off-by: Clarence Ng <[email protected]>

scv119 · 2023-01-11T20:01:48Z

Will take a look tonight! By the way, can you add a description of which scenarios this new policy helps with?

src/ray/raylet/worker_killing_policy_group_by_owner.h

ericl

LGTM at a high level. I'll leave the detailed review to you @scv119

ericl · 2023-01-11T21:11:22Z

Actually, there is one piece I think I misunderstand: shouldn't we kill from the "largest group" always to ensure liveness? My understanding is the algorithm should proceed as follows:

When a raylet needs to select a task to kill due to memory pressure:

1. Sort all retryable task groups by descending order of size.
2. Select the current largest group, and kill the newest task in the group. If the task was the last task in the group, do not allow OOM retries for the task. Otherwise, the task can be retried.
3. If there are no retryable task groups, kill a non-retryable task.

This seems to be different from what is implemented in the policy.
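The selection procedure proposed in this comment can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: `Task`, `Group`, `KillDecision`, and `SelectTaskToKill` are hypothetical names, "newest" is assumed to mean the largest start time, and the choice of which non-retryable task to kill (unspecified in the comment) is assumed to be the newest task of the first non-retryable group.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical types for illustration; not Ray's actual classes.
struct Task {
  std::string id;
  int64_t start_time;  // larger value means the task started later (newer)
};

struct Group {
  std::string owner_id;
  bool retriable;
  std::vector<Task> tasks;
};

struct KillDecision {
  std::string task_id;
  bool should_retry;
};

// Returns the newest task of a non-empty group.
const Task &NewestTask(const Group &group) {
  assert(!group.tasks.empty());
  return *std::max_element(
      group.tasks.begin(), group.tasks.end(),
      [](const Task &a, const Task &b) { return a.start_time < b.start_time; });
}

// Picks the task to kill. Only the largest retriable group is needed, so a
// single pass for the maximum stands in for a full descending sort.
std::optional<KillDecision> SelectTaskToKill(const std::vector<Group> &groups) {
  const Group *largest_retriable = nullptr;
  const Group *first_non_retriable = nullptr;
  for (const Group &group : groups) {
    if (group.tasks.empty()) continue;
    if (group.retriable) {
      if (largest_retriable == nullptr ||
          group.tasks.size() > largest_retriable->tasks.size()) {
        largest_retriable = &group;
      }
    } else if (first_non_retriable == nullptr) {
      first_non_retriable = &group;
    }
  }
  if (largest_retriable != nullptr) {
    // The last remaining task of a group is killed without OOM retries.
    const bool should_retry = largest_retriable->tasks.size() > 1;
    return KillDecision{NewestTask(*largest_retriable).id, should_retry};
  }
  if (first_non_retriable != nullptr) {
    // Assumption: fall back to the newest task of the first such group.
    return KillDecision{NewestTask(*first_non_retriable).id, false};
  }
  return std::nullopt;  // nothing to kill
}
```

Under these assumptions, always shrinking the largest retryable group means the smaller groups keep running and can finish, which is the liveness argument made in the comment.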

Signed-off-by: Clarence Ng <[email protected]>

stephanie-wang

Mostly cosmetic, ping me for review again once the sort policy is updated!

src/ray/raylet/worker_killing_policy_group_by_owner.cc

src/ray/raylet/worker_killing_policy_group_by_owner.h

src/ray/raylet/worker_killing_policy_group_by_owner.cc

src/ray/raylet/worker_killing_policy_group_by_owner.h


src/ray/raylet/worker_killing_policy_group_by_owner_test.cc

Signed-off-by: Clarence Ng <[email protected]>

clarng · 2023-01-18T15:29:59Z

gentle ping @stephanie-wang @scv119

stephanie-wang

LGTM! Just some suggestions for cleanup.

src/ray/common/ray_config_def.h

src/ray/raylet/worker_killing_policy_group_by_owner.cc

stephanie-wang · 2023-01-18T15:58:29Z

src/ray/raylet/worker_killing_policy_group_by_owner.h

+  TaskID owner_id_;
+
+  /// Whether the tasks are retriable.
+  bool retriable_;


I think it's cleaner to make this and the owner_id_ const, since these are also the group key.

clarng · 2023-01-19

It fails when I try to make either one const; it seems we need to implement swap / move functions for that to work. Added a TODO.

stephanie-wang · 2023-01-19

I would be surprised if we need swap / move; probably somewhere we are passing a mutable ref when it should be a const &.

clarng · 2023-01-19

Attaching the error:

https://gist.github.com/clarng/3aa53a7fea007f2dc0e83f008bc5d355
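The gist is not reproduced here, so the actual cause is not known from this thread. One general C++ rule is relevant to the discussion, though: a class with const data members loses its implicit copy and move assignment operators, and operations that reorder elements in place (for example `std::sort`, or `std::vector::erase` in the middle of a vector) require assignable elements. The sketch below uses hypothetical stand-in structs, not Ray's actual group class, to show the rule at compile time.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-ins for the group class; not Ray's actual definitions.
struct GroupWithConstKey {
  const std::string owner_id;
  const bool retriable;
};

struct GroupWithMutableKey {
  std::string owner_id;
  bool retriable;
};

// Const data members delete the implicit copy and move assignment operators.
static_assert(!std::is_copy_assignable_v<GroupWithConstKey>);
static_assert(!std::is_move_assignable_v<GroupWithConstKey>);

// Construction still works; the const members are simply copied.
static_assert(std::is_copy_constructible_v<GroupWithConstKey>);
static_assert(std::is_move_constructible_v<GroupWithConstKey>);

// Without const, both assignment operators are generated as usual.
static_assert(std::is_copy_assignable_v<GroupWithMutableKey>);
static_assert(std::is_move_assignable_v<GroupWithMutableKey>);
```

If the error does come from this rule, a common workaround is to keep the members non-const but private and expose them only through const accessors, which preserves assignability while still preventing callers from changing the key.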

src/ray/raylet/worker_killing_policy_group_by_owner.cc

src/ray/raylet/worker_killing_policy_group_by_owner.h

Signed-off-by: Clarence Ng <[email protected]>

stephanie-wang · 2023-01-19T16:17:53Z

@clarng, can you look into the above comments briefly before we merge? I'm a bit surprised by them but if it turns out to take >1 day to solve we can just merge this now.

Signed-off-by: Clarence Ng <[email protected]>

Adds a group-by-owner worker killing policy. Compared to the current OOM killer policy (simple LIFO), it produces less thrashing: it favors letting tasks that share an owner (the task's caller) run over tasks from a different owner, so a group of tasks can make progress and free its resources when it completes.

This helps in cases such as Tune with multiple trials, where multiple groups of datasets are processed in parallel. With this policy, the datasets tend to be processed so that they complete one by one, which reduces resource contention.

Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Andrea Pisoni <[email protected]>

clarng added 5 commits December 19, 2022 21:30

[core] add option for raylet to inform whether a task should be retried

1410fce

Signed-off-by: Clarence Ng <[email protected]>

[core] add option for raylet to inform whether a task should be retried

f6f33b2

Signed-off-by: Clarence Ng <[email protected]>

[core] group by owner policy

b03de17

Signed-off-by: Clarence Ng <[email protected]>

[core] group by owner policy

316b823

Signed-off-by: Clarence Ng <[email protected]>

[core] group by owner policy

68ca544

Signed-off-by: Clarence Ng <[email protected]>

clarng changed the title from "Gpolicy" to "[core] oom killer policy: group by owner id" Dec 22, 2022

clarng added 3 commits January 4, 2023 09:38

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

e1b04c8

Signed-off-by: Clarence Ng <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

a7f9a9c

Signed-off-by: Clarence Ng <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

6edb922

clarng mentioned this pull request Jan 7, 2023

[core] allow infinite oom retry #31509

Merged

7 tasks

clarng added 7 commits January 9, 2023 21:47

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

0b85002

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

fd959af

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

db1305b

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

6cf64f3

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

de6b9d4

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

a4442a5

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

3650050

Signed-off-by: Clarence Ng <[email protected]>

clarng requested review from stephanie-wang, rkooo567, scv119 and ericl January 11, 2023 19:04

clarng assigned ericl, scv119 and stephanie-wang Jan 11, 2023

clarng marked this pull request as ready for review January 11, 2023 19:04

ericl reviewed Jan 11, 2023

src/ray/raylet/worker_killing_policy_group_by_owner.h (outdated)

ericl reviewed Jan 11, 2023

ericl removed their assignment Jan 11, 2023

ericl self-assigned this Jan 11, 2023

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

12ece5e

Signed-off-by: Clarence Ng <[email protected]>

stephanie-wang requested changes Jan 12, 2023

scv119 reviewed Jan 12, 2023

src/ray/raylet/worker_killing_policy_group_by_owner.cc

scv119 reviewed Jan 12, 2023

src/ray/raylet/worker_killing_policy_group_by_owner.h

scv119 reviewed Jan 12, 2023

src/ray/raylet/worker_killing_policy_group_by_owner.h (outdated)

ericl added the @author-action-required label Jan 13, 2023

clarng added 3 commits January 16, 2023 10:08

[core] oom killer policy: group by owner id

3eded6e

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

2675d65

Signed-off-by: Clarence Ng <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

ea26166

Signed-off-by: Clarence Ng <[email protected]>

clarng removed the @author-action-required label Jan 17, 2023

ericl reviewed Jan 17, 2023

src/ray/raylet/worker_killing_policy_group_by_owner_test.cc

ericl removed their assignment Jan 17, 2023

clarng added 2 commits January 17, 2023 13:53

[core] oom killer policy: group by owner id

3269f65

Signed-off-by: Clarence Ng <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

9b5b5a4

Signed-off-by: Clarence Ng <[email protected]>

clarng added the tests-ok label Jan 18, 2023

stephanie-wang approved these changes Jan 18, 2023

clarng added 2 commits January 18, 2023 16:26

[core] oom killer policy: group by owner id

2211169

Signed-off-by: Clarence Ng <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

b661ef5

Signed-off-by: Clarence Ng <[email protected]>

clarng mentioned this pull request Jan 19, 2023

[core] release test for nested air (tune) oom #31768

Merged

7 tasks

Merge branch 'master' of https://github.com/ray-project/ray into gpolicy

417ba36

Signed-off-by: Clarence Ng <[email protected]>

[core] oom killer policy: group by owner id

afcb951

Signed-off-by: Clarence Ng <[email protected]>

scv119 merged commit bb2c58d into ray-project:master Jan 20, 2023
