-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] oom killer policy: group by owner id #31272
Conversation
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Why are these changes needed? [No op change PR] Allow infinite oom retry when value is set to -1. Some minor simplification to exponential backoff class To be used by group-by-owner worker killing policy (draft PR #31272) Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
will take a look tonight! btw, can you add description on what scenario this new policy helps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM to me at a high level. I'll leave the detailed review to you @scv119
Actually, there is one piece I think I misunderstand: shouldn't we kill from the "largest group" always to ensure liveness? My understanding is the algorithm should proceed as follows: When a raylet needs to select a task to kill due to memory pressure:
This seems to be different from what is implemented in the policy. |
Signed-off-by: Clarence Ng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly cosmetic, ping me for review again once the sort policy is updated!
Why are these changes needed? [No op change PR] Allow infinite oom retry when value is set to -1. Some minor simplification to exponential backoff class To be used by group-by-owner worker killing policy (draft PR #31272) Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
gentle ping @stephanie-wang @scv119 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just some suggestions for cleanup.
TaskID owner_id_; | ||
|
||
/// Whether the tasks are retriable. | ||
bool retriable_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's cleaner to make this and the owner_id_ const, since these are also the group key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It fails when i try to make either, seems we need to implement swap / move functions for that to work - added TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be surprised if we need swap / move, probably somewhere we are passing some mutable ref when it should be a const &.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attaching the error
https://gist.github.com/clarng/3aa53a7fea007f2dc0e83f008bc5d355
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
Signed-off-by: Clarence Ng <[email protected]>
@clarng, can you look into the above comments briefly before we merge? I'm a bit surprised by them but if it turns out to take >1 day to solve we can just merge this now. |
Signed-off-by: Clarence Ng <[email protected]>
Group by owner worker killing policy. Comparing to the current oom killer policy (simple lifo) it produces less thrashing as it tries to allow tasks of the same owner (the task caller) to execute in favor of tasks from a different owner, thereby allowing progress and freeing up of resources when the group of tasks complete. Helps in cases such as Tune multiple trials where multiple groups of datasets are processed in parallel, with this policy it will try to process datasets so it completes one by one to reduce resource contention Signed-off-by: Clarence Ng <[email protected]> Signed-off-by: Andrea Pisoni <[email protected]>
Why are these changes needed?
Group by owner worker killing policy. Comparing to the current oom killer policy (simple lifo) it produces less thrashing as it tries to allow tasks of the same owner (the task caller) to execute in favor of tasks from a different owner, thereby allowing progress and freeing up of resources when the group of tasks complete.
Helps in cases such as Tune multiple trials where multiple groups of datasets are processed in parallel, with this policy it will try to process datasets so it completes one by one to reduce resource contention
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.