-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Placement Group] Fix the high load bug from the placement group #19277
Conversation
/// \param resource_map_filter When returning the resource map, the returned result will | ||
/// only contain the keys in the filter. | ||
virtual ray::gcs::NodeResourceInfoAccessor::ResourceMap GetResourceTotals( | ||
const std::unordered_map<std::string, double> &resource_map_filter) const = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to change this to be like; if you think this is better, lmk
- GetResourceTotals(std::vectorstd::string resources_names): This will incur additional copy to generate a new vector, but the overhead might be quite small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment says only key is used.
@@ -102,7 +102,8 @@ void NewPlacementGroupResourceManager::CommitBundle( | |||
const auto &string_id_map = cluster_resource_scheduler_->GetStringIdMap(); | |||
const auto &task_resource_instances = *bundle_state->resources_; | |||
|
|||
for (const auto &resource : bundle_spec.GetFormattedResources()) { | |||
const auto &resources = bundle_spec.GetFormattedResources(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I just notice this one is also used later.
if (resource_map_filter.count(resource_name) == 0u) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this probably looks better
if (!resource_map_filter.count(resource_name)) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this was the original code, and the lint somehow complains it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ehh, @mwtian can we disable this lint rule?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0u is really strange
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I wait until linter rule is changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lint rule is readability-implicit-bool-conversion
. Writing resource_map_filter.count(resource_name) == 0
should also work. My opinion is that it is better to be explicit in integer or pointer to boolean conversions. For example, this helps avoiding bugs where an enum / integer status code is treated as a boolean. But if the consensus is to disable (1) integer-to-boolean implicit conversion check, or (2) pointer-to-boolean implicit conversion check or both, I can do that too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we ensure this doesn't regress in the future? Do we have a performance number trackable in the nightly dashboard?
@ericl I am planning to add tests to guard against regression (in the next PR). (I think I will write more details in the performance doc), but the brief thought is to
|
Also for the correctness regression, the modified unit test should cover (that’s the repro, and I verified it didn’t work before this pr) |
for more details about tests; #18919 (and there might be more stress tests added based on the system / performance analysis document I will write next week to avoid regressions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plan sounds good!
…ement group (ray-project#19277)" (ray-project#19327)" This reverts commit 8c152bd.
Why are these changes needed?
Here's the current flow of creating a placement group.
GetResourceTotal
which generates the grpc malloced map that contains the whole resources of the node.This PR fixes the issue by only updating newly created resources.
I verified #18409 is fixed after this PR is merged (and added the proper test).
NOTE: We should eventually would like to consolidate "creating new resources" into the existing resource pulling path. But this approach has pros and cons. e.g., one of cons is that placement group scheduling will has constant overhead, which resource pulling interval (because it takes at least 100ms to propagate newly created resources information to other nodes), though we can add additional mechanism to avoid that. we will discuss whether or not if we need this in the design doc about the pg performance. For making the current "known workloads" work, this PR should be sufficient.
Related issue number
Fixes #18409 and #19098 (verified that this is also fixed).
Checks
scripts/format.sh
to lint the changes in this PR.