
[Placement Group] Fix the high load bug from the placement group #19277

Merged: 5 commits into ray-project:master on Oct 12, 2021

Conversation

rkooo567 (Contributor) commented Oct 10, 2021

Why are these changes needed?

Here's the current flow of creating a placement group.

  • GCS finds nodes on which to schedule a pg.
  • GCS prepares / commits bundles to those nodes.
  • Once a bundle is committed, the raylet sends an RPC to create the new pg-formatted resources.
    • We need this RPC because the existing resource-pulling path currently cannot be used to "create" new resources. It only "updates existing resources".
  • The problem was here. As you can see from the screenshot below, this RPC has a huge overhead, which causes instability.
    • In the existing code path, it always calls GetResourceTotals, which generates a grpc-malloced map containing the node's entire resource set.
    • So, after the whole resource snapshot is sent to the GCS, the GCS detects the new resources and creates them. This had huge overhead because the size of the resource map keeps growing as you create more placement groups (each pg creates new pg-formatted resources).

This PR fixes the issue by only updating newly created resources.

[Screenshot: Screen Shot 2021-10-11 at 12 30 07 AM]

I verified #18409 is fixed after this PR is merged (and added the proper test).

NOTE: We would eventually like to consolidate "creating new resources" into the existing resource-pulling path. But that approach has pros and cons; e.g., one con is that placement group scheduling would then carry a constant overhead equal to the resource-pulling interval (it takes at least 100ms to propagate newly created resource information to other nodes), though we could add an additional mechanism to avoid that. We will discuss whether we need this in the design doc about pg performance. For making the current "known workloads" work, this PR should be sufficient.

Related issue number

Fixes #18409 and #19098 (verified that this is also fixed).

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```cpp
/// \param resource_map_filter When returning the resource map, the returned result will
/// only contain the keys in the filter.
virtual ray::gcs::NodeResourceInfoAccessor::ResourceMap GetResourceTotals(
    const std::unordered_map<std::string, double> &resource_map_filter) const = 0;
```
rkooo567 (author):
NOTE: This method is only used by the placement group now. Technically the value is not required, but if we only pass the list of resource names, we need an unnecessary copy in the code below.

[Screenshot: Screen Shot 2021-10-11 at 12 37 43 AM]

rkooo567 (author):

I am happy to change this to the following; if you think it is better, let me know:

  • GetResourceTotals(std::vector<std::string> resource_names): This will incur an additional copy to generate a new vector, but the overhead might be quite small.

Reviewer (Contributor):

Maybe add a comment saying that only the key is used.

@rkooo567 rkooo567 changed the title [WIP][Placement Group] Fix the high load bug from the placement group [Placement Group] Fix the high load bug from the placement group Oct 11, 2021
```diff
@@ -102,7 +102,8 @@ void NewPlacementGroupResourceManager::CommitBundle(
   const auto &string_id_map = cluster_resource_scheduler_->GetStringIdMap();
   const auto &task_resource_instances = *bundle_state->resources_;

-  for (const auto &resource : bundle_spec.GetFormattedResources()) {
+  const auto &resources = bundle_spec.GetFormattedResources();
```
Reviewer (Contributor):

This should be the same?

Reviewer (Contributor):

Sorry, I just noticed this one is also used later.

Comment on lines +1132 to +1134:

```cpp
if (resource_map_filter.count(resource_name) == 0u) {
  continue;
}
```
Reviewer (Contributor):

I feel this probably looks better:

```cpp
if (!resource_map_filter.count(resource_name)) {
```

rkooo567 (author):

Actually this was the original code, and the lint somehow complains about it...

Reviewer (Contributor):

[image]

Reviewer (Contributor):

Ehh, @mwtian can we disable this lint rule?

Reviewer (Contributor):

0u is really strange

rkooo567 (author):

Should I wait until the linter rule is changed?

Reviewer (Member):

The lint rule is readability-implicit-bool-conversion. Writing resource_map_filter.count(resource_name) == 0 should also work. My opinion is that it is better to be explicit in integer- or pointer-to-boolean conversions. For example, this helps avoid bugs where an enum / integer status code is treated as a boolean. But if the consensus is to disable (1) the integer-to-boolean implicit conversion check, (2) the pointer-to-boolean implicit conversion check, or both, I can do that too.

ericl (Contributor) left a comment:

How do we ensure this doesn't regress in the future? Do we have a performance number trackable in the nightly dashboard?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2021
rkooo567 (author):

@ericl I am planning to add tests to guard against regression (in the next PR). (I think I will write more details in the performance doc), but the brief thought is to

rkooo567 (author):

Also, for the correctness regression, the modified unit test should cover it (that's the repro, and I verified it didn't work before this PR).

rkooo567 (author) commented Oct 12, 2021:

For more details about tests, see #18919 (and there might be more stress tests added based on the system / performance analysis document I will write next week to avoid regressions).

[Screenshot: Screen Shot 2021-10-12 at 4 15 13 AM]

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 12, 2021
ericl (Contributor) left a comment:

Plan sounds good!

@ericl ericl merged commit 4360b99 into ray-project:master Oct 12, 2021
ericl added a commit that referenced this pull request Oct 12, 2021
wuisawesome pushed a commit that referenced this pull request Oct 12, 2021
rkooo567 added a commit to rkooo567/ray that referenced this pull request Oct 12, 2021

Successfully merging this pull request may close these issues.

Placement groups hang because resources updates are lost under stress