
[Placement Group] Fix the high load bug from the placement group #19277

Merged: 5 commits into ray-project:master on Oct 12, 2021

Conversation

rkooo567 (Contributor) commented Oct 10, 2021

Why are these changes needed?

Here's the current flow of creating a placement group.

  • GCS finds nodes on which to schedule a pg.
  • GCS prepares / commits bundles to those nodes.
  • Once a bundle is committed, the raylet sends an RPC to create the new pg-formatted resources.
    • We need this RPC because the existing resource-pulling path currently cannot be used to "create" new resources. It only "updates existing resources".
  • The problem was here. As you can see from the screenshot below, this RPC has a huge overhead, which causes instability.
    • In the existing code path, it always calls GetResourceTotals, which generates a grpc-malloced map containing the node's entire resource set.
    • So, after the whole resource snapshot is sent to the GCS, the GCS detects the new resources and creates them. This had huge overhead because the size of the resource map keeps growing as you create more placement groups (each pg creates new pg-formatted resources).

This PR fixes the issue by only updating newly created resources.

[Screenshot: Screen Shot 2021-10-11 at 12 30 07 AM]

I verified #18409 is fixed after this PR is merged (and added the proper test).

NOTE: We would eventually like to consolidate "creating new resources" into the existing resource-pulling path. But that approach has pros and cons; e.g., one con is that placement group scheduling would then carry a constant overhead equal to the resource-pulling interval (it takes at least 100ms to propagate newly created resource information to other nodes), though we could add an additional mechanism to avoid that. We will discuss whether we need this in the design doc about pg performance. For making the current "known workloads" work, this PR should be sufficient.

Related issue number

Fixes #18409 and #19098 (verified that this is also fixed).

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

```cpp
/// \param resource_map_filter When returning the resource map, the returned result will
/// only contain the keys in the filter.
virtual ray::gcs::NodeResourceInfoAccessor::ResourceMap GetResourceTotals(
    const std::unordered_map<std::string, double> &resource_map_filter) const = 0;
```
rkooo567 (author):
NOTE: This method is only used by the placement group now. Technically the value is not required, but if we only pass the list of resource names, we need an unnecessary copy in the code below.

[Screenshot: Screen Shot 2021-10-11 at 12 37 43 AM]

rkooo567 (author):

I am happy to change this to the following; if you think it is better, let me know:

  • GetResourceTotals(std::vector<std::string> resource_names): This will incur an additional copy to generate a new vector, but the overhead might be quite small.

Reviewer (Contributor):

Maybe add a comment saying that only the key is used.

@rkooo567 rkooo567 changed the title [WIP][Placement Group] Fix the high load bug from the placement group [Placement Group] Fix the high load bug from the placement group Oct 11, 2021
```diff
@@ -102,7 +102,8 @@ void NewPlacementGroupResourceManager::CommitBundle(
   const auto &string_id_map = cluster_resource_scheduler_->GetStringIdMap();
   const auto &task_resource_instances = *bundle_state->resources_;

-  for (const auto &resource : bundle_spec.GetFormattedResources()) {
+  const auto &resources = bundle_spec.GetFormattedResources();
```
Reviewer (Contributor):

This should be the same?

Reviewer (Contributor):

Sorry, I just noticed this one is also used later.

Comment on lines +1132 to +1134:

```cpp
if (resource_map_filter.count(resource_name) == 0u) {
  continue;
}
```
Reviewer (Contributor):

I feel this probably looks better:

```cpp
if (!resource_map_filter.count(resource_name)) {
```

rkooo567 (author):

Actually this was the original code, and the lint somehow complains about it...

Reviewer (Contributor):

[image]

Reviewer (Contributor):

Ehh, @mwtian can we disable this lint rule?

Reviewer (Contributor):

0u is really strange

rkooo567 (author):

Should I wait until the linter rule is changed?

Reviewer (Member):

The lint rule is readability-implicit-bool-conversion. Writing resource_map_filter.count(resource_name) == 0 should also work. My opinion is that it is better to be explicit in integer- or pointer-to-boolean conversions. For example, this helps avoid bugs where an enum / integer status code is treated as a boolean. But if the consensus is to disable (1) the integer-to-boolean implicit conversion check, (2) the pointer-to-boolean implicit conversion check, or both, I can do that too.

ericl (Contributor) left a comment:

How do we ensure this doesn't regress in the future? Do we have a performance number trackable in the nightly dashboard?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 11, 2021
rkooo567 (author):

@ericl I am planning to add tests to guard against regression (in the next PR). (I think I will write more details in the performance doc), but the brief thought is to

rkooo567 (author):

Also, for the correctness regression, the modified unit test should cover it (that's the repro, and I verified it didn't work before this PR).

rkooo567 (author) commented Oct 12, 2021:

For more details about tests, see #18919 (and there might be more stress tests added based on the system / performance analysis document I will write next week to avoid regressions).

[Screenshot: Screen Shot 2021-10-12 at 4 15 13 AM]

@rkooo567 rkooo567 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Oct 12, 2021
ericl (Contributor) left a comment:

Plan sounds good!

@ericl ericl merged commit 4360b99 into ray-project:master Oct 12, 2021
ericl added a commit that referenced this pull request Oct 12, 2021
wuisawesome pushed a commit that referenced this pull request Oct 12, 2021
rkooo567 added a commit to rkooo567/ray that referenced this pull request Oct 12, 2021

Successfully merging this pull request may close these issues.

Placement groups hang because resources updates are lost under stress