Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCS] Non-STRICT_PACK PGs should be sorted by resource priority, size #22762

Merged
merged 21 commits into from
Mar 12, 2022

Conversation

jon-chuang
Copy link
Contributor

@jon-chuang jon-chuang commented Mar 2, 2022

Why are these changes needed?

Previously, placement group had suboptimal bin-packing resulting in unexpected placement group stalls for users.

The root cause is lack of implementation for sorting of pg bundles by resource priority and size.

This PR implements a naive priority mechanism for bundles that can be improved upon (and even config by user in the future) in the GCS resource scheduler.

The behaviour is to schedule: "GPU" first, custom resources in int64_t order next, and finally, memory and then "CPU" last.

CC @rkooo567

Related issue number

Fixes: #18309

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Mar 2, 2022

@rkooo567 I am thinking of moving unit tests to gcs_resource_scheduler_test.cc.

It is faster to run, and more directly addresses the logic of scheduling at root of this issue.

Or should the more "integration" style (.py) tests created here be kept?

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Mar 3, 2022

Seems redis-ha failure is due to test_placement_group. It could just be long running with new added tests and not actually reflective of an issue.
Windows one is unrelated.

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 3, 2022

Can you add tests for both? I think we generally do both

@rkooo567 rkooo567 self-assigned this Mar 3, 2022
@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 3, 2022
@rkooo567
Copy link
Contributor

rkooo567 commented Mar 3, 2022

Seems redis-ha failure is due to test_placement_group. It

Can you actually add new tests to test_placement_group_4.py?

@jon-chuang
Copy link
Contributor Author

Ok, done.

btw, I believe placement_group_4 tests were not running previously due to missing:

if __name__ == "__main__":
    sys.exit(pytest.main(["-sv", __file__]))

@jon-chuang
Copy link
Contributor Author

test_placement_group_4 is now timing out for ha gcs. Move to test_placement_group_5?

@jon-chuang
Copy link
Contributor Author

@rkooo567 merge?

@jon-chuang jon-chuang closed this Mar 7, 2022
@jon-chuang jon-chuang reopened this Mar 7, 2022
python/ray/tests/test_placement_group.py Outdated Show resolved Hide resolved
python/ray/tests/test_placement_group_5.py Show resolved Hide resolved
python/ray/tests/test_placement_group_4.py Outdated Show resolved Hide resolved
python/ray/tests/test_placement_group.py Outdated Show resolved Hide resolved
python/ray/tests/test_placement_group_5.py Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_resource_scheduler.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_resource_scheduler.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_resource_scheduler.cc Outdated Show resolved Hide resolved

const auto &result1 =
gcs_resource_scheduler_->Schedule(required_resources_list, scheduling_type);
ASSERT_TRUE(result1.first == gcs::SchedulingResultStatus::SUCCESS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the same test case as the one from the issue?

Copy link
Contributor Author

@jon-chuang jon-chuang Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reduced issue to minimal reproducible.
I tested it also succeeds on original issue's setup. (and fails prior to sorting)

std::vector<NodeID> node_ids;
for (int i = 0; i < 6; i++) {
node_ids.emplace_back(NodeID::FromRandom());
AddClusterResources(node_ids.back(), resources_list[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you actually insert from the backward? 5->0. I think that will be close to generating the error scenario? Also did this test fail before this PR?

Copy link
Contributor Author

@jon-chuang jon-chuang Mar 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failed before PR (you can comment out sort).

Order doesn't really matter, the Score function is greedy about placing on the largest node, so that node has as much free remaining resource as possible. This causes order not to matter too much.

However I have improved the order slightly.

@rkooo567 rkooo567 added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Mar 8, 2022
@jon-chuang
Copy link
Contributor Author

Test_placement_group_5 timeout for python medium and fail for windows build and test.

Maybe I'll reduce size of integration test.

Also, not possible to view logs on windows.

@jon-chuang
Copy link
Contributor Author

Again failed on windows. Very strange.

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 8, 2022

Maybe this also could mean the test is flaky

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 8, 2022

If you run the same test like 50 times in a row, does it always succeed?

src/ray/gcs/gcs_server/gcs_resource_scheduler.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_resource_scheduler.cc Outdated Show resolved Hide resolved
@jon-chuang
Copy link
Contributor Author

test failures are now related to actor and client.

@rkooo567
Copy link
Contributor

That was broken in the master. Can you merge the latest master?

@rkooo567
Copy link
Contributor

Failed tests are unrelated

@rkooo567 rkooo567 merged commit 0b54d9c into ray-project:master Mar 12, 2022
@rkooo567
Copy link
Contributor

Great job merging this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
2 participants