
[core][pg] Fix pg bundles can't reschedule bug when the two nodes both dead #24875

Merged
1 commit merged into ray-project:master on Nov 21, 2022

Conversation

@larrylian (Contributor) commented May 17, 2022

Why are these changes needed?

When a bundle is rescheduled due to a node death, and other bundles of the same PG also trigger rescheduling because another node dies, there is a bug where the second bundle can never be scheduled.

Reason:
Step 1: Node A goes down, and bundle 1 of a PG deployed on that node enters GcsPlacementGroupManager::OnNodeDead. The PG state becomes RESCHEDULING and the PG goes into scheduling.

Step 2: While this PG is being scheduled, another node B also goes down. Bundle 2 of the same PG also enters GcsPlacementGroupManager::OnNodeDead.

Step 3: Because the PG state is already RESCHEDULING, bundle 2 cannot be added to the pending queue. As a result, bundle 2 is never rescheduled (see the sketch after these steps).
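A simplified sketch of this flow (the helpers marked below are illustrative assumptions, not the exact GCS code):

// Sketch only: OnNodeDead re-queues a placement group for scheduling unless the
// group is already RESCHEDULING. If a second node dies while the group is being
// scheduled, the bundle cleared by that second death is never re-queued.
void GcsPlacementGroupManager::OnNodeDead(const NodeID &node_id) {
  for (auto &placement_group : GetGroupsWithBundlesOnNode(node_id)) {  // hypothetical helper
    ClearBundleNodeIds(placement_group, node_id);                      // hypothetical helper
    if (placement_group->GetState() != rpc::PlacementGroupTableData::RESCHEDULING) {
      placement_group->UpdateState(rpc::PlacementGroupTableData::RESCHEDULING);
      AddToPendingQueue(placement_group);
      SchedulePendingPlacementGroups();
    }
    // else: the group is already being rescheduled, so nothing ever re-queues
    // the bundle that was just cleared, which is exactly the bug.
  }
}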


Solution:
1. After each PG is scheduled successfully, check whether the PG still has any unplaced bundles. If it does, schedule them again (see the sketch below).
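A rough sketch of that idea (method names such as HasUnplacedBundles are assumptions based on this description, not the exact diff):

// Sketch only: after a placement group finishes (re)scheduling, check whether a
// concurrent node death cleared additional bundles; if so, queue the group again
// instead of leaving those bundles stranded.
void GcsPlacementGroupManager::OnPlacementGroupCreationSuccess(
    const std::shared_ptr<GcsPlacementGroup> &placement_group) {
  // ... existing success bookkeeping ...

  if (placement_group->HasUnplacedBundles()) {  // assumed helper
    placement_group->UpdateState(rpc::PlacementGroupTableData::RESCHEDULING);
    AddToPendingQueue(placement_group);
    SchedulePendingPlacementGroups();
  }
}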

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 (Contributor) commented Jun 9, 2022

ETA for the review: EoD tomorrow, Japan time zone.

@fishbone (Contributor) commented:

Could you update the description with more detail about why it's failing and how this PR fixes it?

Also, could you please merge your PR branch with master?

@fishbone fishbone changed the title [core/pg] Fix pg bundles can't reschedule bug when the two nodes both dead [core][pg] Fix pg bundles can't reschedule bug when the two nodes both dead Jul 6, 2022
@fishbone fishbone self-assigned this Jul 6, 2022
@@ -354,6 +360,12 @@ class GcsPlacementGroupManager : public rpc::PlacementGroupInfoHandler {
// Update placement group load information so that the autoscaler can use it.
void UpdatePlacementGroupLoad();

/// Check if this placement group is waiting for scheduling.
bool IsPlacmentGroupIDInPendingQueue(const PlacementGroupID &placement_group_id) const;
A reviewer (Contributor) commented:

No strong opinion, but I feel the name is too long. How about:

IsInPendingQueue(...)

@larrylian (Contributor, Author) replied:

done
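For reference, a minimal sketch of what the renamed helper could look like (the layout of the pending_placement_groups_ entries is an assumption here):

// Sketch only: scan the pending queue for the given placement group id.
bool GcsPlacementGroupManager::IsInPendingQueue(
    const PlacementGroupID &placement_group_id) const {
  for (const auto &entry : pending_placement_groups_) {
    // Assumes each queue entry holds a (backoff, placement_group) pair.
    if (entry.second.second->GetPlacementGroupID() == placement_group_id) {
      return true;
    }
  }
  return false;
}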

@fishbone (Contributor) commented Jul 6, 2022

I'm not sure, but shouldn't we just allow it to be pushed to the pending queue?

@rkooo567 could you take a look?

@larrylian (Contributor, Author) commented:

> I'm not sure, but shouldn't we just allow it to be pushed to the pending queue?

Pushing it to the pending queue is fine.

@fishbone (Contributor) commented:

I'm trying to understand the PR. I tried this on my dev machine and it always passes (on master):

https://gist.github.com/6edaf987a00e185f50bfaf1f7168d758

I'm not sure what the issue is here. It would be good if you could provide a simple script that reproduces the issue.

@larrylian (Contributor, Author) commented:

> I'm trying to understand the PR. I tried this on my dev machine and it always passes (on master):
>
> https://gist.github.com/6edaf987a00e185f50bfaf1f7168d758
>
> I'm not sure what the issue is here. It would be good if you could provide a simple script that reproduces the issue.

This scenario is not easy to reproduce. It can only be reproduced by modifying void NodeManager::HandlePrepareBundleResources so that it waits for a while; that way, while bundle 1 is still being scheduled, bundle 2 triggers the OnNodeDead process. For example:
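For illustration, the kind of change meant here looks roughly like this (the real signature of HandlePrepareBundleResources may differ; treat this as a sketch):

// Sketch only: stall the prepare phase so that the second node death arrives
// while the placement group is still RESCHEDULING (requires <chrono> and <thread>).
void NodeManager::HandlePrepareBundleResources(
    rpc::PrepareBundleResourcesRequest request,
    rpc::PrepareBundleResourcesReply *reply,
    rpc::SendReplyCallback send_reply_callback) {
  std::this_thread::sleep_for(std::chrono::seconds(2));  // widen the race window
  // ... original prepare logic ...
}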

A simple script that can reproduce the issue:

import pytest
import sys
import time
import ray
import ray.cluster_utils
from ray._private.test_utils import (
    get_other_nodes,
)

MB = 1024 * 1024


@pytest.mark.parametrize("repeat", list(range(3)))
def test_placement_group_failover(ray_start_cluster, repeat):
    @ray.remote(num_cpus=1)
    class Actor(object):
        def __init__(self):
            self.n = 0

        def value(self):
            return self.n

    cluster = ray_start_cluster
    num_nodes = 6
    nodes = []
    for _ in range(num_nodes):
        nodes.append(cluster.add_node(num_cpus=1))
    ray.init(address=cluster.address)

    # Create a STRICT_SPREAD placement group with one bundle per node.
    bundles = [{"CPU": 1, "memory": 100 * MB} for _ in range(num_nodes)]
    placement_group = ray.util.placement_group(
        name="name", strategy="STRICT_SPREAD", bundles=bundles
    )
    assert placement_group.wait(10000)

    # Add replacement nodes, then kill every non-head node so that several
    # bundles of the same placement group fail over at roughly the same time.
    other_nodes = get_other_nodes(cluster, exclude_head=True)
    other_nodes_num = len(other_nodes)
    for i in range(other_nodes_num):
        cluster.add_node(num_cpus=1)
    cluster.wait_for_nodes()
    for node in other_nodes:
        cluster.remove_node(node)
    time.sleep(1)

    # Every bundle should have been rescheduled; placing an actor into each
    # bundle must succeed within the timeout.
    for i in range(num_nodes):
        actor = Actor.options(
            placement_group=placement_group, placement_group_bundle_index=i
        ).remote()
        object_ref = actor.value.remote()
        ray.get(object_ref, timeout=5)


if __name__ == "__main__":
    sys.exit(pytest.main(["-v", __file__]))

@fishbone (Contributor) commented Jul 20, 2022

@larrylian we have latency injection here https://docs.ray.io/en/latest/ray-contribute/debugging.html#callback-latency-injection

Maybe you can give that a try.

Btw, the script you gave is what I tried locally. I built your branch and ran the test script on that branch.
What I mean is that it always passes. Maybe it's because of the machine setup (I saw you have repeat=3 in the decorator, so I guess it doesn't always reproduce?).

@larrylian (Contributor, Author) commented Jul 20, 2022

@iycheng
It is difficult to reproduce without injecting a delay. I will first look into the delay-injection method you mentioned.

@fishbone (Contributor) commented:

> @iycheng It is difficult to reproduce without injecting a delay. I will first look into the delay-injection method you mentioned.

Here are some examples you can refer to: https://sourcegraph.com/search?patternType=regexp&case=yes&q=context:%40iycheng/ray+RAY_testing_asio_delay_us

@rkooo567 (Contributor) left a review comment:

Can you make SchedulePendingPlacementGroups a no-op if all bundles are already placed? I think we need this as a safety check.
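A sketch of the requested safety check (the queue handling is simplified and the helper names are assumptions):

// Sketch only: when draining the pending queue, skip groups whose bundles are
// all placed so that a stray entry becomes a no-op.
void GcsPlacementGroupManager::SchedulePendingPlacementGroups() {
  if (pending_placement_groups_.empty()) {
    return;
  }
  auto placement_group = NextPendingPlacementGroup();  // hypothetical helper
  if (!placement_group->HasUnplacedBundles()) {        // assumed helper
    RemoveFromPendingQueue(placement_group);           // hypothetical helper
    return;  // all bundles already placed; nothing to schedule
  }
  // ... hand the group to the scheduler as before ...
}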

Several review threads on python/ray/tests/test_placement_group_fo.py and src/ray/gcs/gcs_server/gcs_placement_group_manager.cc were marked outdated and resolved.
@rkooo567 (Contributor) commented:

I think we can easily reproduce it if we add some delay to either the commit or prepare RPCs.

stale bot commented Sep 8, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 8, 2022
stale bot commented Sep 24, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Sep 24, 2022
@larrylian larrylian reopened this Oct 11, 2022
@larrylian larrylian force-pushed the fix_pg_bundle_both_fo_bug branch 2 times, most recently from 85583d7 to 44d2a2f on October 11, 2022 08:19
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 12, 2022
@larrylian larrylian force-pushed the fix_pg_bundle_both_fo_bug branch 3 times, most recently from 8d81072 to 5fce60b on November 3, 2022 03:27
@@ -728,6 +737,9 @@ void GcsPlacementGroupManager::OnNodeDead(const NodeID &node_id) {
if (iter != registered_placement_groups_.end()) {
for (const auto &bundle_index : bundle.second) {
iter->second->GetMutableBundle(bundle_index)->clear_node_id();
RAY_LOG(INFO) << "Rescheduling a bundle when a node dies, placement group id:"
A reviewer (Contributor) commented:

Remove this, or just change it to a debug log.

@larrylian (Contributor, Author) replied:

  1. OnNodeDead is only entered when a node goes down abnormally, so this log is not printed often; it only appears when a node is abnormal.
  2. The production environment uses INFO-level logs and cannot enable DEBUG logs. If a node fails in production, this log is needed to determine which bundles failed over.

So I think this log should be kept at INFO level.

A reviewer (Contributor) replied:

+1. Can you address this?

@fishbone fishbone added release-blocker P0 Issue that blocks the release Ray 2.2 labels Nov 17, 2022
@scv119 scv119 added the P0 Issues that should be fixed in short order label Nov 17, 2022
@xwjiang2010 (Contributor) commented:

Reminder: the 2.2 release branch cut is scheduled for next Monday. Do folks feel we are pretty close on this PR? Only asking because I notice this PR has been out for a while.

@rkooo567 (Contributor) commented:

Let me run the nightly test before merging.

@rkooo567 (Contributor) commented Nov 18, 2022

[x] placement_group_stress_test
[x] pg_long_running_performance_test
[x] placement_group_performance_test
[ ] many_pgs (Running)

@larrylian (Contributor, Author) commented:

> [x] placement_group_stress_test [ ] pg_long_running_performance_test (Running) [ ] placement_group_performance_test [ ] many_pgs

Thanks. Please contact me if you run into any problems.

@larrylian (Contributor, Author) commented:

> [x] placement_group_stress_test [x] pg_long_running_performance_test [ ] placement_group_performance_test (Running) [ ] many_pgs

I see that most CI checks have passed. Is there any problem with your nightly tests?

@rkooo567 (Contributor) commented:

Running another two now. It seems promising so far.

@rkooo567 (Contributor) commented Nov 21, 2022

@larrylian I will merge this by today (branch cut is tomorrow). We should be able to verify that all tests are passing by today.

@larrylian (Contributor, Author) commented:

> @larrylian I will merge this by today (branch cut is tomorrow). We should be able to verify that all tests are passing by today.

OK, thanks.

@rkooo567 rkooo567 merged commit ddfeae3 into ray-project:master Nov 21, 2022
@fishbone (Contributor) commented:

Thanks @rkooo567 for covering this. My bad, I somehow forgot about it :(

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…h dead (ray-project#24875)

Signed-off-by: Weichen Xu <[email protected]>
Labels: P0 Issues that should be fixed in short order · release-blocker P0 Issue that blocks the release
5 participants