[core][pg] Fix bug where pg bundles can't be rescheduled when two nodes both die #24875
Conversation
Force-pushed from f363aba to b8c5142
ETA for the review: EoD tomorrow, Japan time zone.
Force-pushed from b8c5142 to 21520f2
Could you update the description with more detail about why it's failing and how this PR fixes it? Also, could you please merge your PR with master?
@@ -354,6 +360,12 @@ class GcsPlacementGroupManager : public rpc::PlacementGroupInfoHandler {
// Update placement group load information so that the autoscaler can use it.
void UpdatePlacementGroupLoad();

/// Check if this placement group is waiting for scheduling.
bool IsPlacmentGroupIDInPendingQueue(const PlacementGroupID &placement_group_id) const;
No strong opinion, but I feel the name is too long. Maybe IsInPendingQueue(...)?
done
I'm not sure, but shouldn't we just allow it to be pushed to the pending queue? @rkooo567 could you take a look?
Force-pushed from 21520f2 to 592af5a
I'm trying to understand the PR. I tried this on my dev machine and it always passes (on master): https://gist.github.com/6edaf987a00e185f50bfaf1f7168d758 I'm not sure what the issue is here. It would be good if you could provide a simple script that can reproduce the issue.
This scenario is not easy to reproduce. You can only trigger it by modifying NodeManager::HandlePrepareBundleResources so that it waits for a while; that way, while bundle 1 is still being scheduled, bundle 2 triggers the OnNodeDead flow. Here is a simple script that can reproduce the issue:
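A minimal version might look like the following; the cluster layout, resource sizes, and node counts here are assumptions, and the race only triggers reliably when the prepare-bundle RPC is artificially delayed (see the latency-injection note below):

```python
import ray
from ray.cluster_utils import Cluster
from ray.util.placement_group import placement_group

# Small local cluster: one head node (no bundles) plus two workers,
# each with exactly one CPU so one bundle lands on each worker.
cluster = Cluster()
cluster.add_node(num_cpus=0)  # head node
ray.init(address=cluster.address)
worker1 = cluster.add_node(num_cpus=1)
worker2 = cluster.add_node(num_cpus=1)

# A placement group with two bundles, one per worker node.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="SPREAD")
ray.get(pg.ready())

# Kill both worker nodes back to back, so the second bundle's OnNodeDead
# fires while the first bundle is still being rescheduled (the injected
# delay on the prepare path makes this window easy to hit).
cluster.remove_node(worker1)
cluster.remove_node(worker2)

# Bring up two replacement nodes; before the fix, the placement group
# could get stuck in RESCHEDULING and never become ready again.
cluster.add_node(num_cpus=1)
cluster.add_node(num_cpus=1)
ray.get(pg.ready(), timeout=60)
print("placement group rescheduled successfully")
```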
@larrylian we have latency injection here https://docs.ray.io/en/latest/ray-contribute/debugging.html#callback-latency-injection Maybe you can give that a try. Btw, the script you gave is what I tried locally: I built your branch and ran the test script on this branch.
@iycheng
Here are some examples you can refer to: https://sourcegraph.com/search?patternType=regexp&case=yes&q=context:%40iycheng/ray+RAY_testing_asio_delay_us
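For what it's worth, wiring it up might look roughly like this; the handler key for PrepareBundleResources is an assumption patterned after the documented examples, so double-check it against the linked doc:

```python
import os
import ray

# Assumed handler key, patterned after the documented examples
# ("<Service>.grpc_server.<Method>=<min_delay_us>:<max_delay_us>");
# verify the exact name against the callback-latency-injection doc above.
os.environ["RAY_testing_asio_delay_us"] = (
    "NodeManagerService.grpc_server.PrepareBundleResources=2000000:2000000"
)

# The variable must be set before the raylet processes are spawned,
# i.e. before ray.init() here (or exported in the shell before `ray start`).
ray.init()
```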
Can you make SchedulePendingPlacementGroups a no-op if all bundles are already placed? I think we need this as a safety check.
I think we can easily reproduce it if we add some delay on either the commit or prepare RPCs.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
Force-pushed from 85583d7 to 44d2a2f
Force-pushed from fff2817 to e10cc6a
Force-pushed from e10cc6a to b87e08e
Force-pushed from 8d81072 to 5fce60b
…h dead Signed-off-by: 稚鱼 <[email protected]>
Force-pushed from 5fce60b to e422a9b
@@ -728,6 +737,9 @@ void GcsPlacementGroupManager::OnNodeDead(const NodeID &node_id) {
if (iter != registered_placement_groups_.end()) {
for (const auto &bundle_index : bundle.second) {
iter->second->GetMutableBundle(bundle_index)->clear_node_id();
RAY_LOG(INFO) << "Rescheduling a bundle when a node dies, placement group id:"
Remove this, or just change it to a debug log.
- OnNodeDead is only reached when a node goes down abnormally. So this log is not printed in normal operation and will not be printed frequently; it only appears when a node fails.
- Production environments run with INFO-level logs and cannot use DEBUG logs. If a node fails in production, you need this log to figure out which bundles were failed over.
So I think this log should be kept at INFO level.
+1. Can you address this?
Reminder, the 2.2 release is scheduled for branch cut next Monday. Do folks feel that we are pretty close on this PR? Only asking as I notice that this PR has been out for a while.
Let me run the nightly test before merging.
[x] placement_group_stress_test
Thanks, please contact me if you have any problems.
I see that most CI checks have passed. Is there any problem with your nightly test?
Running another 2 now. It seems promising so far.
@larrylian I will merge this by today (branch cut is tomorrow). We should be able to verify all tests are passing by today.
OK, thanks.
Thanks @rkooo567 for covering this. My bad, somehow I forgot it :(
…h dead (ray-project#24875) Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
When a bundle is being rescheduled due to node death, and other bundles of the same pg also trigger rescheduling because of node death, there is a bug where those bundles can never be scheduled.
Reason:
Step 1: Node A goes down, and bundle 1 of a PG deployed on that node enters GcsPlacementGroupManager::OnNodeDead. The PG's state becomes RESCHEDULING and it goes into scheduling.
Step 2: While this PG is being scheduled, another node B also goes down. Bundle 2 of the same PG also enters GcsPlacementGroupManager::OnNodeDead.
Step 3: Because the PG's state is already RESCHEDULING, bundle 2 cannot be added to the pending queue. In the end, bundle 2 can never be rescheduled.
Solution:
1. After each PG is scheduled successfully, check whether the PG still has any unplaced bundles. If it does, schedule it again (a rough sketch of the idea follows below).
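In rough, illustrative Python pseudocode (the real change lives in the C++ GCS placement group manager; the names and data shapes below are made up purely for illustration):

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only, not the real GCS code: after a placement
# group finishes a round of scheduling, re-check for bundles whose node
# died while the reschedule was already in flight.

@dataclass
class Bundle:
    index: int
    node_id: Optional[str] = None  # None => not placed on any live node

@dataclass
class PlacementGroup:
    bundles: List[Bundle] = field(default_factory=list)

def on_scheduling_done(pg: PlacementGroup,
                       pending_queue: List[PlacementGroup]) -> None:
    unplaced = [b for b in pg.bundles if b.node_id is None]
    if unplaced:
        # A second node death cleared node_id on these bundles while the
        # group was already RESCHEDULING; without this re-check they
        # would never be scheduled again.
        pending_queue.append(pg)

# Minimal usage: bundle 0 was re-placed, bundle 1 lost its node mid-reschedule.
pg = PlacementGroup([Bundle(0, "node-C"), Bundle(1, None)])
queue: List[PlacementGroup] = []
on_scheduling_done(pg, queue)
assert queue == [pg]
```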
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.