-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Update placement group retry implementation #18842
Conversation
I'm working on unit testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice!
new
|
prod
|
@rkooo567 it's good for another look now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comments.
BUILD.bazel
Outdated
# "@com_google_googletest//:gtest_main", | ||
# ], | ||
# ) | ||
cc_test( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this relevant to this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. But if you ask, I can move it to another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, but why is it here? Was it intentional? (I wondered if it has bad merge with the master or sth initially)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I think it's actually bad behavior to mix things not related to this PR here.
It's intentional... I feel shame about this... Let me remove this change and submit another PR for this.
ASSERT_EQ(1, pending_queue.size()); | ||
ASSERT_EQ(rank, pending_queue.begin()->first); | ||
|
||
absl::SleepFor(absl::Nanoseconds(rank - absl::GetCurrentTimeNanos())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you at least use wait for ...? Based on my experience having a absolute time sleep usually was the source of flakniess
src/ray/gcs/gcs_server/test/gcs_placement_group_manager_mock_test.cc
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Feel free to merge if tests pass!
ASSERT_EQ(1, pending_queue.size()); | ||
ASSERT_EQ(rank, pending_queue.begin()->first); | ||
|
||
absl::SleepFor(absl::Nanoseconds(rank - absl::GetCurrentTimeNanos())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to run 1000 times. Since cpp tests are almost never flaky, it will
Be easy to detect if that happens. If you think this is fine, I can just merge and see how the result will look like
src/ray/gcs/gcs_server/test/gcs_placement_group_manager_mock_test.cc
Outdated
Show resolved
Hide resolved
@rkooo567 just updated it. let me know if I missed anything |
windows test of |
|
||
if (!exp_backer) { | ||
exp_backer = ExponentialBackOff( | ||
1000000 * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is 1000000? Why times 1000000 if it's ms?
Why are these changes needed?
Please check the ticket for more detail of this PR.
Previously, we always post a retry task even there is one in the queue because the waiting is implemented based on defer submission. There are several problems here:
This PR fixed this issue:
Related issue number
Closes #18541
#18651
Checks
scripts/format.sh
to lint the changes in this PR.