[core] Update placement group retry implementation #18842

Merged: 56 commits into ray-project:master on Oct 5, 2021

Conversation

fishbone (Contributor) commented Sep 23, 2021

Why are these changes needed?

Please see the linked issue for more detail on this PR.

Previously, we always posted a retry task even when one was already in the queue, because the waiting was implemented via deferred submission. This has several problems:

  • retries accumulate in the pool
  • in some cases, the retried task does not schedule the failed placement group
  • more complicated retry policies are not supported

This PR fixes these issues (a minimal sketch of the design follows this list):

  • Use a priority queue for pending placement groups
  • Don't use the io context for retry
  • Use exponential backoff for retry
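
A minimal sketch of that design is below. It is not the actual Ray implementation; all names here (Backoff, pending_queue, NowNs) are hypothetical. Pending placement groups sit in an ordered map keyed by the absolute time at which they next become schedulable (their "rank"), and each failed attempt re-inserts the group with an exponentially larger delay:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Exponential backoff: the interval grows by `multiplier` per attempt and is
// capped at `max_ns`.
class Backoff {
 public:
  Backoff(int64_t initial_ns, double multiplier, int64_t max_ns)
      : current_ns_(initial_ns), multiplier_(multiplier), max_ns_(max_ns) {}
  int64_t Next() {
    int64_t v = current_ns_;
    current_ns_ = std::min<int64_t>(
        static_cast<int64_t>(current_ns_ * multiplier_), max_ns_);
    return v;
  }

 private:
  int64_t current_ns_;
  double multiplier_;
  int64_t max_ns_;
};

int64_t NowNs() {
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

int main() {
  // Rank (absolute ready time, in ns) -> placement group id. A std::multimap
  // keeps entries ordered, so begin() is always the next group to retry.
  std::multimap<int64_t, std::string> pending_queue;
  Backoff backoff(/*initial_ns=*/1000000, /*multiplier=*/2.0,
                  /*max_ns=*/1000000000);

  // On a scheduling failure, re-enqueue with the next backoff delay instead
  // of posting a deferred retry task, so duplicate retries cannot accumulate.
  pending_queue.emplace(NowNs() + backoff.Next(), "pg-1");

  // The scheduling loop only pops groups whose rank has passed.
  while (!pending_queue.empty() && pending_queue.begin()->first <= NowNs()) {
    auto it = pending_queue.begin();
    // ... try to schedule it->second; on failure, re-insert with a new rank.
    pending_queue.erase(it);
  }
  return 0;
}
```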

Related issue number

Closes #18541

#18651

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567 self-assigned this Sep 23, 2021
@fishbone changed the title from "[wip] Pg schedule" to "[core] Update placement group retry implementation" Sep 25, 2021
@fishbone marked this pull request as ready for review Sep 25, 2021
@fishbone changed the title from "[core] Update placement group retry implementation" to "[wip][core] Update placement group retry implementation" Sep 25, 2021
fishbone (Contributor, Author) commented:

I'm working on unit tests; the e2e test is running.

ericl (Contributor) left a comment:

Very nice!

Review threads (resolved): src/ray/common/ray_config_def.h, src/ray/gcs/gcs_server/gcs_placement_group_manager.h
@ericl added the @author-action-required label Sep 25, 2021
Review threads (resolved): python/ray/tune/tests/test_trial_runner_pg.py, python/ray/tune/utils/placement_groups.py, src/ray/gcs/gcs_server/gcs_placement_group_manager.cc (two threads), src/ray/gcs/test/gcs_test_util.h, src/ray/util/util.h (two threads)
fishbone (Contributor, Author) commented Oct 1, 2021

Results for the new implementation (this PR's branch):

success = 1
avg_pg_create_time_ms = 1.1399193753719132
avg_pg_remove_time_ms = 4.032320572071604
_runtime = 479.986270904541
_session_url = https://beta.anyscale.com/o/anyscale-internal/projects/prj_LFKjNprpiYt4AjAf1NdLDJwn/clusters/ses_yjTUA8gaZDSjF43Et5YJHfj5
_commit_url = https://ray-wheels.s3.us-west-2.amazonaws.com/test_wheels/pg-schedule/01b13b601bca5f168fbb367341f330db191c9e83/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
_stable = True

fishbone (Contributor, Author) commented Oct 1, 2021

Results for current master (baseline):

success = 1
avg_pg_create_time_ms = 1.0101662852838733
avg_pg_remove_time_ms = 4.4615456231242465
_runtime = 461.87430810928345
_session_url = https://beta.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_K3pD2PKyVqkGDV4iD4XBdqrX
_commit_url = https://s3-us-west-2.amazonaws.com/ray-wheels/master/6dc1a6b72f687c8ed8bcb8e687d928afd442f5f5/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
_stable = True

@fishbone removed the @author-action-required label Oct 2, 2021
fishbone (Contributor, Author) commented Oct 2, 2021

@rkooo567 it's ready for another look now.

rkooo567 (Contributor) left a comment:

Last comments.

BUILD.bazel (outdated diff):
# "@com_google_googletest//:gtest_main",
# ],
# )
cc_test(
rkooo567 (Contributor):

Is this relevant to this PR?

fishbone (Contributor, Author):

No. If you'd like, I can move it to another PR.

rkooo567 (Contributor):

I think this is fine, but why is it here? Was it intentional? (Initially I wondered whether it was a bad merge with master or something.)

fishbone (Contributor, Author):

Yes, it's intentional, though I think it's bad practice to mix unrelated changes into this PR, and I'm a bit embarrassed about it. Let me remove this change and submit another PR for it.

ASSERT_EQ(1, pending_queue.size());
ASSERT_EQ(rank, pending_queue.begin()->first);

absl::SleepFor(absl::Nanoseconds(rank - absl::GetCurrentTimeNanos()));
rkooo567 (Contributor):

Can you at least use a wait-for here? In my experience, sleeping until an absolute time is usually a source of flakiness.
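
A hedged sketch of the suggested wait-for pattern (WaitForCondition here is a hypothetical helper, not necessarily Ray's actual test utility): poll the condition with a timeout rather than sleeping until an absolute deadline, so clock jitter cannot fail the test.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: poll `cond` until it holds or `timeout` elapses,
// instead of sleeping to an absolute point in time.
bool WaitForCondition(const std::function<bool()> &cond,
                      std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (cond()) return true;
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
  return cond();  // one final check at the deadline
}

// The absolute-time sleep in the test above could then become roughly:
//   ASSERT_TRUE(WaitForCondition(
//       [&] { return absl::GetCurrentTimeNanos() >= rank; },
//       std::chrono::seconds(5)));
```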

rkooo567 (Contributor) left a comment:

LGTM. Feel free to merge if tests pass!

ASSERT_EQ(1, pending_queue.size());
ASSERT_EQ(rank, pending_queue.begin()->first);

absl::SleepFor(absl::Nanoseconds(rank - absl::GetCurrentTimeNanos()));
rkooo567 (Contributor):

No need to run it 1000 times. Since C++ tests are almost never flaky, it will be easy to detect if that happens. If you think this is fine, I can just merge and see how the results look.

@fishbone added the @author-action-required label Oct 4, 2021
fishbone (Contributor, Author) commented Oct 4, 2021

@rkooo567 just updated it. Let me know if I missed anything.

rkooo567 (Contributor) commented Oct 4, 2021

[Screenshot of the failing test, 2021-10-04]

Seems like this fails

fishbone (Contributor, Author) commented Oct 4, 2021

Regarding the failing test above: changed LT to LE, because the clock resolution is microseconds and the code under test can finish within 1 microsecond, so the two timestamps can be equal.
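
In gtest terms, the change described above looks roughly like the following hypothetical sketch (not the actual test from this PR):

```cpp
#include <cstdint>

#include "absl/time/clock.h"
#include "gtest/gtest.h"

// With microsecond clock resolution, two timestamps taken within the same
// microsecond compare equal, so a strict less-than assertion is flaky.
TEST(ClockResolutionSketch, UseLessOrEqualForTimestamps) {
  const int64_t start_ns = absl::GetCurrentTimeNanos();
  // ... code under test, which may finish in under 1 microsecond ...
  // ASSERT_LT(start_ns, absl::GetCurrentTimeNanos());  // flaky: may tie
  ASSERT_LE(start_ns, absl::GetCurrentTimeNanos());  // robust to ties
}
```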

fishbone (Contributor, Author) commented Oct 5, 2021

The Windows run of //python/ray/tests:test_client_reconnect is flaky.

@fishbone merged commit 056c3af into ray-project:master Oct 5, 2021

if (!exp_backer) {
  exp_backer = ExponentialBackOff(
      1000000 *
rkooo567 (Contributor):

What is 1000000? Why multiply by 1000000 if the value is in ms?
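
One hedged reading, an assumption not confirmed in this thread: the configured interval is in milliseconds while retry ranks are in nanoseconds (cf. the absl::GetCurrentTimeNanos() usage in the tests above), and 1 ms = 1,000,000 ns, so the factor converts ms to ns. A named constant would make that explicit (hypothetical names below):

```cpp
#include <cstdint>

// 1 millisecond = 1,000,000 nanoseconds.
constexpr int64_t kNanosecondsPerMillisecond = 1000000;

int64_t MillisToNanos(int64_t interval_ms) {
  return kNanosecondsPerMillisecond * interval_ms;
}
```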

Labels: @author-action-required

Successfully merging this pull request may close these issues:

  • [release] many_ppo failing: "There was timeout in removing the placement group"