Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Fix max number of actors for default actor pool strategy #26266

Merged
merged 3 commits into from
Jul 3, 2022

Conversation

c21
Copy link
Contributor

@c21 c21 commented Jul 2, 2022

Why are these changes needed?

This PR is to fix the maximal number of actors used in datasets tranformation, when using default actor pool strategy (i.e. ActorPoolStrategy()). It should not exceed 1.25 * num_cpus set in the cluster.

Related issue number

Closes #26221

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21 c21 changed the title Fix max number of actors for default actor pool strategy [Datasets] Fix max number of actors for default actor pool strategy Jul 2, 2022
@@ -295,7 +297,8 @@ def map_block_nosplit(
if not ready:
if (
len(workers) < self.max_size
and len(ready_workers) / len(workers) > 0.8
and len(ready_workers) / len(workers)
> self.ready_to_total_workers_ratio
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird format to me, but this is the result after running scripts/format.sh.

def test_actorpoolstrategy_default_max_actors(shutdown_only):
def f(x):
import time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, scripts/format.sh decides to add a new line here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep our formatting script will add a newline after imports, even dynamic imports within functions.

@@ -4233,7 +4233,7 @@ def f(should_import_polars):
ctx.use_polars = original_use_polars


def test_actorpoolstrategy_apply_interrupt():
def test_actorpoolstrategy_apply_interrupt(shutdown_only):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shutdown_only has to add here, o.w. the newly added unit test test_actorpoolstrategy_default_max_actors below cannot call ray.init again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is to be expected for tests that start a Ray cluster! If the cluster is shareable (e.g. same cluster config), the ray_start_regular_shared fixture takes care of sharing a common cluster across tests. But since this test needs a special cluster config, the shutdown_only fixture is the way to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supernit: test_actor_pool_strategy_apply_interrupt and test_actor_pool_strategy_default_max_actors may be a bit more readable, snake-casing the camel-cased class name instead of actorpoolstrategy. This is up to personal preference I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clarkzinzow - agree, updated.

num_cpus * (1 / compute_strategy.ready_to_total_workers_ratio)
)
assert (
compute_strategy.num_workers <= expected_max_num_workers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also assert that it scaled up to at least 5 workers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericl - makes sense, added.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 2, 2022
@c21 c21 removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 2, 2022
Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, assuming you've checked the test fails before the fix.

@c21
Copy link
Contributor Author

c21 commented Jul 2, 2022

LGTM, assuming you've checked the test fails before the fix.

@ericl - yes, checked before creating the PR, thanks for review.

Copy link
Contributor

@clarkzinzow clarkzinzow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -4233,7 +4233,7 @@ def f(should_import_polars):
ctx.use_polars = original_use_polars


def test_actorpoolstrategy_apply_interrupt():
def test_actorpoolstrategy_apply_interrupt(shutdown_only):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep this is to be expected for tests that start a Ray cluster! If the cluster is shareable (e.g. same cluster config), the ray_start_regular_shared fixture takes care of sharing a common cluster across tests. But since this test needs a special cluster config, the shutdown_only fixture is the way to go.

@@ -4233,7 +4233,7 @@ def f(should_import_polars):
ctx.use_polars = original_use_polars


def test_actorpoolstrategy_apply_interrupt():
def test_actorpoolstrategy_apply_interrupt(shutdown_only):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supernit: test_actor_pool_strategy_apply_interrupt and test_actor_pool_strategy_default_max_actors may be a bit more readable, snake-casing the camel-cased class name instead of actorpoolstrategy. This is up to personal preference I think.

def test_actorpoolstrategy_default_max_actors(shutdown_only):
def f(x):
import time

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep our formatting script will add a newline after imports, even dynamic imports within functions.

@ericl ericl merged commit 7360452 into ray-project:master Jul 3, 2022
@ericl
Copy link
Contributor

ericl commented Jul 3, 2022

Merged, thanks!

@c21
Copy link
Contributor Author

c21 commented Jul 4, 2022

Thank you @ericl and @clarkzinzow for review!

@c21 c21 deleted the max-actors branch July 4, 2022 07:05
truelegion47 pushed a commit to truelegion47/ray that referenced this pull request Jul 6, 2022
* master: (104 commits)
  [Serve] Java Client API and End to End Tests (ray-project#22726)
  [Docs] Small fix to AIR examples descriptions (ray-project#26227)
  [Deployment Graph] Move `Deployment` creation outside to build function (ray-project#26129)
  [K8s][Ray Operator] Ignore resource requests when detected container resources. (ray-project#26234)
  Revert "[Core] Add retry exception allowlist for user-defined filteri… (ray-project#26289)
  [ci] pin gpustat (ray-project#26311)
  [tune] fix `set_tune_experiment` (ray-project#26298)
  Revert "Revert "[AIR][Serve] Rename ModelWrapperDeployment -> PredictorDeployment"" (ray-project#26231)
  [Release] Use nightly base images for release tests (ray-project#25373)
  Revert "[Core] fix gRPC handlers' unlimited active calls configuration (ray-project#25626)" (ray-project#26202)
  [RLlib] Some Docs fixes (2). (ray-project#26265)
  [C++ worker] Refine worker context and more (ray-project#26281)
  Fix file_system_monitor.cc message (ray-project#26143)
  [Java] Make Java test more stable (ray-project#26282)
  [air] Do not warn of `checkpoint_dir` if it's coming from us (base_trainer). (ray-project#26259)
  [Datasets] Support drop_columns API (ray-project#26200)
  [Datasets] Fix max number of actors for default actor pool strategy (ray-project#26266)
  [ci] Stop syncer staging tests (ray-project#26273)
  [core][gcs] Add storage namespace to redis storage in GCS. (ray-project#25994)
  [workflow] Deprecate workflow.create (ray-project#26106)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[data] Actor pool scales infinitely now if a max size is not specified
4 participants