[RLlib] Cleanup `ActorManager` and `WorkerSet`: Make all `mark_healthy`/`healthy_only` method args `True` by default. #44993

sven1977 · 2024-04-26T10:51:38Z

Cleanup ActorManager and WorkerSet: Make all mark_healthy/healthy_only method args True by default.

Also renamed WorkerSet.__actor_manager so that it is no longer super-private (super-privates make it impossible to debug properly!).
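For context on why double-underscore attributes are awkward to debug: Python name-mangles them, so they are stored under `_ClassName__attr` and cannot be reached by their written name from outside the class. A minimal sketch, assuming a hypothetical `Holder` class (not RLlib code):

```python
class Holder:
    """Hypothetical stand-in for a class holding a manager object."""

    def __init__(self):
        # Double leading underscore: stored as `_Holder__manager` (name mangling).
        self.__manager = "mangled"
        # Single leading underscore: stored under the name as written.
        self._manager = "plain"


holder = Holder()

# Instance attributes as they actually exist on the object.
attrs = vars(holder)

# Outside the class body, the written name does not resolve.
direct_visible = hasattr(holder, "__manager")

# Only the mangled name works, which is easy to miss in a debugger.
mangled_visible = hasattr(holder, "_Holder__manager")
```

A single leading underscore keeps the "internal" signal while leaving the attribute inspectable under its own name.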

Reasoning:
The default behavior of a WorkerSet.foreach_... call should be:

- Only send requests to healthy workers by default. This avoids having to wait up to the full timeout (sometimes a minute) for workers that we already know are NOT healthy.
- mark_healthy should always be True anyway! If any worker that is sent a request then responds properly, for whatever reason, we should mark it healthy, unless there is a good reason NOT to do so.
- Once per iteration, Algorithm will try to restore all workers currently marked "unhealthy" via its restore_workers API. This method sends a special "ping" request to the unhealthy workers only (with a timeout now shortened to 30s, down from 60s) and marks them healthy if they respond properly.
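The intended defaults can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyWorker`, `ToyManager`, `foreach`, and `restore` are hypothetical names made up for illustration, not the actual `ActorManager`/`WorkerSet` API, and timeouts are not modeled (an unresponsive worker simply raises).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyWorker:
    """Hypothetical worker; `alive` simulates whether it answers requests."""

    alive: bool = True

    def call(self, fn: Callable[["ToyWorker"], object]) -> object:
        if not self.alive:
            raise RuntimeError("worker did not respond")
        return fn(self)


@dataclass
class ToyManager:
    """Hypothetical manager that tracks a health flag per worker."""

    workers: Dict[int, ToyWorker]
    healthy: Dict[int, bool] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every worker starts out healthy.
        for worker_id in self.workers:
            self.healthy.setdefault(worker_id, True)

    def foreach(self, fn, *, healthy_only: bool = True, mark_healthy: bool = True):
        results = {}
        for worker_id, worker in self.workers.items():
            # Default: do not wait on workers already known to be unhealthy.
            if healthy_only and not self.healthy[worker_id]:
                continue
            try:
                results[worker_id] = worker.call(fn)
            except RuntimeError:
                self.healthy[worker_id] = False
                continue
            # Default: a worker that responded properly counts as healthy.
            if mark_healthy:
                self.healthy[worker_id] = True
        return results

    def restore(self) -> List[int]:
        """Ping only the unhealthy workers; mark responders healthy again."""
        restored = []
        for worker_id, worker in self.workers.items():
            if self.healthy[worker_id]:
                continue
            try:
                worker.call(lambda w: "pong")
            except RuntimeError:
                continue
            self.healthy[worker_id] = True
            restored.append(worker_id)
        return restored


manager = ToyManager({0: ToyWorker(), 1: ToyWorker(alive=False)})
first = manager.foreach(lambda w: "ok")   # worker 1 fails and is marked unhealthy
second = manager.foreach(lambda w: "ok")  # worker 1 is skipped, no waiting
manager.workers[1].alive = True           # worker 1 comes back
restored = manager.restore()              # once-per-iteration style ping
third = manager.foreach(lambda w: "ok")   # both workers are used again
```

In this toy model, the only way an unhealthy worker rejoins the default `foreach` calls is through the periodic `restore` ping, which mirrors the reasoning in the list above.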

Why are these changes needed?

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>

simonsays1980

LGTM. Some comments regarding the potential influence on sample timeout settings.

simonsays1980 · 2024-04-27T12:13:59Z

rllib/algorithms/algorithm.py

-                timeout_seconds=self.config.worker_restore_timeout_s,
-                # Bring back actor after successful state syncing.
-                mark_healthy=True,
+                timeout_seconds=self.config.env_runner_restore_timeout_s,


Do this timeout and the restoration time in general influence sampling in any way (for example, by blocking it or slowing it down)? I ask because, in that case, sample_timeout_s could cause synchronous_parallel_sample to return no samples and lead to errors.

simonsays1980 · 2024-04-27T12:15:08Z

rllib/algorithms/algorithm_config.py

-            env_runner_health_probe_timeout_s: Max amount of time we should spend
-                waiting for health probe calls to finish. Health pings are very cheap,
-                so the default is 1 minute.
+            env_runner_health_probe_timeout_s: Max amount of time in seconds, we should


Here as well: Any influence on/by sample_timeout_s?

Signed-off-by: Sven Mika <[email protected]>

wip

c6db364

Signed-off-by: sven1977 <[email protected]>

sven1977 marked this pull request as ready for review April 26, 2024 10:51

sven1977 requested review from avnishn, ArturNiederfahrenhorst, maxpumperla, kouroshHakha and simonsays1980 as code owners April 26, 2024 10:51

wip

1557d46

Signed-off-by: sven1977 <[email protected]>

sven1977 assigned simonsays1980 Apr 26, 2024

sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 26, 2024

simonsays1980 approved these changes Apr 27, 2024

Merge branch 'master' into switch_default_of_mark_healthy_to_true

da5fc0b

Signed-off-by: Sven Mika <[email protected]>

sven1977 merged commit 1b36103 into ray-project:master Apr 27, 2024
5 checks passed

sven1977 deleted the switch_default_of_mark_healthy_to_true branch October 25, 2024 21:44
