
[RLlib] Cleanup ActorManager and WorkerSet: Make all mark_healthy/healthy_only method args True by default. #44993

Merged

Conversation

sven1977
Contributor

@sven1977 sven1977 commented Apr 26, 2024

Cleanup ActorManager and WorkerSet: Make all mark_healthy/healthy_only method args True by default.

Also renamed WorkerSet.__actor_manager so it is no longer super-private (double-underscore name mangling makes it impossible to debug properly!).

Reasoning:
The default behavior of a WorkerSet.foreach_... call should be:

  • Only send requests to healthy workers by default. This avoids waiting up to the full timeout (sometimes a minute) for workers that we already know are NOT healthy.
  • mark_healthy should always be True anyway! If a worker that is sent a request (for whatever reason) then responds properly, we should mark it healthy, unless there is a good reason NOT to do so.
  • Once per iteration, Algorithm will try to restore all workers currently marked "unhealthy" via its restore_workers API. This method sends a special "ping" request to (only) the unhealthy workers, with a now-shortened timeout of 30s (previously 60s), and marks them healthy if they respond properly.
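The three bullets above can be sketched as a minimal, self-contained Python model. Note that SimpleActorManager, foreach, and restore are illustrative names only, not RLlib's actual API:

```python
# Hypothetical, simplified sketch of the default behavior described above.
# Real RLlib actors are remote Ray actors with timeouts; here an "actor" is
# just a callable, and a raised exception stands in for a failed request.

class SimpleActorManager:
    def __init__(self, actors):
        self.actors = actors  # actor_id -> callable "actor"
        self.healthy = {actor_id: True for actor_id in actors}

    def foreach(self, func, *, healthy_only=True, mark_healthy=True):
        """Apply `func` to actors; defaults mirror the new PR defaults."""
        results = {}
        for actor_id, actor in self.actors.items():
            if healthy_only and not self.healthy[actor_id]:
                continue  # Skip known-unhealthy actors: no timeout wasted.
            try:
                results[actor_id] = func(actor)
                if mark_healthy:
                    # Any proper response marks the actor healthy again.
                    self.healthy[actor_id] = True
            except Exception:
                self.healthy[actor_id] = False
        return results

    def restore(self):
        """Once per iteration: ping only the currently unhealthy actors."""
        for actor_id, actor in self.actors.items():
            if self.healthy[actor_id]:
                continue
            try:
                actor("ping")
                self.healthy[actor_id] = True
            except Exception:
                pass  # Still down; try again next iteration.
```

With these defaults, a failed request flips the actor to unhealthy, subsequent foreach calls skip it immediately, and only the periodic restore pass pings it again.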

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Apr 26, 2024
Collaborator

@simonsays1980 simonsays1980 left a comment


LGTM. Some comments regarding a potential influence on sample timeout settings.

timeout_seconds=self.config.worker_restore_timeout_s,
# Bring back actor after successful state syncing.
mark_healthy=True,
timeout_seconds=self.config.env_runner_restore_timeout_s,

Does this timeout, and the restoration time in general, influence sampling in any way (blocking or slowing it down, for example)? I ask because in that case sample_timeout_s could lead to no samples being returned by synchronous_parallel_sample and result in errors.

env_runner_health_probe_timeout_s: Max amount of time we should spend
waiting for health probe calls to finish. Health pings are very cheap,
so the default is 1 minute.
env_runner_health_probe_timeout_s: Max amount of time in seconds, we should
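For illustration, here is a minimal sketch of a time-bounded health probe; the function probe_actors and its signature are hypothetical, not RLlib code. The point is that a hung worker costs at most the configured timeout rather than blocking indefinitely:

```python
# Illustrative sketch only: probe all actors concurrently under one shared
# time budget, so the whole probe pass is bounded by `timeout_s`.
import concurrent.futures
import time

def probe_actors(pings, timeout_s):
    """Ping each actor (a callable); actors that miss the shared deadline
    or raise are reported as unhealthy."""
    healthy = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in pings.items()}
        deadline = time.monotonic() + timeout_s
        for name, fut in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                fut.result(timeout=remaining)
                healthy[name] = True
            except Exception:  # TimeoutError or actor-side failure
                healthy[name] = False
    return healthy
```

Under this model the probe itself never eats into sampling for longer than the probe timeout, though whether RLlib's probe and sample_timeout_s interact beyond that is exactly the reviewer's open question.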

Here as well: Any influence on/by sample_timeout_s?

@sven1977 sven1977 merged commit 1b36103 into ray-project:master Apr 27, 2024
5 checks passed
@sven1977 sven1977 deleted the switch_default_of_mark_healthy_to_true branch October 25, 2024 21:44