-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RLlib] Add eval worker sub-env fault tolerance test. #26276
[RLlib] Add eval worker sub-env fault tolerance test. #26276
Conversation
rllib/evaluation/sampler.py
Outdated
@@ -1082,6 +1081,7 @@ def _process_observations( | |||
# If reset is async, we will get its result in some future poll. | |||
elif resetted_obs != ASYNC_RESET_RETURN: | |||
new_episode: Episode = active_episodes[env_id] | |||
assert not new_episode.is_faulty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Add an informative error message here.
…eval_worker_sub_env_failure_tests
@@ -2,8 +2,8 @@ cartpole-crashing-pg: | |||
env: ray.rllib.examples.env.cartpole_crashing.CartPoleCrashing | |||
run: PG | |||
stop: | |||
episode_reward_mean: 150.0 | |||
timesteps_total: 120000 | |||
evaluation/episode_reward_mean: 180.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure, learning success is measurable by eval workers.
@@ -83,13 +83,16 @@ def check_memory_leaks( | |||
action_sample = action_space.sample() | |||
|
|||
def code(): | |||
horizon = algorithm.config["horizon"] or float("inf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Respect horizon setting in this utility.
@@ -76,6 +76,7 @@ | |||
from ray.rllib.utils.test_utils import check_learning_achieved | |||
|
|||
parser = argparse.ArgumentParser() | |||
parser.add_argument("--evaluation-parallel-to-training", action="store_true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add parallel option to this script and add setting this option to tests in BUILD.
@@ -347,6 +350,30 @@ def test_no_env_but_eval_workers_do_have_env(self): | |||
bc.train() | |||
bc.stop() | |||
|
|||
def test_eval_workers_on_infinite_episodes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New test to check for proper behavior if eval workers are set to run by episodes, but episodes dont terminate.
@@ -175,6 +175,7 @@ def __init__(self, algo_class=None): | |||
self.evaluation_interval = None | |||
self.evaluation_duration = 10 | |||
self.evaluation_duration_unit = "episodes" | |||
self.evaluation_sample_timeout_s = 180.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New config value. Triggers a meaningful warning if breached with various things to try to fix the timeouts.
@@ -2339,6 +2358,14 @@ def _run_one_evaluation( | |||
"recreate_failed_workers" | |||
), | |||
) | |||
|
|||
# Add number of healthy evaluation workers after this iteration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This metric was missing.
@@ -753,7 +753,8 @@ def duration_fn(num_units_done): | |||
env_steps_this_iter += batch.env_steps() | |||
metrics = collect_metrics( | |||
self.workers.local_worker(), | |||
keep_custom_metrics=self.config["keep_per_episode_custom_metrics"], | |||
keep_custom_metrics=eval_cfg["keep_per_episode_custom_metrics"], | |||
timeout_seconds=eval_cfg["metrics_episode_collection_timeout_s"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bug: timeout setting was not passed to eval collect-metrics call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sven1977 LGTM.
…eval_worker_sub_env_failure_tests
…eval_worker_sub_env_failure_tests
…eval_worker_sub_env_failure_tests
…eval_worker_sub_env_failure_tests
Signed-off-by: sven1977 <[email protected]>
…eval_worker_sub_env_failure_tests
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
Signed-off-by: sven1977 <[email protected]>
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276) Signed-off-by: Xiaowei Jiang <[email protected]>
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276) Signed-off-by: Stefan van der Kleij <[email protected]>
restart_failed_sub_environments
feature for multi-agent settings (incl. multi-agent w/remote_worker_env=True
).restart_failed_sub_environments
in case a sub-environment fails during itsreset()
call (this was previously not supported).restart_failed_sub_environments
for multi-agent cartpole (crashing once in a while) and multi-agent cartpole with remote worker envs (crashing once in a while).TODO: Documentation PR that better explains what happens under the hood in case
restart_failed_sub_environments=True
and an environment fails during step or reset.Why are these changes needed?
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.