
[RLlib] Add eval worker sub-env fault tolerance test. #26276

Merged

Conversation

sven1977 (Contributor) commented on Jul 3, 2022

  • Fix the restart_failed_sub_environments feature for multi-agent settings (incl. multi-agent with remote_worker_env=True).
  • Add the previously missing support for restart_failed_sub_environments when a sub-environment fails during its reset() call.
  • Add logging messages around sub-environments being restarted ("trying to restart ..." and "restarted successfully").
  • Change the existing PG sub-environment fault-tolerance learning test to one that also adds eval workers (parallel evaluation and training).
  • Add more PG learning tests that set restart_failed_sub_environments for multi-agent CartPole (crashing once in a while) and for multi-agent CartPole with remote worker envs (crashing once in a while).
  • Add a parallel-evaluation-and-training option to the custom_eval.py example script and add it to BUILD.

TODO: Follow-up documentation PR that better explains what happens under the hood when restart_failed_sub_environments=True and an environment fails during step() or reset(); the config sketch below shows the settings involved.
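
For illustration, a hedged sketch of the kind of config these tests exercise. The keys are the ones referenced in this PR; everything else (how the dict is fed to the algorithm, the env registration) is assumed and may differ between Ray versions.

config = {
    # Example env used by the new tests: a CartPole that crashes once in a while.
    "env": "ray.rllib.examples.env.cartpole_crashing.CartPoleCrashing",
    # Restart only the crashed sub-environment (whether it fails in step() or
    # reset()) instead of tearing down the whole rollout worker.
    "restart_failed_sub_environments": True,
    # Evaluate on dedicated workers, in parallel with training.
    "evaluation_interval": 1,
    "evaluation_num_workers": 2,
    "evaluation_parallel_to_training": True,
    # New in this PR: upper bound (in seconds) on one evaluation sampling round.
    "evaluation_sample_timeout_s": 180.0,
}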

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1082,6 +1081,7 @@ def _process_observations(
# If reset is async, we will get its result in some future poll.
elif resetted_obs != ASYNC_RESET_RETURN:
new_episode: Episode = active_episodes[env_id]
assert not new_episode.is_faulty
Reviewer comment (Contributor):
suggestion: Add an informative error message here.
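
A minimal sketch of what the suggested change could look like; the message wording is made up and is not the PR's actual code (new_episode and env_id come from the surrounding hunk):

assert not new_episode.is_faulty, (
    f"Expected a fresh (non-faulty) episode for env_id={env_id} after the "
    "sub-environment reset, but the episode is still marked as faulty."
)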

@@ -2,8 +2,8 @@ cartpole-crashing-pg:
env: ray.rllib.examples.env.cartpole_crashing.CartPoleCrashing
run: PG
stop:
episode_reward_mean: 150.0
timesteps_total: 120000
evaluation/episode_reward_mean: 180.0
Author comment (sven1977):
Make sure learning success is measurable by the eval workers.
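
For reference, the same stop criteria expressed as the Python dict one would hand to Ray Tune's stop= argument (a sketch mirroring the values in the hunk above, not the tuned-example YAML itself):

stop = {
    # Require the *evaluation* workers to reach the target reward, so learning
    # success is measured on eval rollouts rather than training rollouts.
    "evaluation/episode_reward_mean": 180.0,
    "timesteps_total": 120000,
}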

@@ -83,13 +83,16 @@ def check_memory_leaks(
action_sample = action_space.sample()

def code():
horizon = algorithm.config["horizon"] or float("inf")
Author comment (sven1977):
Respect horizon setting in this utility.
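
The or float("inf") idiom deserves a standalone illustration (a toy sketch, not the check_memory_leaks code): a horizon of None or 0 means "no step limit", so the rollout loop can always use a plain < comparison.

import random

class ToyEnv:
    # Minimal stand-in env with the old 4-tuple step() signature.
    def reset(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        done = random.random() < 0.01  # episode ends eventually
        return None, 0.0, done, {}

def rollout_one_episode(env, horizon=None):
    # Treat an unset horizon as infinite so the comparison below always works.
    horizon = horizon or float("inf")
    env.reset()
    num_timesteps, done = 0, False
    while not done and num_timesteps < horizon:
        _, _, done, _ = env.step(action=None)
        num_timesteps += 1
    return num_timesteps

print(rollout_one_episode(ToyEnv(), horizon=50))  # never exceeds 50 steps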

@@ -76,6 +76,7 @@
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument("--evaluation-parallel-to-training", action="store_true")
Author comment (sven1977):
Add the parallel-evaluation option to this script and set that option in the corresponding tests in BUILD.
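
A sketch of how such a flag typically flows into the evaluation config in the example scripts (the argparse lines mirror the diff above; the exact wiring inside custom_eval.py is not reproduced here):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--evaluation-parallel-to-training", action="store_true")
args = parser.parse_args()

config = {
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    # Run evaluation concurrently with the training step when the flag is given.
    "evaluation_parallel_to_training": args.evaluation_parallel_to_training,
}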

@@ -347,6 +350,30 @@ def test_no_env_but_eval_workers_do_have_env(self):
bc.train()
bc.stop()

def test_eval_workers_on_infinite_episodes(self):
Author comment (sven1977):
New test to check for proper behavior when eval workers are configured to run by episodes, but episodes never terminate.
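
The scenario under test, sketched as a config (key names as used in this PR; the real test's never-terminating env and its assertions are not reproduced):

config = {
    # Evaluation is configured in episodes, but the env never emits done=True.
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    "evaluation_duration": 2,
    "evaluation_duration_unit": "episodes",
    # The new timeout is what keeps evaluation from hanging forever:
    # expect a warning instead of an infinite block.
    "evaluation_sample_timeout_s": 5.0,
}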

@@ -175,6 +175,7 @@ def __init__(self, algo_class=None):
self.evaluation_interval = None
self.evaluation_duration = 10
self.evaluation_duration_unit = "episodes"
self.evaluation_sample_timeout_s = 180.0
Author comment (sven1977):
New config value. If breached, it triggers a meaningful warning that lists various things to try in order to fix the timeout.
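
The general shape of such a timeout guard, as a standalone sketch (this is not RLlib's implementation; the function name and warning text are invented for illustration):

import time

def sample_eval_episodes(sample_one_episode, num_episodes, timeout_s=180.0):
    """Collect eval episodes until done or until timeout_s is breached."""
    start, episodes = time.time(), []
    while len(episodes) < num_episodes:
        if time.time() - start > timeout_s:
            print(
                "WARNING: evaluation sampling timed out. Consider increasing the "
                "timeout, lowering evaluation_duration, or switching "
                "evaluation_duration_unit to 'timesteps'."
            )
            break
        episodes.append(sample_one_episode())
    return episodes

# Example: each fake "episode" takes 2s, so the 5s budget is breached long
# before all 10 episodes are collected and a warning is printed instead.
eps = sample_eval_episodes(lambda: time.sleep(2) or "episode", 10, timeout_s=5.0)
print(len(eps))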

@@ -2339,6 +2358,14 @@ def _run_one_evaluation(
"recreate_failed_workers"
),
)

# Add number of healthy evaluation workers after this iteration.
Author comment (sven1977):
This metric was missing.

@@ -753,7 +753,8 @@ def duration_fn(num_units_done):
env_steps_this_iter += batch.env_steps()
metrics = collect_metrics(
self.workers.local_worker(),
keep_custom_metrics=self.config["keep_per_episode_custom_metrics"],
keep_custom_metrics=eval_cfg["keep_per_episode_custom_metrics"],
timeout_seconds=eval_cfg["metrics_episode_collection_timeout_s"],
Author comment (sven1977):
Bug fix: the timeout setting was not being passed to the eval collect_metrics() call.

@kouroshHakha (Contributor) left a review comment:

Thanks @sven1977 LGTM.

Signed-off-by: sven1977 <[email protected]>
sven1977 merged commit 4aea24c into ray-project:master on Jul 15, 2022
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276)

Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276)

Signed-off-by: Stefan van der Kleij <[email protected]>