
[RLlib] Add eval worker sub-env fault tolerance test. #26276

Merged

Conversation

sven1977 (Contributor) commented on Jul 3, 2022

  • Fix the restart_failed_sub_environments feature for multi-agent settings (incl. multi-agent with remote_worker_env=True).
  • Add the previously missing support for restart_failed_sub_environments when a sub-environment fails during its reset() call.
  • Add logging messages around sub-environments being restarted ("trying to restart ..." and "restarted successfully").
  • Change the existing PG sub-environment fault-tolerance learning test to one that also adds eval workers (parallel evaluation and training).
  • Add more PG learning tests that set restart_failed_sub_environments for multi-agent CartPole (crashing once in a while) and for multi-agent CartPole with remote worker envs (crashing once in a while).
  • Add a parallel-evaluation-and-training option to the custom_eval.py example script and add it to BUILD.

TODO: Follow-up documentation PR that better explains what happens under the hood when restart_failed_sub_environments=True and an environment fails during step() or reset(); the config sketch below shows the settings involved.
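
For illustration, a hedged sketch of the kind of config these tests exercise. The keys are the ones referenced in this PR; everything else (how the dict is fed to the algorithm, the env registration) is assumed and may differ between Ray versions.

config = {
    # Example env used by the new tests: a CartPole that crashes once in a while.
    "env": "ray.rllib.examples.env.cartpole_crashing.CartPoleCrashing",
    # Restart only the crashed sub-environment (whether it fails in step() or
    # reset()) instead of tearing down the whole rollout worker.
    "restart_failed_sub_environments": True,
    # Evaluate on dedicated workers, in parallel with training.
    "evaluation_interval": 1,
    "evaluation_num_workers": 2,
    "evaluation_parallel_to_training": True,
    # New in this PR: upper bound (in seconds) on one evaluation sampling round.
    "evaluation_sample_timeout_s": 180.0,
}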

Why are these changes needed?

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -1082,6 +1081,7 @@ def _process_observations(
# If reset is async, we will get its result in some future poll.
elif resetted_obs != ASYNC_RESET_RETURN:
new_episode: Episode = active_episodes[env_id]
assert not new_episode.is_faulty
Reviewer comment (Contributor):
suggestion: Add an informative error message here.
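
A minimal sketch of what the suggested change could look like; the message wording is made up and is not the PR's actual code (new_episode and env_id come from the surrounding hunk):

assert not new_episode.is_faulty, (
    f"Expected a fresh (non-faulty) episode for env_id={env_id} after the "
    "sub-environment reset, but the episode is still marked as faulty."
)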

@@ -2,8 +2,8 @@ cartpole-crashing-pg:
env: ray.rllib.examples.env.cartpole_crashing.CartPoleCrashing
run: PG
stop:
episode_reward_mean: 150.0
timesteps_total: 120000
evaluation/episode_reward_mean: 180.0
Author comment (sven1977):
Make sure learning success is measurable by the eval workers.
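
For reference, the same stop criteria expressed as the Python dict one would hand to Ray Tune's stop= argument (a sketch mirroring the values in the hunk above, not the tuned-example YAML itself):

stop = {
    # Require the *evaluation* workers to reach the target reward, so learning
    # success is measured on eval rollouts rather than training rollouts.
    "evaluation/episode_reward_mean": 180.0,
    "timesteps_total": 120000,
}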

@@ -83,13 +83,16 @@ def check_memory_leaks(
action_sample = action_space.sample()

def code():
horizon = algorithm.config["horizon"] or float("inf")
Author comment (sven1977):
Respect horizon setting in this utility.
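
The or float("inf") idiom deserves a standalone illustration (a toy sketch, not the check_memory_leaks code): a horizon of None or 0 means "no step limit", so the rollout loop can always use a plain < comparison.

import random

class ToyEnv:
    # Minimal stand-in env with the old 4-tuple step() signature.
    def reset(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        done = random.random() < 0.01  # episode ends eventually
        return None, 0.0, done, {}

def rollout_one_episode(env, horizon=None):
    # Treat an unset horizon as infinite so the comparison below always works.
    horizon = horizon or float("inf")
    env.reset()
    num_timesteps, done = 0, False
    while not done and num_timesteps < horizon:
        _, _, done, _ = env.step(action=None)
        num_timesteps += 1
    return num_timesteps

print(rollout_one_episode(ToyEnv(), horizon=50))  # never exceeds 50 steps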

@@ -76,6 +76,7 @@
from ray.rllib.utils.test_utils import check_learning_achieved

parser = argparse.ArgumentParser()
parser.add_argument("--evaluation-parallel-to-training", action="store_true")
Author comment (sven1977):
Add the parallel-evaluation option to this script and set that option in the corresponding tests in BUILD.
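
A sketch of how such a flag typically flows into the evaluation config in the example scripts (the argparse lines mirror the diff above; the exact wiring inside custom_eval.py is not reproduced here):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--evaluation-parallel-to-training", action="store_true")
args = parser.parse_args()

config = {
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    # Run evaluation concurrently with the training step when the flag is given.
    "evaluation_parallel_to_training": args.evaluation_parallel_to_training,
}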

@@ -347,6 +350,30 @@ def test_no_env_but_eval_workers_do_have_env(self):
bc.train()
bc.stop()

def test_eval_workers_on_infinite_episodes(self):
Author comment (sven1977):
New test to check for proper behavior when eval workers are configured to run by episodes, but episodes never terminate.
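
The scenario under test, sketched as a config (key names as used in this PR; the real test's never-terminating env and its assertions are not reproduced):

config = {
    # Evaluation is configured in episodes, but the env never emits done=True.
    "evaluation_interval": 1,
    "evaluation_num_workers": 1,
    "evaluation_duration": 2,
    "evaluation_duration_unit": "episodes",
    # The new timeout is what keeps evaluation from hanging forever:
    # expect a warning instead of an infinite block.
    "evaluation_sample_timeout_s": 5.0,
}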

@@ -175,6 +175,7 @@ def __init__(self, algo_class=None):
self.evaluation_interval = None
self.evaluation_duration = 10
self.evaluation_duration_unit = "episodes"
self.evaluation_sample_timeout_s = 180.0
Author comment (sven1977):
New config value. If breached, it triggers a meaningful warning that lists various things to try in order to fix the timeout.
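
The general shape of such a timeout guard, as a standalone sketch (this is not RLlib's implementation; the function name and warning text are invented for illustration):

import time

def sample_eval_episodes(sample_one_episode, num_episodes, timeout_s=180.0):
    """Collect eval episodes until done or until timeout_s is breached."""
    start, episodes = time.time(), []
    while len(episodes) < num_episodes:
        if time.time() - start > timeout_s:
            print(
                "WARNING: evaluation sampling timed out. Consider increasing the "
                "timeout, lowering evaluation_duration, or switching "
                "evaluation_duration_unit to 'timesteps'."
            )
            break
        episodes.append(sample_one_episode())
    return episodes

# Example: each fake "episode" takes 2s, so the 5s budget is breached long
# before all 10 episodes are collected and a warning is printed instead.
eps = sample_eval_episodes(lambda: time.sleep(2) or "episode", 10, timeout_s=5.0)
print(len(eps))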

@@ -2339,6 +2358,14 @@ def _run_one_evaluation(
"recreate_failed_workers"
),
)

# Add number of healthy evaluation workers after this iteration.
Author comment (sven1977):
This metric was missing.

@@ -753,7 +753,8 @@ def duration_fn(num_units_done):
env_steps_this_iter += batch.env_steps()
metrics = collect_metrics(
self.workers.local_worker(),
keep_custom_metrics=self.config["keep_per_episode_custom_metrics"],
keep_custom_metrics=eval_cfg["keep_per_episode_custom_metrics"],
timeout_seconds=eval_cfg["metrics_episode_collection_timeout_s"],
Author comment (sven1977):
Bug fix: the timeout setting was not being passed to the eval collect_metrics() call.

@kouroshHakha (Contributor) left a review comment:

Thanks @sven1977 LGTM.

Signed-off-by: sven1977 <[email protected]>
sven1977 merged commit 4aea24c into ray-project:master on Jul 15, 2022
xwjiang2010 pushed a commit to xwjiang2010/ray that referenced this pull request Jul 19, 2022
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276)

Signed-off-by: Xiaowei Jiang <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
…crashes during `reset()`; +more tests and logging; add eval worker sub-env fault tolerance test. (ray-project#26276)

Signed-off-by: Stefan van der Kleij <[email protected]>