[RLlib] Smaller eval worker set fixes. #28811

sven1977 · 2022-09-27T11:39:29Z

Signed-off-by: sven1977 [email protected]

Smaller eval worker set fixes.

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: sven1977 <[email protected]>

sven1977 · 2022-09-27T11:39:57Z

rllib/algorithms/algorithm.py

@@ -1147,6 +1148,11 @@ def remote_fn(worker, w_ref, w_seq_no):
        # subsequent step results as latest evaluation result.
        self.evaluation_metrics = {"evaluation": metrics}

+        # Trigger `on_evaluate_end` callback.


This callback was missing for the enable_async_evaluation=True setting.
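
A minimal sketch of the behavior this comment describes, using toy stand-in classes rather than RLlib's actual `Algorithm` or callbacks classes. The names `RecordingCallbacks`, `ToyAlgorithm`, `_finish_evaluation`, and `_evaluate_sync` are hypothetical; only `on_evaluate_end`, `enable_async_evaluation`, `_evaluate_async`, and `evaluation_metrics` appear in this PR. The idea is that both evaluation paths go through the same finishing step, so the callback fires regardless of the setting.

```python
class RecordingCallbacks:
    """Toy stand-in for a callbacks object (not RLlib's real class)."""

    def __init__(self):
        self.calls = []

    def on_evaluate_end(self, *, algorithm, evaluation_metrics):
        # Record each invocation so we can check that it happened.
        self.calls.append(evaluation_metrics)


class ToyAlgorithm:
    """Hypothetical algorithm with a sync and an async evaluation path."""

    def __init__(self, callbacks, enable_async_evaluation=False):
        self.callbacks = callbacks
        self.enable_async_evaluation = enable_async_evaluation
        self.evaluation_metrics = {}

    def _finish_evaluation(self, metrics):
        # Shared finishing step: store the results, then trigger the callback.
        self.evaluation_metrics = {"evaluation": metrics}
        self.callbacks.on_evaluate_end(
            algorithm=self, evaluation_metrics=self.evaluation_metrics
        )
        return self.evaluation_metrics

    def _evaluate_sync(self):
        # Placeholder metrics; a real run would collect them from workers.
        return self._finish_evaluation({"episode_reward_mean": 1.0})

    def _evaluate_async(self):
        # The async path also ends in the shared finishing step.
        return self._finish_evaluation({"episode_reward_mean": 2.0})

    def evaluate(self):
        if self.enable_async_evaluation:
            return self._evaluate_async()
        return self._evaluate_sync()


# Usage: the callback fires once per evaluation in either mode.
for async_flag in (False, True):
    callbacks = RecordingCallbacks()
    ToyAlgorithm(callbacks, enable_async_evaluation=async_flag).evaluate()
    print(async_flag, len(callbacks.calls))
```

Routing both paths through one finishing step is only one possible design; the actual change in this PR adds the callback trigger inside the async code path.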

sven1977 · 2022-09-27T11:40:17Z

rllib/algorithms/algorithm.py

@@ -2701,7 +2707,11 @@ def _run_one_evaluation(
            if self.evaluation_workers is not None
            else 0
        )
-        eval_results["evaluation"]["num_recreated_workers"] = num_recreated
+        # Worker failures might have already been handled within `self._evaluate_async`


This count would be overridden by 0 for enable_async_evaluation=True.

🤔 I wonder if we should move this try_recover_from_step_attempt() into self.evaluate(), then.
Or, since I need to refactor all the remote gets anyway for elastic training, maybe I can clean this up when I get to remote_req_manager.

gjoliver

OK, can you maybe add a TODO for me above the if statement, like:

TODO(jungong): revisit after elastic async evaluation is done.
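
The override discussed in this thread can be illustrated with a small standalone sketch. It assumes (this is not confirmed by the visible diff) that the async path may already have written a `num_recreated_workers` count into the results, and that the fix is to set the count only when it is not already present. The helper name `record_recreated_workers` is hypothetical.

```python
def record_recreated_workers(eval_results, num_recreated):
    """Hypothetical helper: keep a count the async path already reported.

    Assumption: worker failures may have been handled earlier (for example
    within `_evaluate_async`), in which case the count is already in
    `eval_results` and must not be overwritten here.
    """
    evaluation = eval_results.setdefault("evaluation", {})
    # Only fill in the count if no earlier step has recorded one.
    evaluation.setdefault("num_recreated_workers", num_recreated)
    return eval_results


# Async case: an earlier step already recorded 2 recreated workers, and the
# later recovery step finds nothing left to recreate (0).
async_results = record_recreated_workers(
    {"evaluation": {"num_recreated_workers": 2}}, num_recreated=0
)

# Sync case: no count recorded yet, so the later value is used.
sync_results = record_recreated_workers({"evaluation": {}}, num_recreated=1)

print(async_results["evaluation"]["num_recreated_workers"])
print(sync_results["evaluation"]["num_recreated_workers"])
```

With the unconditional assignment shown on the removed line of the diff, the async case would end up reporting 0 instead of the earlier count.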

gjoliver · 2022-09-27T16:25:55Z

rllib/algorithms/algorithm.py

@@ -2701,7 +2707,11 @@ def _run_one_evaluation(
            if self.evaluation_workers is not None
            else 0
        )
-        eval_results["evaluation"]["num_recreated_workers"] = num_recreated
+        # Worker failures might have already been handled within `self._evaluate_async`


🤔 I wonder if we should move this try_recover_from_step_attempt() into self.evaluate(), then.
Or, since I need to refactor all the remote gets anyway for elastic training, maybe I can clean this up when I get to remote_req_manager.

wip

f048c32

Signed-off-by: sven1977 <[email protected]>

sven1977 requested review from gjoliver, avnishn, ArturNiederfahrenhorst, smorad, maxpumperla, kouroshHakha and krfricke as code owners September 27, 2022 11:39

sven1977 assigned gjoliver Sep 27, 2022

sven1977 commented Sep 27, 2022

gjoliver approved these changes Sep 27, 2022

sven1977 added 3 commits September 27, 2022 21:46

wip

f519f45

Signed-off-by: sven1977 <[email protected]>

fix

2938ae7

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into smal…

b6b231b

…l_eval_fixes

sven1977 merged commit 0686f36 into ray-project:master Sep 28, 2022

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022

[RLlib] Smaller eval worker set fixes. (ray-project#28811)

a1985e0

Signed-off-by: Weichen Xu <[email protected]>
