[RLlib] Cleanup examples folder #14: Add example script for how to resume a tune.Tuner.fit() experiment from a checkpoint. #45681

sven1977 · 2024-06-03T08:52:34Z

Cleanup examples folder #14: Add example script for how to resume a tune.Tuner.fit() experiment from a checkpoint.

Also contains a bug fix for MetricsLogger and Stats, plus a small API change to MetricsLogger.peek(): it now takes key instead of *key, which unifies its signature with all the other MetricsLogger methods.
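To make the signature change concrete, here is a minimal sketch of a logger whose methods all take a single key argument (a string, or a tuple of strings for nested entries). ToyMetricsLogger and its internals are hypothetical stand-ins written for illustration; they are not RLlib's MetricsLogger implementation.

```python
from typing import Any, Dict, Tuple, Union

# A key is either a flat name or a tuple describing a nested path.
Key = Union[str, Tuple[str, ...]]


class ToyMetricsLogger:
    """Hypothetical stand-in for illustration only, NOT RLlib's MetricsLogger."""

    def __init__(self) -> None:
        self._stats: Dict[str, Any] = {}

    @staticmethod
    def _path(key: Key) -> Tuple[str, ...]:
        # Normalize both accepted key forms to a tuple path.
        return (key,) if isinstance(key, str) else tuple(key)

    def log_value(self, key: Key, value: float) -> None:
        # Walk (and create) the nested dicts, then store the value at the leaf.
        node = self._stats
        *parents, leaf = self._path(key)
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value

    def peek(self, key: Key) -> Any:
        # Same `key` signature as log_value(), rather than a variadic `*key`.
        node = self._stats
        for part in self._path(key):
            node = node[part]
        return node


logger = ToyMetricsLogger()
logger.log_value("loss", 0.5)
logger.log_value(("env_runners", "episode_return_mean"), -800.0)
```

With a single key parameter, a nested entry is read with logger.peek(("env_runners", "episode_return_mean")), the same form used to write it.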

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

simonsays1980

LGTM. Invaluable example for users!

simonsays1980 · 2024-06-03T09:17:39Z

rllib/examples/checkpoints/continue_training_from_checkpoint.py

+    tuner = tune.Tuner(
+        trainable=config.algo_class,
+        param_space=config,
+        run_config=air.RunConfig(


In regard to the future deprecation of air: Can we use ray.train.RunConfig here instead?

simonsays1980 · 2024-06-03T09:18:21Z

rllib/examples/checkpoints/continue_training_from_checkpoint.py

+        param_space=config,
+        run_config=air.RunConfig(
+            callbacks=tune_callbacks,
+            checkpoint_config=air.CheckpointConfig(


Same here: can we use ray.train.CheckpointConfig?

simonsays1980 · 2024-06-03T09:22:33Z

rllib/examples/checkpoints/continue_training_from_checkpoint.py

+    results = tuner.fit()
+    experiment_name = Path(results.experiment_path).name
+
+    # Extract the latest checkpoint from the results and confirm it's the right one.


Let's rephrase this comment. get_best_result returns the latest checkpoint only in this specific setup and only with a checkpoint frequency of 1; otherwise it returns the checkpoint with the highest episode_return_mean, whenever that occurred during training.

Ah, good catch! Yes, in this example, we should probably just use the last checkpoint, not necessarily the best. ...
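The distinction raised here can be sketched without Ray at all. The records below are made-up values and CheckpointRecord is a hypothetical container, not a Ray Tune type; the point is only that "latest" and "best by episode_return_mean" can pick different checkpoints.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckpointRecord:
    """Hypothetical record for illustration; not a Ray Tune class."""
    iteration: int
    episode_return_mean: float
    path: str


def latest_checkpoint(records: List[CheckpointRecord]) -> CheckpointRecord:
    # The checkpoint written last, regardless of how well the policy performed.
    return max(records, key=lambda r: r.iteration)


def best_checkpoint(records: List[CheckpointRecord]) -> CheckpointRecord:
    # The checkpoint with the highest metric, whenever it occurred.
    return max(records, key=lambda r: r.episode_return_mean)


# Made-up numbers: performance peaks at iteration 2, then drops slightly.
records = [
    CheckpointRecord(1, -900.0, "checkpoint_000000"),
    CheckpointRecord(2, -350.0, "checkpoint_000001"),
    CheckpointRecord(3, -420.0, "checkpoint_000002"),
]

latest = latest_checkpoint(records)
best = best_checkpoint(records)
```

In this toy run the two selections differ, which is why an example about resuming training should pick the last checkpoint explicitly rather than rely on a best-result lookup.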

simonsays1980 · 2024-06-03T09:25:59Z

rllib/examples/checkpoints/restore_1_of_n_agents_from_checkpoint.py

-        # TODO (simon): Change to -800 once the metrics are fixed. Currently
-        # the combined return is not correctly computed.
-        f"{ENV_RUNNER_RESULTS}/episode_return_mean": -400,
+        f"{ENV_RUNNER_RESULTS}/{EPISODE_RETURN_MEAN}": -800,


Great catch!
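For readers wondering why the threshold doubles, one plausible reading is that the combined return is the sum of the per-agent returns. The sketch below only illustrates that arithmetic; the two-agent setup, the agent names, and the per-agent values are assumptions, and the thread itself only states that the combined return was previously computed incorrectly.

```python
from typing import Dict


def combined_episode_return(per_agent_returns: Dict[str, float]) -> float:
    # Sum the returns of all agents that took part in the episode.
    return sum(per_agent_returns.values())


# Hypothetical values: two agents, each ending an episode at -400.
per_agent = {"agent_0": -400.0, "agent_1": -400.0}
total = combined_episode_return(per_agent)
```

Under these assumed numbers, a stopping criterion on the combined return sits at -800, while -400 would match a single agent's return.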

sven1977 added 3 commits June 1, 2024 14:28

wip

840987d

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into clea…

f7c1e8c

…nup_examples_folder_14_continue_training_from_checkpoint

wip

9d7cf93

Signed-off-by: sven1977 <[email protected]>

sven1977 requested review from ArturNiederfahrenhorst and simonsays1980 as code owners June 3, 2024 08:52

sven1977 assigned simonsays1980 Jun 3, 2024

wip

cca427d

Signed-off-by: sven1977 <[email protected]>

simonsays1980 approved these changes Jun 3, 2024


sven1977 added 2 commits June 3, 2024 12:08

small fixes

21678bb

Signed-off-by: sven1977 <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into clea…

ca00a0b

…nup_examples_folder_14_continue_training_from_checkpoint

sven1977 enabled auto-merge (squash) June 3, 2024 10:08

github-actions bot disabled auto-merge June 3, 2024 10:08

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Jun 3, 2024

sven1977 enabled auto-merge (squash) June 3, 2024 10:18

fix

1ed0f60

Signed-off-by: sven1977 <[email protected]>

github-actions bot disabled auto-merge June 3, 2024 11:38

sven1977 enabled auto-merge (squash) June 3, 2024 13:28

sven1977 merged commit 440aa81 into ray-project:master Jun 3, 2024
7 checks passed

sven1977 deleted the cleanup_examples_folder_14_continue_training_from_checkpoint branch June 3, 2024 14:44
