[AIR] More checkpoint configurability, Result extension #25943
Conversation
Thanks @Yard1!
I'm wondering if we can simplify the API, though. As a user, all I need is to:
- Get all reported metrics, ordered by iteration
- Get all checkpoint paths and their corresponding metrics
If I have these, it's easy to compute the best metric and best checkpoint based on whatever attribute I want, without needing to be concerned with checkpoint config or tune config.
If the resulting dataframe contains checkpoint paths, would it be sufficient to expose that?
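The selection the comment describes is easy to express in plain Python. This is an illustrative sketch only, not Ray's actual API: the `best_checkpoint` helper and the `(path, metrics)` pair format are assumptions for the example.

```python
# Illustrative sketch only -- not Ray's actual API. Given (path, metrics)
# pairs, pick the checkpoint optimizing an arbitrary metric key.
def best_checkpoint(checkpoints, attr, mode="max"):
    """checkpoints: list of (path, metrics_dict) tuples."""
    key = lambda pair: pair[1][attr]
    return (max if mode == "max" else min)(checkpoints, key=key)

ckpts = [
    ("ckpt_000", {"loss": 0.9, "acc": 0.61}),
    ("ckpt_001", {"loss": 0.4, "acc": 0.78}),
    ("ckpt_002", {"loss": 0.5, "acc": 0.74}),
]
print(best_checkpoint(ckpts, "acc")[0])          # ckpt_001
print(best_checkpoint(ckpts, "loss", "min")[0])  # ckpt_001
```

With checkpoint paths exposed alongside the metrics dataframe, this one-liner is all a user needs.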
@amogkam ok, let me see
@amogkam @xwjiang2010 Simplified, PTAL. @xwjiang2010 There was an oversight in the function runner where the duplicate result would get checkpointed again. I have disabled that so that `checkpoint_history` has the correct number of checkpoints, but let's see whether it breaks CI somewhere else.
Almost there
```python
f.write(json.dumps(dict(x=config["x"], step=i)))
tune.report(x=config["x"], step=i)

analysis = tune.run(f, config={"x": tune.grid_search([1, 3])})
```
Out of scope here, but we should really find a way to construct experiment checkpoints for testing without always running full runs.
Great, thanks!
Uses the new AIR Train API for examples and tests. The `Result` object gets a new attribute, `log_dir`, pointing to the Trial's `logdir`, allowing users to access TensorBoard logs and artifacts of other loggers. This PR only deals with "low-hanging fruit": tests that need substantial rewriting and the Train user guide are not touched; those will be updated in follow-up PRs. Tests and examples that concern deprecated features or that are duplicated in AIR have been removed or disabled. Requires #25943 to be merged in first.
Signed-off-by: Stefan van der Kleij <[email protected]>
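The `log_dir` attribute is just a path to the Trial's directory, so standard filesystem tools are enough to enumerate logger artifacts. A minimal sketch, using a temporary directory as a stand-in for a real `Result.log_dir` and made-up file names:

```python
from pathlib import Path
import tempfile

# Stand-in for result.log_dir; the files created here are hypothetical
# examples of what loggers might write (a CSV logger, a TensorBoard writer).
log_dir = Path(tempfile.mkdtemp())
(log_dir / "progress.csv").write_text("iter,loss\n1,0.5\n")
(log_dir / "events.out.tfevents.123").write_bytes(b"")

# List everything the loggers left behind.
artifacts = sorted(p.name for p in log_dir.iterdir())
print(artifacts)  # ['events.out.tfevents.123', 'progress.csv']
```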
Why are these changes needed?
This PR:
- Allows configuring `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
- Adds a new attribute to the `Result` object, `best_checkpoints`: a list of saved best checkpoints as determined by `CheckpointingConfig`.
Related issue number
Closes #24868
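The "keep the N best checkpoints, scored by an attribute" semantics that `keep_checkpoints_num` and `checkpoint_score_attr` provide can be sketched with a small min-heap. This is an illustration of the bookkeeping, not Ray's implementation; the class and method names are invented for the example.

```python
import heapq

# Illustrative sketch of top-N checkpoint retention -- NOT Ray's code.
class TopKCheckpoints:
    def __init__(self, num_to_keep, score_attr):
        self.num_to_keep = num_to_keep
        self.score_attr = score_attr
        self._heap = []  # min-heap of (score, path); lowest score evicted first

    def report(self, path, metrics):
        heapq.heappush(self._heap, (metrics[self.score_attr], path))
        if len(self._heap) > self.num_to_keep:
            heapq.heappop(self._heap)  # drop the lowest-scoring checkpoint

    def best_checkpoints(self):
        # Highest score first, mirroring a "best checkpoints" listing.
        return sorted(self._heap, reverse=True)

mgr = TopKCheckpoints(num_to_keep=2, score_attr="acc")
for i, acc in enumerate([0.5, 0.8, 0.6, 0.9]):
    mgr.report(f"ckpt_{i:03d}", {"acc": acc})
print(mgr.best_checkpoints())  # [(0.9, 'ckpt_003'), (0.8, 'ckpt_001')]
```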
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.