[AIR] More checkpoint configurability, Result extension #25943
Conversation
Thanks @Yard1!
I'm wondering if we can simplify the API, though. As a user, all I need is to:
- Get all reported metrics, ordered by iteration
- Get all checkpoint paths and their corresponding metrics
If I have these, it's easy to compute the best metric and best checkpoint based on whatever attribute I want, without needing to be concerned with checkpoint config or tune config.
If the resulting dataframe contains checkpoint paths, would it be sufficient to expose that?
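The selection the comment describes is easy to express in plain Python. This is an illustrative sketch only, not Ray's actual API: the `best_checkpoint` helper and the `(path, metrics)` pair format are assumptions for the example.

```python
# Illustrative sketch only -- not Ray's actual API. Given (path, metrics)
# pairs, pick the checkpoint optimizing an arbitrary metric key.
def best_checkpoint(checkpoints, attr, mode="max"):
    """checkpoints: list of (path, metrics_dict) tuples."""
    key = lambda pair: pair[1][attr]
    return (max if mode == "max" else min)(checkpoints, key=key)

ckpts = [
    ("ckpt_000", {"loss": 0.9, "acc": 0.61}),
    ("ckpt_001", {"loss": 0.4, "acc": 0.78}),
    ("ckpt_002", {"loss": 0.5, "acc": 0.74}),
]
print(best_checkpoint(ckpts, "acc")[0])          # ckpt_001
print(best_checkpoint(ckpts, "loss", "min")[0])  # ckpt_001
```

With checkpoint paths exposed alongside the metrics dataframe, this one-liner is all a user needs.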
@amogkam ok, let me see
@amogkam @xwjiang2010 Simplified, PTAL. @xwjiang2010 There was an oversight in the function runner where the duplicate result would get checkpointed again. I have disabled that so that `checkpoint_history` has the correct number of checkpoints, but let's see whether it breaks CI somewhere else.
Almost there
```python
f.write(json.dumps(dict(x=config["x"], step=i)))
tune.report(x=config["x"], step=i)

analysis = tune.run(f, config={"x": tune.grid_search([1, 3])})
```
Out of scope here, but we should really find a way to construct experiment checkpoints for testing without always running full runs.
Great, thanks!
Uses the new AIR Train API for examples and tests. The `Result` object gets a new attribute, `log_dir`, pointing to the Trial's `logdir`, allowing users to access TensorBoard logs and artifacts of other loggers. This PR only deals with "low-hanging fruit": tests that need substantial rewriting and the Train user guide are not touched; those will be updated in follow-up PRs. Tests and examples that concern deprecated features or that are duplicated in AIR have been removed or disabled. Requires #25943 to be merged in first.
Signed-off-by: Stefan van der Kleij <[email protected]>
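The `log_dir` attribute is just a path to the Trial's directory, so standard filesystem tools are enough to enumerate logger artifacts. A minimal sketch, using a temporary directory as a stand-in for a real `Result.log_dir` and made-up file names:

```python
from pathlib import Path
import tempfile

# Stand-in for result.log_dir; the files created here are hypothetical
# examples of what loggers might write (a CSV logger, a TensorBoard writer).
log_dir = Path(tempfile.mkdtemp())
(log_dir / "progress.csv").write_text("iter,loss\n1,0.5\n")
(log_dir / "events.out.tfevents.123").write_bytes(b"")

# List everything the loggers left behind.
artifacts = sorted(p.name for p in log_dir.iterdir())
print(artifacts)  # ['events.out.tfevents.123', 'progress.csv']
```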
Why are these changes needed?
This PR:
- Allows configuring `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
- Adds a new attribute to the `Result` object, `best_checkpoints`: a list of saved best checkpoints as determined by `CheckpointingConfig`.
Related issue number
Closes #24868
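The "keep the N best checkpoints, scored by an attribute" semantics that `keep_checkpoints_num` and `checkpoint_score_attr` provide can be sketched with a small min-heap. This is an illustration of the bookkeeping, not Ray's implementation; the class and method names are invented for the example.

```python
import heapq

# Illustrative sketch of top-N checkpoint retention -- NOT Ray's code.
class TopKCheckpoints:
    def __init__(self, num_to_keep, score_attr):
        self.num_to_keep = num_to_keep
        self.score_attr = score_attr
        self._heap = []  # min-heap of (score, path); lowest score evicted first

    def report(self, path, metrics):
        heapq.heappush(self._heap, (metrics[self.score_attr], path))
        if len(self._heap) > self.num_to_keep:
            heapq.heappop(self._heap)  # drop the lowest-scoring checkpoint

    def best_checkpoints(self):
        # Highest score first, mirroring a "best checkpoints" listing.
        return sorted(self._heap, reverse=True)

mgr = TopKCheckpoints(num_to_keep=2, score_attr="acc")
for i, acc in enumerate([0.5, 0.8, 0.6, 0.9]):
    mgr.report(f"ckpt_{i:03d}", {"acc": acc})
print(mgr.best_checkpoints())  # [(0.9, 'ckpt_003'), (0.8, 'ckpt_001')]
```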
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.