[air/tune] Add resume experiment options to Tuner.restore() #26826
Conversation
I wonder why flat kwargs are preferred. Looking across other AIR APIs, we seem to prefer Config. Promoting to Config may require additional deprecation cycles, so I want to be cautious here. Btw, how does restore work together with a lazy datasource dataset? The current contract for a Tuner run is that the user is expected to run
For the dataset, I think there are several options.
I think 1 or 4 are both equally good options. 4 should already be possible by using something like
right? 3 would be a shortcut for that, and 2 would be a shortcut for
For kwargs vs. config, the main reason we use configs in other places is to not clutter the argument space of more high-level APIs (constructors), to group similar arguments together, and to still provide typing hints. In
The general rule for config is to use it only if the flat-arg docstring would go past 8 args, so I agree with Kai's approach for now.
For the dataset, let's just punt on it, and for people who need it we should ask them to upvote and comment on a specific issue.
resume: One of [True, False, "LOCAL", "REMOTE", "PROMPT", "AUTO"]. Can
    be suffixed with one or more of ["+ERRORED", "+ERRORED_ONLY",
    "+RESTART_ERRORED", "+RESTART_ERRORED_ONLY"] (e.g. ``AUTO+ERRORED``).
    "LOCAL"/True restores the checkpoint from the
tune.run is still external facing at this point, right?
My understanding is that users should not bother with using tune.run with "xxx+yyy" sort of configurations?
So the question is: should we still keep the docstring unchanged?
Or should we consider just having an internal-only argument for the resume config for tune.run, to spare the trouble of "xxx+yyy" and parsing it later?
I actually think we should just support the +suffix strings - users have been asking for e.g. REMOTE+ERRORED in the past, and this way they can also do this with the tune.run API.
In this case I also think it's fine to keep it, as there is a clear conversion between the strings and the internal resume config - thus, if we decide later to change the tune.run() API to use a config, we can easily deprecate the strings (which we would have to do anyway).
python/ray/tune/tuner.py
Outdated
resume_unfinished: If True, will continue to run unfinished trials.
resume_errored: If True, will re-schedule errored trials and try to
    restore from their latest checkpoints.
restart_errored: If True, will re-schedule errored trials but force
    restarting them frmo scratch (no checkpoint will be loaded).
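For reference, a minimal sketch of restoring an experiment with these arguments; the path is a placeholder and the snippet is only an assumed usage, not code from this PR.

```python
from ray.tune import Tuner

# Continue unfinished trials (the default) and additionally re-schedule
# errored trials from their latest checkpoints. Passing
# restart_errored=True instead would restart errored trials from scratch.
tuner = Tuner.restore(
    "~/ray_results/resume_demo",  # placeholder path to a prior experiment
    resume_unfinished=True,
    resume_errored=True,
)
results = tuner.fit()
```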
Typo in "from"
How do these three interact with each other? I think we should document this. For example, what happens if you do resume_errored=True and restart_errored=True? etc.
OK just thought this through. So basically there are 3 groups of trials:
Finished, Unfinished, and Errored.
For Finished, we should skip them. Nothing to do.
For Unfinished, we should provide the options {skip, resume}.
For Errored, we should provide {skip, restart, resume from checkpoint}.
And all 3 groups of trials should be orthogonal to each other.
Can we provide a grouping similar to the above?
Isn't this exactly what we currently do?
- Finished trials are always added and show up as TERMINATED in the stats. They are not re-run, but users will have them in their results table.
- Unfinished: We have resume_unfinished for this.
- Errored: We have resume_errored and restart_errored for this. If neither is set, they are skipped (they turn up as ERRORED in the result). If resume is set, they are resumed from checkpoint. If restart is set, they are restarted from scratch.
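To make the mapping concrete, here is a hedged sketch of how each trial group relates to the restore arguments; the path is a placeholder and the defaults are as documented in this PR.

```python
from ray.tune import Tuner

path = "~/ray_results/resume_demo"  # placeholder experiment path

# Finished trials: always kept and reported as TERMINATED.
# Unfinished trials: resumed by default; skipped with resume_unfinished=False.
# Errored trials: skipped by default (reported as ERRORED) unless one of
# the two flags below is set.
Tuner.restore(path)                        # skip errored trials
Tuner.restore(path, resume_errored=True)   # resume errored trials from checkpoint
Tuner.restore(path, restart_errored=True)  # restart errored trials from scratch
```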
Or is the naming/documentation confusing?
I've added a more extensive docstring to explain the behavior
Sounds good, what I meant was to have 1 flag for the errored trials, not two.
Big plus! This is a great benefit for us. Before, we sometimes had a lot of pain and crying when e.g. RL workloads were running for days on GPUs and then errored out because of full disks. The runs could not be recovered, as the driver held the data but a new cluster could not use the upload_dir. I will try this out in my next runs on GCP and give some feedback. Thank you all for implementing this! Spares us costs, time and nerves :)
This should resolve #24918
One thing I remember: when using the trial_dirname_creator in tune.run(), the resumed experiment did not use it, while the errored one did. Did anyone also run into this issue?
Why are these changes needed?
We can't currently configure resume options in Ray AIR's Tuner API. Additionally, resume configuration in Tune is generally limited to specific cases.
This PR adds arguments to Tuner.restore() to configure resume behavior. It also updates tune.run()'s resume configuration to take in suffixes. Users can configure whether they want to restart or resume errored trials, and they can do this in any combination with the LOCAL/REMOTE/AUTO keys.

Implementation details: The _ResumeConfig is currently a private class to hold information and is only passed around internally. The user-facing API is the Tuner.restore() kwargs, and for tune.run() it remains the string. The reason I didn't promote it to an actual config is that the Tuner.restore() kwargs seem nicer and we want to avoid introducing new configuration classes for the legacy tune.run() API. We can choose to promote this if the need arises later.

Related issue number
Addresses part of #24918
Checks
I've run scripts/format.sh to lint the changes in this PR.