[air/tune] Add resume experiment options to Tuner.restore() #26826
Conversation
I wonder why flat kwargs are preferred. Looking across other AIR APIs, we seem to prefer Config. Promoting to Config may require additional deprecation cycles, so I want to be cautious here. Btw, how does restore work together with a lazy datasource dataset? The current contract for a Tuner run is that the user is expected to run
For the dataset, I think there are several options.
I think 1 or 4 are both equally good options. 4 should already be possible by using something like
right? 3 would be a shortcut for that, and 2 would be a shortcut for
For kwargs vs. config, the main reason we use configs in other places is to not clutter the argument space of more high-level APIs (constructors), to group similar arguments together, and to still provide typing hints. In
The general rule for config is to use it only if the flat-arg docstring would go past 8 args, so I agree with Kai's approach for now.
For the dataset, let's just punt on it, and for people who need it we should ask them to upvote and comment on a specific issue.
resume: One of [True, False, "LOCAL", "REMOTE", "PROMPT", "AUTO"]. Can
    be suffixed with one or more of ["+ERRORED", "+ERRORED_ONLY",
    "+RESTART_ERRORED", "+RESTART_ERRORED_ONLY"] (e.g. ``AUTO+ERRORED``).
    "LOCAL"/True restores the checkpoint from the
tune.run is still external facing at this point, right?
My understanding is that users should not bother with using tune.run with "xxx+yyy" sort of configurations?
So the question is: should we still keep the docstring unchanged?
Or should we consider just having an internal-only argument for the resume config for tune.run, to spare the trouble of "xxx+yyy" and parsing it later?
I actually think we should just support the +suffix strings - users have been asking for e.g. REMOTE+ERRORED in the past, and this way they can also do this with the tune.run API.
In this case I also think it's fine to keep it, as there is a clear conversion between the strings and the internal resume config - thus, if we decide later to change the tune.run() API to use a config, we can easily deprecate the strings (which we would have to do anyway).
python/ray/tune/tuner.py
Outdated
resume_unfinished: If True, will continue to run unfinished trials.
resume_errored: If True, will re-schedule errored trials and try to
    restore from their latest checkpoints.
restart_errored: If True, will re-schedule errored trials but force
    restarting them frmo scratch (no checkpoint will be loaded).
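For reference, a minimal sketch of restoring an experiment with these arguments; the path is a placeholder and the snippet is only an assumed usage, not code from this PR.

```python
from ray.tune import Tuner

# Continue unfinished trials (the default) and additionally re-schedule
# errored trials from their latest checkpoints. Passing
# restart_errored=True instead would restart errored trials from scratch.
tuner = Tuner.restore(
    "~/ray_results/resume_demo",  # placeholder path to a prior experiment
    resume_unfinished=True,
    resume_errored=True,
)
results = tuner.fit()
```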
Typo in "from"
How do these three interact with each other? I think we should document this. For example, what happens if you do resume_errored=True and restart_errored=True? etc.
OK just thought this through. So basically there are 3 groups of trials:
Finished, Unfinished, and Errored.
For Finished, we should skip them. Nothing to do.
For Unfinished, we should provide the options {skip, resume}.
For Errored, we should provide {skip, restart, resume from checkpoint}.
And all 3 groups of trials should be orthogonal to each other.
Can we provide a grouping similar to the above?
Isn't this exactly what we currently do?
- Finished trials are always added and show up as TERMINATED in the stats. They are not re-run, but users will have them in their results table.
- Unfinished: We have resume_unfinished for this.
- Errored: We have resume_errored and restart_errored for this. If neither is set, they are skipped (they turn up as ERRORED in the result). If resume is set, they are resumed from checkpoint. If restart is set, they are restarted from scratch.
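To make the mapping concrete, here is a hedged sketch of how each trial group relates to the restore arguments; the path is a placeholder and the defaults are as documented in this PR.

```python
from ray.tune import Tuner

path = "~/ray_results/resume_demo"  # placeholder experiment path

# Finished trials: always kept and reported as TERMINATED.
# Unfinished trials: resumed by default; skipped with resume_unfinished=False.
# Errored trials: skipped by default (reported as ERRORED) unless one of
# the two flags below is set.
Tuner.restore(path)                        # skip errored trials
Tuner.restore(path, resume_errored=True)   # resume errored trials from checkpoint
Tuner.restore(path, restart_errored=True)  # restart errored trials from scratch
```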
Or is the naming/documentation confusing?
I've added a more extensive docstring to explain the behavior
Sounds good, what I meant was to have 1 flag for the errored trials, not two.
Big plus! This is a great benefit for us. Before, we sometimes had a lot of pain and crying when e.g. RL workloads were running for days on GPUs and then errored out because of full disks. The runs could not be recovered, as the driver held the data but a new cluster could not use the upload_dir. I will try this out in my next runs on GCP and give some feedback. Thank you all for implementing this! Spares us costs, time and nerves :)
This should resolve #24918
One thing I remember: when using the trial_dirname_creator in tune.run(), the resumed experiment did not use it, while the errored one did. Did anyone also run into this issue?
Why are these changes needed?
We can't currently configure resume options in Ray AIR's Tuner API. Additionally, resume configuration in Tune is generally limited to specific cases.
This PR adds arguments to Tuner.restore() to configure resume behavior. It also updates tune.run()'s resume configuration to take in suffixes. Users can configure whether they want to restart or resume errored trials, and they can do this in any combination with the LOCAL/REMOTE/AUTO keys.

Implementation details: The _ResumeConfig is currently a private class to hold information and is only passed around internally. The user-facing API is the Tuner.restore() kwargs, and for tune.run() it remains the string. The reason I didn't promote it to an actual config is that the Tuner.restore() kwargs seem nicer and we want to avoid introducing new configuration classes for the legacy tune.run() API. We can choose to promote this if the need arises later.

Related issue number
Addresses part of #24918
Checks
I've run scripts/format.sh to lint the changes in this PR.