[air/tune] Add resume experiment options to Tuner.restore() #26826

Merged

richardliaw merged 7 commits into ray-project:master from the tune/resume-config branch on Jul 22, 2022

Conversation

krfricke
Contributor

@krfricke krfricke commented Jul 21, 2022

Why are these changes needed?

We can't currently configure resume options in Ray AIR's Tuner API. Additionally, resume configuration in Tune is generally limited to specific cases.

This PR adds arguments to Tuner.restore() to configure resume behavior.
It also updates tune.run()'s resume configuration to take in suffixes. Users can configure whether they want to restart or resume errored trials, and they can combine this with any of the LOCAL/REMOTE/AUTO keys.

Implementation details: _ResumeConfig is currently a private class that holds this information and is only passed around internally. The user-facing API is the set of Tuner.restore() kwargs; for tune.run() it remains the resume string.
The reason I didn't promote it to an actual config class is that the Tuner.restore() kwargs seem nicer, and we want to avoid introducing new configuration classes for the legacy tune.run() API. We can choose to promote it later if the need arises.
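
To make the new API concrete, here is a rough usage sketch. The keyword arguments follow the docstring added in this PR; the experiment path is a placeholder, not part of this change:

from ray.tune import Tuner

# Restore a previously interrupted experiment from its (placeholder) directory.
tuner = Tuner.restore(
    "~/ray_results/my_experiment",
    resume_unfinished=True,   # continue trials that did not finish
    resume_errored=True,      # re-schedule errored trials and restore their latest checkpoints
    restart_errored=False,    # alternatively, re-schedule errored trials from scratch
)
results = tuner.fit()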

Related issue number

Addresses part of #24918

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 3 commits July 20, 2022 18:39
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke changed the title [wip] [air/tune] Add resume experiment options to Tuner.restore() [air/tune] Add resume experiment options to Tuner.restore() Jul 21, 2022
@krfricke krfricke marked this pull request as ready for review July 21, 2022 10:03
Kai Fricke added 3 commits July 21, 2022 12:09
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
@xwjiang2010
Contributor

xwjiang2010 commented Jul 21, 2022

I wonder why flat kwargs are preferred. Looking across other AIR APIs, we seem to prefer Config objects. Promoting to a Config later may require additional deprecation cycles, so I want to be cautious here.

Btw, how does restore work together with a lazy datasource dataset? The current contract for a Tuner run is that the user is expected to run fully_executed() before passing the dataset to the Tuner. This becomes impossible to do if we are restoring a Tuner, I suppose?

@krfricke
Contributor Author

For the dataset, I think there are several options:

  1. Don't support it for now, and document that
  2. Automatically call fully_executed()
  3. Add a Tuner.execute_datasets() method to Tuner which calls fully_executed() on the datasets
  4. Make datasets accessible (e.g. tuner.datasets) so users can call fully_executed() themselves

I think 1 and 4 are both equally good options. 4 should already be possible by using something like

[ds.fully_executed() for ds in tuner.param_space["datasets"].values()]

right? 3 would be a shortcut for that, and 2 would be a shortcut for doing the same on tuner.param_space.get("datasets", {}).
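
A slightly fuller sketch of option 4, assuming (as the one-liner above does) that the datasets are reachable through the restored tuner's param_space under a "datasets" key; the path and key names are illustrative:

from ray.tune import Tuner

tuner = Tuner.restore("~/ray_results/my_experiment")  # placeholder path

# Option 4: the user triggers execution of any lazy datasets themselves
# before resuming the run.
datasets = tuner.param_space.get("datasets", {})
if datasets:
    tuner.param_space["datasets"] = {
        name: ds.fully_executed() for name, ds in datasets.items()
    }

results = tuner.fit()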

@krfricke
Contributor Author

For kwargs vs. config: the main reason we use configs in other places is to avoid cluttering the argument space of higher-level APIs (constructors), to group similar arguments together, and to still provide type hints. In restore() we don't really need that, as it is a single method and the arguments are naturally grouped and equally typed. I'd prefer to stick with plain arguments here, but let me know if you feel strongly about this. Also cc @richardliaw

@richardliaw
Contributor

General rule for a config: only use one if the flat argument docstring will go past 8 args. So I agree with Kai's approach for now.

@richardliaw
Contributor

For datasets, let's just punt on it; for people who need it, we should ask them to upvote and comment on a specific issue.

resume: One of [True, False, "LOCAL", "REMOTE", "PROMPT", "AUTO"]. Can
be suffixed with one or more of ["+ERRORED", "+ERRORED_ONLY",
"+RESTART_ERRORED", "+RESTART_ERRORED_ONLY"] (e.g. ``AUTO+ERRORED``).
"LOCAL"/True restores the checkpoint from the
Contributor

tune.run is still external facing at this point, right?
My understanding is that users should not need to bother with "xxx+yyy"-style configurations for tune.run.
So the question is: should we keep the docstring unchanged?

Contributor

Or should we consider just having an internal-only resume config argument for tune.run, to spare users the trouble of "xxx+yyy" strings that then get parsed later?

Contributor Author

I actually think we should just support the +suffix strings - users have been asking for e.g. REMOTE+ERRORED in the past, and this way they can also do this with the tune.run API.
In this case I also think it's fine to keep it, as there is a clear conversion between the strings and the internal resume config - so if we later decide to change the tune.run() API to use a config, we can easily deprecate the strings (which we would have to do anyway).
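
For illustration, a minimal sketch of the suffix form on the legacy API; the trainable, search space, and experiment name are placeholders:

from ray import tune

def my_trainable(config):
    # Placeholder trainable that reports a single metric.
    tune.report(score=config["x"])

# "AUTO" resolves the experiment location automatically, and "+ERRORED"
# additionally resumes errored trials from their latest checkpoints.
tune.run(
    my_trainable,
    name="my_experiment",
    config={"x": tune.uniform(0.0, 1.0)},
    resume="AUTO+ERRORED",
)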

Comment on lines 163 to 167
resume_unfinished: If True, will continue to run unfinished trials.
resume_errored: If True, will re-schedule errored trials and try to
restore from their latest checkpoints.
restart_errored: If True, will re-schedule errored trials but force
restarting them frmo scratch (no checkpoint will be loaded).
Contributor

Typo: "frmo" should be "from".

How do these three interact with each other? I think we should document this. For example, what happens if you set resume_errored=True and restart_errored=True at the same time?

Contributor

OK just thought this through. So basically there are 3 groups of trials:

Finished, Unfinished, and Errored.

For Finished, we should skip them. Nothing to do.
For Unfinished, we should provide the options {skip, resume}.
For Errored, we should provide {skip, restart, resume from checkpoint}.

And all 3 groups of trials should be orthogonal to each other.

Can we provide a grouping similar to the above?

Contributor Author

Isn't this exactly what we currently do?

  • Finished trials are always added and show up as TERMINATED in the stats. They are not re-run, but users will have them in their results table.
  • Unfinished: We have resume_unfinished for this.
  • Errored: We have resume_errored and restart_errored for this. If neither is set, errored trials are skipped (they show up as ERROR in the results). If resume_errored is set, they are resumed from their latest checkpoints. If restart_errored is set, they are restarted from scratch (see the sketch below).
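
As a compact sketch of how the flags map onto the three trial groups (behavior paraphrased from the docstring in this PR; the path is a placeholder):

from ray.tune import Tuner

path = "~/ray_results/my_experiment"

# Finished trials are always kept and show up as TERMINATED; the flags only
# affect unfinished and errored trials.
Tuner.restore(path, resume_unfinished=True)                         # continue unfinished, skip errored
Tuner.restore(path, resume_unfinished=True, resume_errored=True)    # also resume errored from checkpoint
Tuner.restore(path, resume_unfinished=True, restart_errored=True)   # restart errored from scratch instead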

Contributor Author

Or is the naming/documentation confusing?

Contributor Author

I've added a more extensive docstring to explain the behavior

Contributor

Sounds good. What I meant was to have one flag for the errored trials, not two.

Collaborator

Big plus! This is a great benefit to us. Before, we sometimes had a lot of pain and crying when e.g. RL workloads were running for days on GPUs and then errored out because of full disks. The runs could not be recovered, as the driver held the data but a new cluster could not use the upload_dir. I will try this out in my next runs on GCP and give some feedback. Thank you all for implementing this! It spares us costs, time, and nerves :)

Collaborator

This should resolve #24918

Collaborator

One thing I remember:

When using trial_dirname_creator in tune.run(), the resumed experiment did not use it, while the errored one did. Has anyone else run into this issue?

Signed-off-by: Kai Fricke <[email protected]>
@richardliaw richardliaw merged commit 6074300 into ray-project:master Jul 22, 2022
@krfricke krfricke deleted the tune/resume-config branch July 22, 2022 16:04
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022