
[air/output] Improve leaked mentions of Tune concepts #35003

Merged: 12 commits merged into ray-project:master on May 8, 2023

Conversation

@krfricke krfricke (Contributor) commented May 3, 2023

Why are these changes needed?

Ray Tune is the execution backend for Ray Train. This means that error/warning messages sometimes use Tune concepts that don't make sense in a single-trial run, such as with Ray Train trainers.

This PR improves three such occurrences:

  1. The insufficient resources warning message has been adjusted for the case where only one trial is run.
  2. The calculation of `max_pending_trials` now uses `search_alg.total_samples` as the minimum, which was an oversight before.
  3. On interrupt of a training run, a `Tuner.restore()` call was suggested, but it should be `Trainer.restore()`.

With these fixes, we will see messages such as:

Insufficient resources

Training has not started in the last 30 seconds.
This could be due to the cluster not having enough resources available.
You asked for 4 CPUs and 2 GPUs, but the cluster only 
has 4 CPUs and 1 GPUs available.
Stop the training and adjust the required resources (e.g. via the 
`ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), 
or add more resources to your cluster.
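As a rough illustration of acting on that message, here is a sketch only, assuming a TorchTrainer entrypoint; the resource numbers are chosen to fit the 4-CPU/1-GPU cluster from the example:

```python
# Illustrative only: shrink the resource request so it fits a 4 CPU / 1 GPU cluster.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    ...  # user training code goes here


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=1,  # one worker instead of two -> only 1 GPU requested
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    ),
)
result = trainer.fit()
```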

Maximum number of pending trials

The message

2023-03-28 13:51:14,756 WARNING trial_runner.py:1576 -- The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (176 CPUs/pending trials). If you're running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the `TUNE_MAX_PENDING_TRIALS_PG` environment variable to the desired maximum number of concurrent trials.

will not turn up for Ray Train runs anymore.
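One plausible reading of the capping described in item 2, as a standalone sketch (function and variable names are hypothetical, not the actual Tune source):

```python
import os


def derive_max_pending_trials(cluster_cpus: int, total_samples: int) -> int:
    """Hypothetical sketch: bound the auto-derived max_pending_trials by the
    search algorithm's total sample count, so a single-trial Train run
    (total_samples == 1) never triggers the 'high pending trials' warning."""
    override = os.environ.get("TUNE_MAX_PENDING_TRIALS_PG")
    if override is not None:
        return int(override)
    return max(1, min(cluster_cpus, total_samples))
```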

Restore

For Train runs, the message will now be:

Experiment has been interrupted, but the most recent state was saved.
Continue running this experiment with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)
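A minimal sketch of following that hint, assuming the original entrypoint was a TorchTrainer and its state can be restored from the saved experiment directory (the path is the one from the example above):

```python
from ray.train.torch import TorchTrainer

# Restore the interrupted run from its experiment directory and continue training.
trainer = TorchTrainer.restore(
    path="/Users/kai/ray_results/debug_execution_restart_dpt7",
)
result = trainer.fit()
```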

We can't get rid of all tune-related concepts just yet, but this is at least an improvement.

Related issue number

Closes #33839 (or parts of it)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 4 commits May 3, 2023 12:49 (including "max" and "msg"), each Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 3, 2023
@gjoliver gjoliver (Member) left a comment


I like the error message refactoring.
I'd like to discuss the long-term plan for how to differentiate between Trainer and Tuner, though.

trainable=trainable,
param_space=param_space,
run_config=self.run_config,
_trainer_api=True,
Member:

@justinvyu can we have a chat about adding parameters like _trainer_api in Tuner :)

Contributor Author (krfricke):

I think we generally need a good idea of how to pass this information.

To me it feels like there should be some kind of context. We have different requirements for different ML jobs. Even rllib vs. Train have different requirements (e.g. default metrics to show), and maybe even rllib's individual algorithms.

We don't have that story yet, so to unblock this work, I think we can go ahead with the private flags. But yes, we should resolve this (also for telemetry).

@@ -133,12 +133,15 @@ def __init__(
callbacks: Optional[List[Callback]] = None,
metric: Optional[str] = None,
trial_checkpoint_config: Optional[CheckpointConfig] = None,
_trainer_api: bool = False,
Member:

Could we add an explanation of what this is?
Can we compute whether there is only a single Trial ourselves?
It feels like it would be nice to avoid such a trainer parameter on Tuner init.

Contributor Author (krfricke):

I've tried this before, and long story short, it's not very straightforward: we need some of this information pretty early, but the number of trials is only calculated later. It can also lead to confusing situations - e.g. it's totally valid to use `Tuner(trainable, tune_config=TuneConfig(num_samples=1))` for iteration and still expect to see a `Tuner.restore()` hint at the end. Tracking which object actually was the entrypoint saves us from those problems.
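To illustrate that distinction, a simplified, hypothetical sketch (not part of the PR; `my_trainable` is a placeholder):

```python
from ray.tune import Tuner, TuneConfig


def my_trainable(config):
    # Placeholder trainable for illustration only.
    return {"score": 1.0}


# A single-sample Tuner run produces exactly one trial, yet the user entrypoint
# is the Tuner, so a Tuner.restore() hint is still the right thing to show.
tuner = Tuner(my_trainable, tune_config=TuneConfig(num_samples=1))

# A Trainer-driven run also produces exactly one trial, which is why the trial
# count alone cannot distinguish the two cases; the Trainer marks itself as the
# entrypoint instead (the private _trainer_api flag shown in the diff above).
```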

Member:

OK, I just feel like passing such a generic-sounding parameter all the way through so many components (Tuner, BackendExecutor, etc.) just to be able to show the right output message is too heavy.
We should probably have a way out of this if we want to live with it for now. @justinvyu
Another idea: maybe pass this bit through a LoggingConfig or something? We probably need such a config class anyway.

@krfricke krfricke (Contributor, Author) commented May 5, 2023:

We're doing the same at the moment with _tuner_api, so in that sense it's consistent :-D

This is not a logging configuration in my opinion. Users should not "configure" which output/error messages they want to see. It's more of a runtime context.

Ray core has a runtime context object, I think we just need something similar for Ray AIR.
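As a purely hypothetical sketch of what such an AIR-level context could look like (nothing like this is added in this PR; all names below are invented):

```python
from dataclasses import dataclass


@dataclass
class AirExecutionContext:
    """Hypothetical AIR analogue to Ray Core's runtime context: components
    query the entrypoint instead of threading private flags like
    _trainer_api through constructors."""

    entrypoint: str = "tuner"  # e.g. "tuner", "trainer", "rllib"


_context = AirExecutionContext()


def get_air_context() -> AirExecutionContext:
    return _context


def restore_hint(path: str) -> str:
    # Pick the right restore hint based on the entrypoint.
    if get_air_context().entrypoint == "trainer":
        return f'Trainer.restore(path="{path}", ...)'
    return f'Tuner.restore(path="{path}", ...)'
```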

@scottsun94 scottsun94 (Contributor) commented May 4, 2023

> Restore
>
> For Train runs, will now be:
>
> Experiment has been interrupted, but the most recent state was saved.
> Continue running this experiment with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)

Is Experiment a concept for Train users? Should it just be

**Training** has been interrupted, but the most recent state was saved.
**Resume training** with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)

Others LGTM. Thanks!

@krfricke krfricke (Contributor, Author) commented May 4, 2023

Updated, thanks!

Merge commit (conflicts resolved in python/ray/tune/impl/tuner_internal.py)
@gjoliver gjoliver (Member) left a comment:

Sorry I missed this.
Let's add a TODO and unblock the output work.

@krfricke krfricke merged commit 3c2d77e into ray-project:master May 8, 2023
@krfricke krfricke deleted the air/output/improve-tune-leaks branch May 8, 2023 09:52
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
Linked issue: [AIR output] Tune concepts are leaked when I only used a trainer