
[air/output] Improve leaked mentions of Tune concepts #35003

Merged: 12 commits merged into ray-project:master on May 8, 2023

Conversation

@krfricke krfricke (Contributor) commented May 3, 2023

Why are these changes needed?

Ray Tune is the execution backend for Ray Train. This means that error/warning messages sometimes use Tune concepts that don't make sense in a single-trial run, such as with Ray Train trainers.

This PR improves three such occurrences:

  1. The insufficient resources warning message has been adjusted for the case where only one trial is run.
  2. The calculation of `max_pending_trials` now uses `search_alg.total_samples` as the minimum, which was an oversight before.
  3. On interrupt of a training run, a `Tuner.restore()` call was suggested, but it should be `Trainer.restore()`.

With these fixes, we will see messages such as:

Insufficient resources

Training has not started in the last 30 seconds.
This could be due to the cluster not having enough resources available.
You asked for 4 CPUs and 2 GPUs, but the cluster only 
has 4 CPUs and 1 GPUs available.
Stop the training and adjust the required resources (e.g. via the 
`ScalingConfig` or `resources_per_trial`, or `num_workers` for rllib), 
or add more resources to your cluster.
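As a rough illustration of acting on that message, here is a sketch only, assuming a TorchTrainer entrypoint; the resource numbers are chosen to fit the 4-CPU/1-GPU cluster from the example:

```python
# Illustrative only: shrink the resource request so it fits a 4 CPU / 1 GPU cluster.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    ...  # user training code goes here


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=1,  # one worker instead of two -> only 1 GPU requested
        use_gpu=True,
        resources_per_worker={"CPU": 2, "GPU": 1},
    ),
)
result = trainer.fit()
```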

Maximum number of pending trials

The message

2023-03-28 13:51:14,756 WARNING trial_runner.py:1576 -- The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (176 CPUs/pending trials). If you're running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the `TUNE_MAX_PENDING_TRIALS_PG` environment variable to the desired maximum number of concurrent trials.

will not turn up for Ray Train runs anymore.
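One plausible reading of the capping described in item 2, as a standalone sketch (function and variable names are hypothetical, not the actual Tune source):

```python
import os


def derive_max_pending_trials(cluster_cpus: int, total_samples: int) -> int:
    """Hypothetical sketch: bound the auto-derived max_pending_trials by the
    search algorithm's total sample count, so a single-trial Train run
    (total_samples == 1) never triggers the 'high pending trials' warning."""
    override = os.environ.get("TUNE_MAX_PENDING_TRIALS_PG")
    if override is not None:
        return int(override)
    return max(1, min(cluster_cpus, total_samples))
```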

Restore

For Train runs, the message will now be:

Experiment has been interrupted, but the most recent state was saved.
Continue running this experiment with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)
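A minimal sketch of following that hint, assuming the original entrypoint was a TorchTrainer and its state can be restored from the saved experiment directory (the path is the one from the example above):

```python
from ray.train.torch import TorchTrainer

# Restore the interrupted run from its experiment directory and continue training.
trainer = TorchTrainer.restore(
    path="/Users/kai/ray_results/debug_execution_restart_dpt7",
)
result = trainer.fit()
```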

We can't get rid of all tune-related concepts just yet, but this is at least an improvement.

Related issue number

Closes #33839 (or parts of it)

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Kai Fricke added 4 commits May 3, 2023 12:49 (including "max" and "msg"), each Signed-off-by: Kai Fricke <[email protected]>
@krfricke krfricke added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label May 3, 2023
@gjoliver gjoliver (Member) left a comment


I like the error message refactoring.
I'd like to discuss the long-term plan for how to differentiate between Trainer and Tuner, though.

trainable=trainable,
param_space=param_space,
run_config=self.run_config,
_trainer_api=True,
Member:

@justinvyu can we have a chat about adding parameters like _trainer_api in Tuner :)

Contributor Author (krfricke):

I think we generally need a good idea of how to pass this information.

To me it feels like there should be some kind of context. We have different requirements for different ML jobs. Even rllib vs. Train have different requirements (e.g. default metrics to show), and maybe even rllib's individual algorithms.

We don't have that story yet, so to unblock this work, I think we can go ahead with the private flags. But yes, we should resolve this (also for telemetry).

@@ -133,12 +133,15 @@ def __init__(
callbacks: Optional[List[Callback]] = None,
metric: Optional[str] = None,
trial_checkpoint_config: Optional[CheckpointConfig] = None,
_trainer_api: bool = False,
Member:

Could we add an explanation of what this is?
Can we compute whether there is only a single Trial ourselves?
It feels like it would be nice to avoid such a trainer parameter on Tuner init.

Contributor Author (krfricke):

I've tried this before, and long story short, it's not very straightforward: we need some of this information pretty early, but the number of trials is only calculated later. It can also lead to confusing situations - e.g. it's totally valid to use `Tuner(trainable, tune_config=TuneConfig(num_samples=1))` for iteration and still expect to see a `Tuner.restore()` hint at the end. Tracking which object actually was the entrypoint saves us from those problems.
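To illustrate that distinction, a simplified, hypothetical sketch (not part of the PR; `my_trainable` is a placeholder):

```python
from ray.tune import Tuner, TuneConfig


def my_trainable(config):
    # Placeholder trainable for illustration only.
    return {"score": 1.0}


# A single-sample Tuner run produces exactly one trial, yet the user entrypoint
# is the Tuner, so a Tuner.restore() hint is still the right thing to show.
tuner = Tuner(my_trainable, tune_config=TuneConfig(num_samples=1))

# A Trainer-driven run also produces exactly one trial, which is why the trial
# count alone cannot distinguish the two cases; the Trainer marks itself as the
# entrypoint instead (the private _trainer_api flag shown in the diff above).
```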

Member:

OK, I just feel like passing such a generic-sounding parameter all the way through so many components (Tuner, BackendExecutor, etc.) just to be able to show the right output message is too heavy.
We should probably have a way out of this if we want to live with it for now. @justinvyu
Another idea: maybe pass this bit through a LoggingConfig or something? We probably need such a config class anyway.

@krfricke krfricke (Contributor, Author) commented May 5, 2023:

We're doing the same at the moment with _tuner_api, so in that sense it's consistent :-D

This is not a logging configuration in my opinion. Users should not "configure" which output/error messages they want to see. It's more of a runtime context.

Ray core has a runtime context object, I think we just need something similar for Ray AIR.
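As a purely hypothetical sketch of what such an AIR-level context could look like (nothing like this is added in this PR; all names below are invented):

```python
from dataclasses import dataclass


@dataclass
class AirExecutionContext:
    """Hypothetical AIR analogue to Ray Core's runtime context: components
    query the entrypoint instead of threading private flags like
    _trainer_api through constructors."""

    entrypoint: str = "tuner"  # e.g. "tuner", "trainer", "rllib"


_context = AirExecutionContext()


def get_air_context() -> AirExecutionContext:
    return _context


def restore_hint(path: str) -> str:
    # Pick the right restore hint based on the entrypoint.
    if get_air_context().entrypoint == "trainer":
        return f'Trainer.restore(path="{path}", ...)'
    return f'Tuner.restore(path="{path}", ...)'
```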

@scottsun94 scottsun94 (Contributor) commented May 4, 2023

> Restore
>
> For Train runs, will now be:
>
> Experiment has been interrupted, but the most recent state was saved.
> Continue running this experiment with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)

Is Experiment a concept for Train users? Should it just be

**Training** has been interrupted, but the most recent state was saved.
**Resume training** with: Trainer.restore(path="/Users/kai/ray_results/debug_execution_restart_dpt7", ...)

Others LGTM. Thanks!

@krfricke krfricke (Contributor, Author) commented May 4, 2023

Updated, thanks!

Merge commit (conflicts resolved in python/ray/tune/impl/tuner_internal.py)
@gjoliver gjoliver (Member) left a comment:

Sorry I missed this.
Let's add a TODO and unblock the output work.

@krfricke krfricke merged commit 3c2d77e into ray-project:master May 8, 2023
@krfricke krfricke deleted the air/output/improve-tune-leaks branch May 8, 2023 09:52
architkulkarni pushed a commit to architkulkarni/ray that referenced this pull request May 16, 2023
Linked issue: [AIR output] Tune concepts are leaked when I only used a trainer