[air/output] Improve leaked mentions of Tune concepts #35003
Conversation
I like the error message refactoring.
I do want to discuss the long-term plan for how to differentiate between the trainer and the tuner, though.
trainable=trainable,
param_space=param_space,
run_config=self.run_config,
_trainer_api=True,
@justinvyu can we have a chat about adding parameters like `_trainer_api` in Tuner? :)
I think we generally need a good idea of how to pass this information.
To me it feels like there should be some kind of context. We have different requirements for different ML jobs. Even rllib vs. Train have different requirements (e.g. which default metrics to show), and maybe even rllib's individual algorithms do.
We don't have that story yet, so to unblock this work, I think we can go ahead with the private flags. But yes, we should resolve this (also for telemetry).
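To make that concrete, here is a minimal sketch of what such a context could carry. `OutputContext` and all of its fields are made-up names for illustration only and are not part of Ray:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch only: a per-job "output context" bundling the entrypoint
# and job-specific display defaults (e.g. which metrics to show).
# None of these names exist in Ray.
@dataclass
class OutputContext:
    entrypoint: str = "tuner"  # e.g. "trainer", "tuner", "rllib"
    default_metrics: List[str] = field(default_factory=list)
    single_run: bool = False

# Each AIR entrypoint could construct its own context instead of threading
# private flags such as _trainer_api through every component.
train_context = OutputContext(
    entrypoint="trainer", default_metrics=["loss"], single_run=True
)
rllib_context = OutputContext(
    entrypoint="rllib", default_metrics=["episode_reward_mean"]
)
```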
@@ -133,12 +133,15 @@ def __init__(
    callbacks: Optional[List[Callback]] = None,
    metric: Optional[str] = None,
    trial_checkpoint_config: Optional[CheckpointConfig] = None,
    _trainer_api: bool = False,
Could we get an explanation of what this is?
Can we compute whether there is only a single Trial ourselves? It feels like it would be nice to be able to avoid such a trainer-specific parameter on Tuner init.
I've tried this before, and long story short, it's not very straightforward: we need some of the information pretty early, but the number of trials is only calculated later. It can also lead to confusing situations. For example, it's totally valid to use `Tuner(trainable, tune_config=TuneConfig(num_samples=1))` for iteration and still expect to see a `Tuner.restore()` hint at the end. Tracking which object was actually the entrypoint saves us from those problems.
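For illustration, this is the kind of run being described: a perfectly normal Tune run that happens to produce exactly one trial, where the `Tuner.restore()` hint is still the right one. The `trainable` below is just a placeholder:

```python
from ray.tune import TuneConfig, Tuner

def trainable(config):
    # Placeholder training function used only for illustration; returning a
    # dict reports the final result of the trial.
    return {"loss": config["lr"]}

# A valid Tune run that happens to have exactly one trial. Counting trials
# would misclassify this as a Train-style run, even though the correct hint
# on interruption is still Tuner.restore().
tuner = Tuner(
    trainable,
    param_space={"lr": 0.01},
    tune_config=TuneConfig(num_samples=1),
)
results = tuner.fit()
```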
OK, I just feel like passing such a generic-sounding parameter all the way through so many components (Tuner, BackendExecutor, etc.) just to be able to show the right output message is a bit too heavy.
We should probably have a way out of this if we want to live with it for now. @justinvyu
Another idea is to pass this bit through a LoggingConfig or something similar; we probably need such a config class anyway?
We're doing the same at the moment with `_tuner_api`, so in that sense it's consistent :-D
This is not a logging configuration in my opinion. Users should not "configure" which output/error messages they want to see. It's more of a runtime context.
Ray core has a runtime context object; I think we just need something similar for Ray AIR.
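A rough sketch of what an AIR-level counterpart could look like, loosely modeled on Ray core's `ray.get_runtime_context()`. `AirEntrypointContext`, `get_entrypoint_context`, and `entrypoint` are hypothetical names, not an existing Ray API:

```python
from contextlib import contextmanager
from dataclasses import dataclass

# Hypothetical AIR-level runtime context. None of these names exist in Ray today.
@dataclass
class AirEntrypointContext:
    entrypoint: str = "tuner"  # "trainer" or "tuner"

_current_context = AirEntrypointContext()

def get_entrypoint_context() -> AirEntrypointContext:
    """Return the active entrypoint context (sketch only)."""
    return _current_context

@contextmanager
def entrypoint(name: str):
    """Temporarily mark which API object is driving the run."""
    global _current_context
    previous, _current_context = _current_context, AirEntrypointContext(name)
    try:
        yield
    finally:
        _current_context = previous

# A Trainer could wrap its fit() body in `with entrypoint("trainer"): ...`,
# and output code could call get_entrypoint_context() instead of receiving a
# _trainer_api flag through every constructor.
```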
Is
Others LGTM. Thanks!
Updated, thanks!
Sorry I missed this.
Let's add a TODO and unblock the output work.
Why are these changes needed?
Ray Tune is the execution backend for Ray Train. This means that error/warning messages sometimes use Tune concepts that don't make sense in a single-trial run, such as with Ray Train trainers.
This PR improves three such occurrences:
1. The insufficient resources warning message has been adjusted for the case where only one trial is run.
2. The calculation of `max_pending_trials` now uses `search_alg.total_samples` as the minimum, which was an oversight before.
3. On interrupt of a training run, a `Tuner.restore()` message was suggested, but it should be `Trainer.restore()`.
With these fixes, the messages change as follows:
Insufficient resources
Maximum number of pending trials
The message will not turn up for Ray Train runs anymore.
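For item 2 above, here is a hedged sketch of the intended calculation. `TUNE_MAX_PENDING_TRIALS_PG` is the real override variable, but the function, its signature, and the default heuristic are illustrative only, not the actual Ray Tune code:

```python
import os

def compute_max_pending_trials(total_samples: int, cluster_cpus: int) -> int:
    # Respect an explicit user override; otherwise derive a default from the
    # cluster size, but never exceed the number of trials the search
    # algorithm will actually produce (e.g. 1 for a Train run).
    override = os.environ.get("TUNE_MAX_PENDING_TRIALS_PG", "auto")
    if override != "auto":
        return int(override)
    auto_default = max(16, cluster_cpus)  # illustrative heuristic only
    return min(auto_default, total_samples)
```

With `total_samples == 1`, as in a Train run, only a single pending trial is ever scheduled, so the auto-set pending-trials message no longer applies.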
Restore
For Train runs, the restore hint will now reference `Trainer.restore()` instead of `Tuner.restore()`.
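For item 3, a minimal sketch of how the hint could be selected based on the entrypoint flag discussed above; the function name and message wording are illustrative, not the actual Ray output code:

```python
def restore_hint(experiment_path: str, entrypoint_is_trainer: bool) -> str:
    # Pick the hint that matches the API the user actually called, so a Ray
    # Train user is not told to call Tuner.restore(). Wording is illustrative.
    if entrypoint_is_trainer:
        return (
            "Training was interrupted. You can resume it with "
            f'Trainer.restore("{experiment_path}").'
        )
    return (
        "The experiment was interrupted. You can resume it with "
        f'Tuner.restore("{experiment_path}").'
    )
```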
We can't get rid of all tune-related concepts just yet, but this is at least an improvement.
Related issue number
Closes #33839 (or parts of it)
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.