[tune] Avoid scheduler blocking, add reuse_actors optimization #4218
Conversation
@arcelien please try this out
Test FAILed.
Left a couple comments.
if not reset_successful:
if reset_successful:
    trial_executor.restore(
        trial, Checkpoint.from_object(new_state.last_checkpoint))
@arcelien is this ok?
Let me try it on a single GPU machine, both with time multiplexing and also with a small population size.
@richardliaw @arcelien I just pushed a change that should drastically speed up time-multiplexing as well, by reusing actors across different trials. This is a bit of a scary change so I marked the PR as WIP until we're sure this is safe.
if not error and self._cached_runner is None:
    logger.debug("Retaining trial runner {}".format(
        trial.runner))
    self._cached_runner = trial.runner
I suspect GPU trials may run into issues when reusing the same processes because TF doesn't give up the GPU -- unless you've observed otherwise?
There's no difference between reusing a runner (which retains control of the GPU) and stopping the runner and starting a new one (releasing and reacquiring the GPU), right?
Oh, I guess my question is mainly what happens during Trainable._setup(), which actually isn't called in _setup_runner when we're using a cached runner (see above note).
Right, we call reset_trial() instead. I guess this works for PBT, but not necessarily for other algorithms. Unless we add a reset_state() function as well?
logger.debug("Reusing cached runner {}".format( | ||
self._cached_runner)) | ||
existing_runner = self._cached_runner | ||
self._cached_runner = None |
If we're processing a new trial and the trial resources are different, we can't just use the _cached_runner, right?
That's right. I think we can probably assume they are the same though, if the reuse_actors flag is manually activated.
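As an aside, here is a minimal sketch of the kind of resource check being discussed; the helper name and arguments are hypothetical and not necessarily what this PR implements:

import logging

logger = logging.getLogger(__name__)

def pick_runner(cached_runner, cached_resources, trial_resources):
    """Hand back the cached actor only when the new trial requests the
    same resources; otherwise leave the cache untouched.

    Returns a (runner_to_reuse, remaining_cache) pair.
    """
    if cached_runner is not None and cached_resources == trial_resources:
        logger.debug("Reusing cached runner %s", cached_runner)
        return cached_runner, None
    return None, cached_runner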
python/ray/tune/trainable.py
Outdated
@@ -344,15 +344,21 @@ def export_model(self, export_formats, export_dir=None):
        export_dir = export_dir or self.logdir
        return self._export_model(export_formats, export_dir)

    def reset_config(self, new_config):
    def reset_config(self, new_config, reset_state):
Breaking API change. Though I doubt reset_config() is widely used.
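For downstream code, one way to absorb the break is to give the new argument a default. This is a sketch with a made-up MyTrainable, mirroring the two-argument signature in the diff above rather than any final API:

from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        self.lr = config["lr"]
        self.steps = 0

    def _train(self):
        self.steps += 1
        return {"mean_loss": 1.0 / (self.steps * self.lr)}

    # Defaulting reset_state keeps callers of the old one-argument
    # reset_config(new_config) working as well.
    def reset_config(self, new_config, reset_state=True):
        self.lr = new_config["lr"]
        if reset_state:
            self.steps = 0
        return True  # tell Tune the in-place reset succeeded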
Test FAILed.
Updated
Test FAILed.
@@ -310,6 +310,9 @@ def restore(self, checkpoint_path):
            self._restore(checkpoint_dict)
        else:
            self._restore(checkpoint_path)
        self._time_since_restore = 0.0
        self._timesteps_since_restore = 0
        self._iterations_since_restore = 0
good catch
Looks good!
BTW, all Travis tests hang on
Test FAILed.
jenkins retest this please
Test PASSed.
Test FAILed.
jenkins retest this please
Test PASSed.
Test FAILed.
What do these changes do?
The key change is removing the blocking ray.get on the restore path. This avoids blocking the PBT scheduler when restoring trials, which is a large performance bottleneck when restoring very large network weights.
Also, add warnings if fast ray.get paths are ever slow, and warn the user if they didn't implement reset_config(). Incidentally, I think we forgot to restore weights on the reset_config()=True path.
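The non-blocking idea, reduced to a generic Ray sketch rather than the actual Tune code: because an actor's tasks run in submission order, the driver can fire the restore without ray.get and still be sure it completes before the next train call.

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class Runner(object):
    def __init__(self):
        self.weights = None

    def restore_from_object(self, weights):
        self.weights = weights  # potentially hundreds of MB of weights

    def train(self):
        return {"restored": self.weights is not None}

runner = Runner.remote()
weights = b"x" * (10 * 1024 * 1024)  # stand-in for large network weights

# Blocking pattern: the driver (and the scheduler loop running on it)
# stalls until the restore finishes.
ray.get(runner.restore_from_object.remote(weights))

# Non-blocking pattern: submit and move on; the restore is still
# ordered before the subsequent train() call on the same actor.
runner.restore_from_object.remote(weights)
print(ray.get(runner.train.remote()))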
Finally, add a reuse_actors flag that allows actors to be reused across trials if reset_config is implemented. This provides additional speedups if actor creation is slow.
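Putting the pieces together, a hedged usage sketch; whether the flag is ultimately passed as tune.run(..., reuse_actors=True) or elsewhere, and the exact reset_config signature, may differ from what finally ships:

import random

from ray import tune

class ReusableTrainable(tune.Trainable):
    def _setup(self, config):
        self.lr = config["lr"]
        self.iter = 0

    def _train(self):
        self.iter += 1
        return {"mean_accuracy": min(1.0, self.iter * self.lr),
                "done": self.iter >= 5}

    def reset_config(self, new_config, reset_state=True):
        # Invoked instead of a fresh _setup() when this actor is handed
        # the next trial; True signals that the in-place reuse worked.
        self.lr = new_config["lr"]
        return True

tune.run(
    ReusableTrainable,
    num_samples=20,
    reuse_actors=True,  # skip actor teardown/startup between trials
    config={"lr": tune.sample_from(lambda spec: random.uniform(0.001, 0.1))})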