
[tune] Avoid scheduler blocking, add reuse_actors optimization #4218

Merged: 17 commits into ray-project:master on Mar 13, 2019

Conversation

ericl
Contributor

@ericl ericl commented Mar 2, 2019

What do these changes do?

The key change is removing the ray.get here:

                ray.get(trial.runner.restore_from_object.remote(value))	

This avoids blocking the PBT scheduler while a trial restores; that blocking was a significant performance bottleneck when restoring very large network weights.
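In miniature, the change looks like this (a standalone sketch with a stand-in actor, not the actual tune code; the point is only that dropping the ray.get lets the driver keep scheduling while the restore runs on the actor):

    import time

    import ray

    ray.init()


    @ray.remote
    class Runner:
        def restore_from_object(self, weights):
            # Stand-in for deserializing and loading large network weights.
            time.sleep(5)
            self.weights = weights


    runner = Runner.remote()
    weights = b"x" * (100 * 1024 * 1024)  # large checkpoint blob

    # Before: the driver (and with it the PBT scheduler loop) blocks for ~5s.
    ray.get(runner.restore_from_object.remote(weights))

    # After: the call is queued on the actor and the driver moves on; later
    # calls to the same actor run only after the restore has finished.
    runner.restore_from_object.remote(weights)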

Also, add warnings if fast ray.get paths are ever slow, and warn the user if they didn't implement reset_config(). Incidentally, I think we forgot to restore weights on the reset_config()=True path.

Finally, add a reuse_actors flag that allows actors to be reused across trials if reset_config is implemented. This provides additional speedups if actor creation is slow.
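For illustration, opting in might look roughly like this (a sketch: build_model and train_one_epoch are placeholders, the flag is assumed to be passed through tune.run, and the exact reset_config signature is discussed further down in this thread):

    from ray import tune


    class MyTrainable(tune.Trainable):
        def _setup(self, config):
            self.lr = config["lr"]
            self.model = build_model()  # expensive per-actor startup work

        def _train(self):
            return {"mean_accuracy": train_one_epoch(self.model, self.lr)}

        def reset_config(self, new_config):
            # Called on the live actor instead of tearing it down and
            # starting a new one; return True to signal the reset worked.
            self.lr = new_config["lr"]
            return True


    tune.run(
        MyTrainable,
        config={"lr": tune.grid_search([0.01, 0.001])},
        reuse_actors=True,  # only safe because reset_config() is implemented
    )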

@ericl
Contributor Author

ericl commented Mar 2, 2019

@arcelien please try this out

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12458/

python/ray/tune/util.py (outdated review thread, resolved)
Contributor

@richardliaw richardliaw left a comment

Left a couple comments.

-            if not reset_successful:
+            if reset_successful:
+                trial_executor.restore(
+                    trial, Checkpoint.from_object(new_state.last_checkpoint))
Contributor

@arcelien is this ok?

Contributor

Let me try it on a single-GPU machine, both with time multiplexing and also with a small population size.

@ericl ericl changed the title from "[tune] Avoid blocking PBT scheduler when restoring trials" to "[WIP] [tune] Avoid blocking PBT scheduler when restoring trials" on Mar 3, 2019
@ericl
Contributor Author

ericl commented Mar 3, 2019

@richardliaw @arcelien I just pushed a change that should drastically speed up time-multiplexing as well, by reusing actors across different trials. This is a bit of a scary change so I marked the PR as WIP until we're sure this is safe.

        if not error and self._cached_runner is None:
            logger.debug("Retaining trial runner {}".format(
                trial.runner))
            self._cached_runner = trial.runner
Contributor

I suspect GPU trials may run into issues when reusing the same process, because TF doesn't give up the GPU -- unless you've observed otherwise?

Contributor Author

There's no difference between reusing a runner that retains control of the GPU, and stopping the runner and starting a new one (releasing / reacquiring the GPU), right?

Contributor

Oh, I guess my question is mainly what happens during Trainable._setup(), which actually isn't called in _setup_runner when we're using a cached runner (see above note).

Contributor Author

Right, we call reset_trial() instead. I guess this works for PBT, but not necessarily for other algorithms. Unless we add a reset_state() function as well?
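For readers following the thread, the reuse path being discussed has roughly this shape (a hypothetical sketch, not the real executor code; cached_runner and start_new_runner are illustrative names):

    import ray


    def setup_runner(executor, trial):
        # When a cached runner exists, reset_config() is called on the live
        # actor instead of creating a fresh one, so Trainable._setup() never
        # runs for the trial that inherits the actor.
        if executor.cached_runner is not None:
            runner, executor.cached_runner = executor.cached_runner, None
            if ray.get(runner.reset_config.remote(trial.config)):
                return runner
            # Reset failed: fall through to the normal stop/start path.
        return executor.start_new_runner(trial)  # this path calls _setup()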

logger.debug("Reusing cached runner {}".format(
self._cached_runner))
existing_runner = self._cached_runner
self._cached_runner = None
Contributor

If we're processing a new trial and the trial resources are different, we can't just use the _cached_runner, right?

Contributor Author

That's right. I think we can probably assume they are the same though, if the reuse_actors flag is manually activated.

@ericl ericl changed the title from "[WIP] [tune] Avoid blocking PBT scheduler when restoring trials" to "[tune] Avoid blocking PBT scheduler when restoring trials" on Mar 3, 2019
@@ -344,15 +344,21 @@ def export_model(self, export_formats, export_dir=None):
         export_dir = export_dir or self.logdir
         return self._export_model(export_formats, export_dir)

-    def reset_config(self, new_config):
+    def reset_config(self, new_config, reset_state):
Contributor Author

Breaking API change. Though I doubt reset_config() is widely used.
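For anyone overriding it, the migration is mechanical. A sketch (the intended semantics of reset_state aren't spelled out in this thread, so a conservative override can accept the argument and ignore it):

    from ray.tune import Trainable


    # Before this change:
    class MyTrainable(Trainable):
        def reset_config(self, new_config):
            self.config = new_config
            return True


    # After this change (per the diff above), the override also receives
    # reset_state; ignoring it preserves the previous behavior.
    class MyTrainable(Trainable):
        def reset_config(self, new_config, reset_state):
            self.config = new_config
            return True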

@ericl ericl changed the title from "[tune] Avoid blocking PBT scheduler when restoring trials" to "[tune] Avoid scheduler blocking, add reuse_actors optimization" on Mar 3, 2019
@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12489/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12488/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12490/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12491/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12493/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12512/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12518/

@ericl
Contributor Author

ericl commented Mar 7, 2019

Updated

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12654/

@@ -310,6 +310,9 @@ def restore(self, checkpoint_path):
             self._restore(checkpoint_dict)
         else:
             self._restore(checkpoint_path)
+        self._time_since_restore = 0.0
+        self._timesteps_since_restore = 0
+        self._iterations_since_restore = 0
Contributor

good catch

Contributor

@richardliaw richardliaw left a comment

Looks good!

@richardliaw
Contributor

BTW, all Travis tests hang on python/ray/tune/tests/test_actor_reuse.py::ActorReuseTest::testTrialReuseDisabled

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12732/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12737/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/9/

@ericl
Contributor Author

ericl commented Mar 13, 2019

jenkins retest this please

@AmplabJenkins: Test PASSed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/64/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12821/

@ericl
Contributor Author

ericl commented Mar 13, 2019

jenkins retest this please

@AmplabJenkins: Test PASSed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/69/

@AmplabJenkins: Test FAILed. Build results (CI server access needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/12831/

@ericl ericl merged commit d5f4698 into ray-project:master Mar 13, 2019