[tune] Avoid scheduler blocking, add reuse_actors optimization #4218
Conversation
@arcelien please try this out
Test FAILed.
Left a couple comments.
if not reset_successful:
if reset_successful:
    trial_executor.restore(
        trial, Checkpoint.from_object(new_state.last_checkpoint))
@arcelien is this ok?
Let me try it on a single GPU machine, both with time multiplexing and also with a small population size.
@richardliaw @arcelien I just pushed a change that should drastically speed up time-multiplexing as well, by reusing actors across different trials. This is a bit of a scary change so I marked the PR as WIP until we're sure this is safe.
if not error and self._cached_runner is None:
    logger.debug("Retaining trial runner {}".format(
        trial.runner))
    self._cached_runner = trial.runner
I suspect GPU trials may run into issues when reusing the same processes because TF doesn't give up the GPU -- unless you've observed otherwise?
There's no difference between reusing a runner (which retains control of the GPU) and stopping the runner and starting a new one (releasing and reacquiring the GPU), right?
Oh, I guess my question is mainly what happens during Trainable._setup(), which actually isn't called in _setup_runner when we're using a cached runner (see above note).
Right, we call reset_trial() instead. I guess this works for PBT, but not necessarily for other algorithms. Unless we add a reset_state() function as well?
logger.debug("Reusing cached runner {}".format( | ||
self._cached_runner)) | ||
existing_runner = self._cached_runner | ||
self._cached_runner = None |
If we're processing a new trial and the trial resources are different, we can't just use the _cached_runner, right?
That's right. I think we can probably assume they are the same though, if the reuse_actors flag is manually activated.
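As an aside, here is a minimal sketch of the kind of resource check being discussed; the helper name and arguments are hypothetical and not necessarily what this PR implements:

import logging

logger = logging.getLogger(__name__)

def pick_runner(cached_runner, cached_resources, trial_resources):
    """Hand back the cached actor only when the new trial requests the
    same resources; otherwise leave the cache untouched.

    Returns a (runner_to_reuse, remaining_cache) pair.
    """
    if cached_runner is not None and cached_resources == trial_resources:
        logger.debug("Reusing cached runner %s", cached_runner)
        return cached_runner, None
    return None, cached_runner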
python/ray/tune/trainable.py
Outdated
@@ -344,15 +344,21 @@ def export_model(self, export_formats, export_dir=None):
        export_dir = export_dir or self.logdir
        return self._export_model(export_formats, export_dir)

    def reset_config(self, new_config):
    def reset_config(self, new_config, reset_state):
Breaking API change. Though I doubt reset_config() is widely used.
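For downstream code, one way to absorb the break is to give the new argument a default. This is a sketch with a made-up MyTrainable, mirroring the two-argument signature in the diff above rather than any final API:

from ray.tune import Trainable

class MyTrainable(Trainable):
    def _setup(self, config):
        self.lr = config["lr"]
        self.steps = 0

    def _train(self):
        self.steps += 1
        return {"mean_loss": 1.0 / (self.steps * self.lr)}

    # Defaulting reset_state keeps callers of the old one-argument
    # reset_config(new_config) working as well.
    def reset_config(self, new_config, reset_state=True):
        self.lr = new_config["lr"]
        if reset_state:
            self.steps = 0
        return True  # tell Tune the in-place reset succeeded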
Test FAILed.
Updated
Test FAILed.
@@ -310,6 +310,9 @@ def restore(self, checkpoint_path):
            self._restore(checkpoint_dict)
        else:
            self._restore(checkpoint_path)
        self._time_since_restore = 0.0
        self._timesteps_since_restore = 0
        self._iterations_since_restore = 0
good catch
Looks good!
BTW, all Travis tests hang on
Test FAILed.
jenkins retest this please
Test PASSed.
Test FAILed.
jenkins retest this please
Test PASSed.
Test FAILed.
What do these changes do?
The key change is removing the blocking ray.get on the restore path. This avoids blocking the PBT scheduler when restoring trials, which is a large performance bottleneck when restoring very large network weights.
Also, add warnings if fast ray.get paths are ever slow, and warn the user if they didn't implement reset_config(). Incidentally, I think we forgot to restore weights on the reset_config()=True path.
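The non-blocking idea, reduced to a generic Ray sketch rather than the actual Tune code: because an actor's tasks run in submission order, the driver can fire the restore without ray.get and still be sure it completes before the next train call.

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class Runner(object):
    def __init__(self):
        self.weights = None

    def restore_from_object(self, weights):
        self.weights = weights  # potentially hundreds of MB of weights

    def train(self):
        return {"restored": self.weights is not None}

runner = Runner.remote()
weights = b"x" * (10 * 1024 * 1024)  # stand-in for large network weights

# Blocking pattern: the driver (and the scheduler loop running on it)
# stalls until the restore finishes.
ray.get(runner.restore_from_object.remote(weights))

# Non-blocking pattern: submit and move on; the restore is still
# ordered before the subsequent train() call on the same actor.
runner.restore_from_object.remote(weights)
print(ray.get(runner.train.remote()))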
Finally, add a reuse_actors flag that allows actors to be reused across trials if reset_config is implemented. This provides additional speedups if actor creation is slow.
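Putting the pieces together, a hedged usage sketch; whether the flag is ultimately passed as tune.run(..., reuse_actors=True) or elsewhere, and the exact reset_config signature, may differ from what finally ships:

import random

from ray import tune

class ReusableTrainable(tune.Trainable):
    def _setup(self, config):
        self.lr = config["lr"]
        self.iter = 0

    def _train(self):
        self.iter += 1
        return {"mean_accuracy": min(1.0, self.iter * self.lr),
                "done": self.iter >= 5}

    def reset_config(self, new_config, reset_state=True):
        # Invoked instead of a fresh _setup() when this actor is handed
        # the next trial; True signals that the in-place reuse worked.
        self.lr = new_config["lr"]
        return True

tune.run(
    ReusableTrainable,
    num_samples=20,
    reuse_actors=True,  # skip actor teardown/startup between trials
    config={"lr": tune.sample_from(lambda spec: random.uniform(0.001, 0.1))})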