[tune] Change the log syncing behavior #4450

hartikainen · 2019-03-21T22:58:19Z

What do these changes do?

Refactor the log sync behavior.

TODOs:

Related issue number

@richardliaw

AmplabJenkins · 2019-03-21T22:58:39Z

Can one of the admins verify this patch?

AmplabJenkins · 2019-03-21T23:39:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13169/
Test FAILed.

AmplabJenkins · 2019-04-04T20:42:07Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13545/
Test FAILed.

AmplabJenkins · 2019-04-05T00:10:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13575/
Test FAILed.

AmplabJenkins · 2019-04-05T00:10:04Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13576/
Test FAILed.

AmplabJenkins · 2019-04-30T23:18:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14020/
Test FAILed.

AmplabJenkins · 2019-05-01T00:41:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14021/
Test FAILed.

AmplabJenkins · 2019-05-16T21:22:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14254/
Test FAILed.

AmplabJenkins · 2019-05-16T22:34:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14260/
Test FAILed.

AmplabJenkins · 2019-05-16T22:57:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14262/
Test FAILed.

AmplabJenkins · 2019-05-16T23:03:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14264/
Test FAILed.

AmplabJenkins · 2019-05-17T01:03:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14274/
Test FAILed.

AmplabJenkins · 2019-07-01T19:18:50Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15004/
Test PASSed.

AmplabJenkins · 2019-07-01T19:27:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15003/
Test PASSed.

hartikainen · 2019-07-02T03:18:03Z

doc/source/tune-usage.rst

@@ -259,7 +259,7 @@ of a trial, you can additionally set the checkpoint_at_end to True. An example i
 Recovering From Failures (Experimental)
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed with ``resume=True``. The default setting of ``resume=False`` creates a new experiment, and ``resume="prompt"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
+Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restore the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.


Suggested change

-Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restore the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
+Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
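
For readers following the review, the resume options described in this paragraph can be restated as a small stand-alone sketch. It does not import Ray. `resolve_resume` and `RESUME_MODES` are hypothetical names used only for illustration, and the mapping merely mirrors the documentation text (``True`` is treated like ``"LOCAL"``, ``False`` means a new experiment). Tune's actual implementation may well differ.

```python
import os

# Hypothetical constant: the string modes named in the documentation text.
RESUME_MODES = ("LOCAL", "REMOTE", "PROMPT")


def resolve_resume(resume, local_dir, experiment_name):
    """Return (mode, restore_path) for a documented ``resume`` value.

    Illustrative only, not Tune's code. ``mode`` is None when a new
    experiment should be created. For "REMOTE" the path is where the
    experiment would be restored after the upload dir has been synced
    down; for "PROMPT" it is where it would be restored if the user agrees.
    """
    if resume is False:
        # Default: start a new experiment, nothing to restore.
        return None, None
    if resume is True:
        # Per the docs, True behaves the same as "LOCAL".
        resume = "LOCAL"
    if resume not in RESUME_MODES:
        raise ValueError(
            "resume must be one of True, False, 'LOCAL', 'REMOTE', "
            "'PROMPT'; got {!r}".format(resume))
    return resume, os.path.join(local_dir, experiment_name)
```

Under these assumptions, `resolve_resume(True, local_dir, name)` and `resolve_resume("LOCAL", local_dir, name)` give the same result, which is the equivalence the documentation states.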

hartikainen · 2019-07-02T05:28:12Z

Looks good to me!

AmplabJenkins · 2019-07-02T11:59:32Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15023/
Test PASSed.

richardliaw · 2019-07-02T19:05:47Z

Awesome!

richardliaw · 2019-07-02T19:07:52Z

python/ray/tune/tune.py

-        else:
-            logger.info("Tip: to resume incomplete experiments, "
-                        "pass resume='prompt' or resume=True to run()")
+def _get_resume_path(local_checkpoint_dir, remote_checkpoint_dir):


this is extraneous

richardliaw · 2019-07-02T19:09:17Z

python/ray/tune/trial_runner.py

+    def _validate_resume(self, resume_type):
+        """
+        Args:
+            resume_type: One of "REMOTE", "LOCAL", "PROMPT".


Suggested change

-            resume_type: One of "REMOTE", "LOCAL", "PROMPT".
+            resume_type: One of "REMOTE", "LOCAL", True, "PROMPT".

richardliaw · 2019-07-02T19:09:49Z

python/ray/tune/trial_runner.py

-        self._metadata_checkpoint_dir = metadata_checkpoint_dir
+        self._local_checkpoint_dir = local_checkpoint_dir
+
+        # TODO(rliaw): This may fail


Suggested change

-        # TODO(rliaw): This may fail

richardliaw · 2019-07-02T19:12:28Z

python/ray/tune/syncer.py

+    Args:
+        local_dir: Source directory for syncing.
+        remote_dir: Target directory for syncing. If None,
+            returns NoopSyncer.


Suggested change

-            returns NoopSyncer.
+            returns BaseSyncer with a noop.
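
The idea behind this suggestion (return an ordinary syncer whose sync function does nothing, instead of a dedicated no-op class) can be sketched as follows. This is a minimal illustration, not the code in `python/ray/tune/syncer.py`: `SketchSyncer`, `make_syncer`, and `noop` are hypothetical names, and the `sync_function(source, target)` signature is an assumption.

```python
def noop(source, target):
    """Sync function that intentionally does nothing."""
    return None


class SketchSyncer(object):
    """Hypothetical stand-in for a base syncer; not Ray's class."""

    def __init__(self, local_dir, remote_dir, sync_function=noop):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.sync_function = sync_function

    def sync_up(self):
        # Copy local -> remote using whatever function was supplied.
        self.sync_function(self.local_dir, self.remote_dir)

    def sync_down(self):
        # Copy remote -> local using whatever function was supplied.
        self.sync_function(self.remote_dir, self.local_dir)


def make_syncer(local_dir, remote_dir=None, sync_function=noop):
    """Return a syncer; with no remote_dir, its sync calls are no-ops."""
    if remote_dir is None:
        return SketchSyncer(local_dir, None, noop)
    return SketchSyncer(local_dir, remote_dir, sync_function)
```

With this shape, callers can invoke `sync_up()` and `sync_down()` unconditionally, without first checking whether a remote directory was configured.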

AmplabJenkins · 2019-07-02T21:58:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15030/
Test PASSed.

AmplabJenkins · 2019-07-02T22:21:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15032/
Test FAILed.

AmplabJenkins · 2019-07-03T01:45:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15040/
Test FAILed.

Change the log syncing behavior

6527abf

Merge branch 'master' into bunch-of-log-sync-fixes

e760991

richardliaw added 3 commits April 4, 2019 15:03

fix up abstractions for syncer

2ada6db

Finished checkpoint syncing

26fe09b

Code

f45110d

Set of changes to get things running

ee5c61d

Fixes for log syncing

045bfa4

richardliaw self-assigned this May 2, 2019

richardliaw mentioned this pull request May 10, 2019

[tune] tf.summary.FileWriter extensibility for custom TensorBoard metrics #4762

Closed

Merge branch 'master' into bunch-of-log-sync-fixes

c5b1731

richardliaw added 4 commits May 16, 2019 15:03

Fix parts

7d7ced1

Merge branch 'tune-submit-fix' into bunch-of-log-sync-fixes

5ce47d7

Lint and other fixes

979a04c

fix some test

91dad93

richardliaw added 2 commits May 16, 2019 15:40

Remove extra parsing functionality

e3ecc72

Merge branch 'tune-relax-configs' into bunch-of-log-sync-fixes

26a538f

richardliaw added 2 commits May 16, 2019 17:23

some test fixes

b0f6218

Fix up cloud syncing

5ca8eca

richardliaw added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Jul 1, 2019

richardliaw added 2 commits July 1, 2019 09:25

Update test_cluster.py

fdebff3

betterdoc

cd135d7

hartikainen commented Jul 2, 2019

Update tune-usage.rst

652050f

richardliaw approved these changes Jul 2, 2019

richardliaw reviewed Jul 2, 2019

richardliaw added 4 commits July 2, 2019 12:19

cleanup

94e3cac

Merge branch 'bunch-of-log-sync-fixes' of github.com:hartikainen/ray into bunch-of-log-sync-fixes

6243211

Merge branch 'master' into bunch-of-log-sync-fixes

d3cefa8

nit

f825ec0

richardliaw merged commit 9e0192b into ray-project:master Jul 3, 2019

hartikainen deleted the bunch-of-log-sync-fixes branch July 3, 2019 04:36

lscheinkman added a commit to lscheinkman/nupic.research that referenced this pull request Jan 8, 2021

Fix ray.tune api change. See ray-project/ray#4450

3699497

lscheinkman mentioned this pull request Jan 8, 2021

Fix ray.tune api change numenta/nupic.research#432

Merged
