[tune] Change the log syncing behavior #4450
doc/source/tune-usage.rst
Outdated
@@ -259,7 +259,7 @@ of a trial, you can additionally set the checkpoint_at_end to True. An example i
Recovering From Failures (Experimental)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed with ``resume=True``. The default setting of ``resume=False`` creates a new experiment, and ``resume="prompt"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
+ Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restore the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
Suggested change:
- Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restore the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
+ Tune automatically persists the progress of your experiments, so if an experiment crashes or is otherwise cancelled, it can be resumed by passing one of True, False, "LOCAL", "REMOTE", or "PROMPT" to ``tune.run(resume=...)``. The default setting of ``resume=False`` creates a new experiment. ``resume="LOCAL"`` and ``resume=True`` restore the experiment from ``local_dir/[experiment_name]``. ``resume="REMOTE"`` syncs the upload dir down to the local dir and then restores the experiment from ``local_dir/experiment_name``. ``resume="PROMPT"`` will cause Tune to prompt you for whether you want to resume. You can always force a new experiment to be created by changing the experiment name.
nice catch
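For readers following this thread, the resume semantics described in the revised docs paragraph can be sketched roughly as below. This is only an illustration under stated assumptions: `resolve_resume` and the dictionary it returns are hypothetical names, not part of this PR or of Tune, and the handling of `"PROMPT"` (ask the user, then restore from the local path) is an assumption based on the docs wording.

```python
import os


def resolve_resume(resume, local_dir, experiment_name):
    """Hypothetical helper: describe what each documented resume value implies.

    Returns None when a new experiment should be created, otherwise a dict
    describing where to restore from and which extra steps come first.
    """
    if isinstance(resume, bool):
        # Per the docs: resume=True behaves like "LOCAL", resume=False starts fresh.
        mode = "LOCAL" if resume else None
    elif isinstance(resume, str) and resume in ("LOCAL", "REMOTE", "PROMPT"):
        mode = resume
    else:
        raise ValueError(
            'resume must be one of True, False, "LOCAL", "REMOTE", "PROMPT"; '
            "got {!r}".format(resume))

    if mode is None:
        return None

    return {
        "mode": mode,
        # All resuming modes restore from local_dir/experiment_name.
        "restore_path": os.path.join(local_dir, experiment_name),
        # "REMOTE" first syncs the upload dir down to the local dir.
        "sync_down_first": mode == "REMOTE",
        # "PROMPT" asks the user before resuming (assumed behavior).
        "ask_user": mode == "PROMPT",
    }
```

With this sketch, `resolve_resume("REMOTE", local_dir, name)` differs from `resolve_resume(True, local_dir, name)` only in the `sync_down_first` flag, which matches the distinction the docs draw between the two modes.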
Looks good to me!
Awesome!
python/ray/tune/tune.py
Outdated
else:
    logger.info("Tip: to resume incomplete experiments, "
                "pass resume='prompt' or resume=True to run()")
def _get_resume_path(local_checkpoint_dir, remote_checkpoint_dir):
this is extraneous
python/ray/tune/trial_runner.py
Outdated
def _validate_resume(self, resume_type):
    """
    Args:
        resume_type: One of "REMOTE", "LOCAL", "PROMPT".
Suggested change:
- resume_type: One of "REMOTE", "LOCAL", "PROMPT".
+ resume_type: One of "REMOTE", "LOCAL", True, "PROMPT".
python/ray/tune/trial_runner.py
Outdated
self._metadata_checkpoint_dir = metadata_checkpoint_dir
self._local_checkpoint_dir = local_checkpoint_dir

# TODO(rliaw): This may fail
# TODO(rliaw): This may fail
python/ray/tune/syncer.py
Outdated
Args:
    local_dir: Source directory for syncing.
    remote_dir: Target directory for syncing. If None,
        returns NoopSyncer.
Suggested change:
- returns NoopSyncer.
+ returns BaseSyncer with a noop.
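To make the suggested wording concrete, here is a minimal sketch of what "a BaseSyncer with a noop" could look like. It is not the code in `syncer.py`: the factory name `make_syncer`, the constructor arguments, and the `sync_up`/`sync_down` methods are all assumptions chosen for illustration. Only the class name `BaseSyncer` and the "no remote_dir means no-op" behavior come from the docstring under review.

```python
def noop(source, target):
    """Sync function that deliberately does nothing."""
    return None


class BaseSyncer:
    """Hypothetical syncer: delegates the actual copy to sync_function."""

    def __init__(self, local_dir, remote_dir, sync_function=noop):
        self.local_dir = local_dir
        self.remote_dir = remote_dir
        self.sync_function = sync_function

    def sync_up(self):
        # Local is the source, remote is the target.
        return self.sync_function(self.local_dir, self.remote_dir)

    def sync_down(self):
        # Remote is the source, local is the target.
        return self.sync_function(self.remote_dir, self.local_dir)


def make_syncer(local_dir, remote_dir=None, sync_function=None):
    """Hypothetical factory: fall back to a no-op when there is nothing to sync to."""
    if remote_dir is None or sync_function is None:
        return BaseSyncer(local_dir, remote_dir, noop)
    return BaseSyncer(local_dir, remote_dir, sync_function)
```

One argument for this shape over a separate `NoopSyncer` class is that callers can always call `sync_up()` and `sync_down()` without checking which kind of syncer they received.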
What do these changes do?
Refactor the log sync behavior.
TODOs:
- Fix up Trial syncing (remove remote capabilities from Trial?)
- sync_function for remote to driver syncing. This may require a bit of restructuring to LogSyncing as a mixin.
- Write Tests: test_cluster.py. Punting on this one because it is captured in e2e ft test.
- os.path.expanduser on remote node (this is done in [tune] Later expansion of local_dir #4806)
- Docs
- DeprecationWarning
- Tests
Related issue number
@richardliaw