
[tune/train] Consolidate checkpoint manager 3: Ray Tune #24430

Merged Jun 8, 2022
109 commits
180c3b9
[tune/train] Consolidate checkpoint manager 1: Common checkpoint mana…
May 13, 2022
743ee43
[tune/train] Consolidate checkpoint manager 2: Ray Train
May 13, 2022
cfeefea
WIP
Apr 6, 2022
bd0858d
Continue consolidation
May 3, 2022
50525c8
Remove `_TuneCheckpoint` usages
May 3, 2022
5cdb304
test_checkpoint_manager.py is passing
May 3, 2022
e8da5b6
Fix trainer cp manager init
May 3, 2022
4f9ace7
Default empty memory checkpoint
May 4, 2022
114b5b9
Default train checkpoint strategy
May 4, 2022
c4eb5f2
checkpoint.value --> checkpoint.checkpoint_dir_or_data
May 4, 2022
2c67165
init latest checkpoint id
May 4, 2022
8a37de4
train latest_checkpoint property
May 4, 2022
5b218cd
Keep top checkpoints when keep_num is None
May 4, 2022
ef489c2
Fix train checkpoint bookkeeping and serialization
May 4, 2022
d2d7aae
Default checkpoint id
May 4, 2022
ea9eb50
Fix TuneCheckpointManager
May 4, 2022
74f2928
Fix latest checkpoint id increment
May 4, 2022
bb4ede2
Update delete fn
May 4, 2022
a92044e
Delete fn should be property of checkpoint manager, not TrackedCheckp…
May 5, 2022
5edffe5
fix train checkpoint deletion
May 5, 2022
01a6808
Clear data on commit
May 5, 2022
98aff7d
Pre-review
May 5, 2022
cba0450
training iteration -> timestamp
May 5, 2022
0363b4d
Update python/ray/util/ml_utils/checkpoint_manager.py
krfricke May 6, 2022
388ddde
Rename dataclass attributes
May 6, 2022
e566c8f
Update python/ray/train/checkpoint.py
krfricke May 6, 2022
3b72763
Default checkpoint score attr to None
May 9, 2022
a736f07
_TrackedCheckpoint -> TrackedCheckpoint
May 9, 2022
3081bae
_TrackedCheckpoint -> TrackedCheckpoint
May 9, 2022
ed5d68d
Adapt changes
May 9, 2022
9159bd6
Default checkpoint strategy
May 9, 2022
27e27ab
Error handling for delete fn
May 12, 2022
fb2d95e
Main entrypoint
May 13, 2022
51c94b3
Merge branch 'master' into tune-train/checkpoints-base
May 17, 2022
29ad103
Add general entrypoint
May 17, 2022
cb5eed3
[tune/train] Consolidate checkpoint manager 2: Ray Train
May 13, 2022
0e6c420
Adjust to changes in base PR
May 17, 2022
fa05c31
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
a8928aa
Adjust to changes in base PR
May 17, 2022
2bd5f75
[tune/train] Consolidate checkpoint manager 1: Common checkpoint mana…
May 13, 2022
eb3a658
Add general entrypoint
May 17, 2022
886157b
Fix faulty rebase
May 17, 2022
8a99f69
[tune/train] Consolidate checkpoint manager 2: Ray Train
May 13, 2022
7e48c0b
Adjust to changes in base PR
May 17, 2022
f4c234b
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
51fa584
Fix faulty rebase
May 17, 2022
1631805
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
731aef3
Fix faulty rebase
May 17, 2022
c76e0b1
Rename results - metrics
May 17, 2022
cdb9d49
fix delete fn
May 17, 2022
6ce138b
[tune/train] Consolidate checkpoint manager 1: Common checkpoint mana…
May 13, 2022
06fc42d
[tune/train] Consolidate checkpoint manager 2: Ray Train
May 13, 2022
bba91f3
Adjust to changes in base PR
May 17, 2022
28d00eb
Fix faulty rebase
May 17, 2022
fae9a54
[tune/train] Consolidate checkpoint manager 2: Ray Train
May 13, 2022
35f7b08
Undo changes
May 17, 2022
dd7f796
Restore common entrypoint
May 17, 2022
6e97532
Merge branch 'tune-train/checkpoints-base' into tune-train/checkpoint…
May 17, 2022
8933658
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
ca4c671
Adjust to changes in base PR
May 17, 2022
18e2ace
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
918c214
Do not persist memory checkpoints
May 17, 2022
7cf2645
Use enum
May 17, 2022
4a54cbe
Merge branch 'tune-train/checkpoints-base' into tune-train/checkpoint…
May 17, 2022
88f9fed
Use enum
May 17, 2022
f445438
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
May 17, 2022
59264ef
Use enum
May 17, 2022
839c7ab
Add tests
May 17, 2022
048e667
Re-order
May 17, 2022
4940d45
Merge branch 'tune-train/checkpoints-base' into tune-train/checkpoint…
May 17, 2022
4af1de7
Checkpoints are now directories, not files
May 17, 2022
50c155a
Merge branch 'master' into tune-train/checkpoints-train
May 25, 2022
47bccdd
Privatize _checkpoint.py
May 25, 2022
a0be76b
Privatize _checkpoint.py
May 25, 2022
7ec26ca
Rename variables
May 25, 2022
2d88325
Merge branch 'master' into tune-train/checkpoints-train
May 26, 2022
d1c5ddb
Optional init parameters
May 26, 2022
5eadbcd
Update DP trainer / HF trainer
May 27, 2022
e13da70
Merge branch 'master' into tune-train/checkpoints-train
Jun 2, 2022
58f6ee8
lint
Jun 2, 2022
5a88948
Fix tune tests
Jun 2, 2022
80e7c77
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 2, 2022
23948ff
Fix some merge issues
Jun 2, 2022
8c93459
Tune checkpoint manager
Jun 2, 2022
388c96a
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 2, 2022
b8b944d
Allow None values for memory checkpoints
Jun 2, 2022
60eb3a2
Fix huggingface trainer
Jun 2, 2022
a0b3109
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 2, 2022
27531a5
Merge remote-tracking branch 'upstream/master' into tune-train/checkp…
Jun 3, 2022
39c7f28
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 3, 2022
402e1ee
result -> metrics
Jun 3, 2022
df19b8a
Merge branch 'master' into tune-train/checkpoints-train
Jun 3, 2022
ef5a4e7
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 3, 2022
9f6dca4
ml -> air
Jun 3, 2022
8d8a9ca
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 3, 2022
9257914
Merge branch 'master' into tune-train/checkpoints-train
Jun 6, 2022
228dd27
Merge conflicts
Jun 6, 2022
9d984e0
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 6, 2022
c92f293
Fix CheckpointStorage
Jun 6, 2022
5411958
Fix tests with strings in memory checkpoints
Jun 6, 2022
73ea9e5
Address comments
Jun 7, 2022
a749722
Merge remote-tracking branch 'upstream/master' into tune-train/checkp…
Jun 7, 2022
4c55e4b
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 7, 2022
fceb5c3
Install pandas in minimal env
Jun 7, 2022
c79e5d3
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 7, 2022
5a670b1
Type checking import
Jun 7, 2022
6c0b077
Merge branch 'tune-train/checkpoints-train' into tune-train/checkpoints
Jun 7, 2022
74cabb2
Fix example
Jun 7, 2022
19562c9
Merge remote-tracking branch 'upstream/master' into tune-train/checkp…
Jun 7, 2022
2 changes: 1 addition & 1 deletion doc/source/tune/api_docs/trainable.rst
@@ -142,7 +142,7 @@ You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoin
"max_iter": 5
},
).trials
- last_ckpt = trial.checkpoint.value
+ last_ckpt = trial.checkpoint.dir_or_data
analysis = tune.run(train, config={"max_iter": 10}, restore=last_ckpt)

Tune also may copy or move checkpoints during the course of tuning. For this purpose,
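The diff above is part of the PR-wide rename of `trial.checkpoint.value` to `trial.checkpoint.dir_or_data`, reflecting that a tracked checkpoint now holds either a directory path (persistent) or raw data (in-memory). A minimal stdlib-only sketch of that idea, with hypothetical names (`TrackedCheckpointSketch` is not the actual Ray class):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union

@dataclass
class TrackedCheckpointSketch:
    """Hypothetical sketch of a consolidated tracked checkpoint.

    `dir_or_data` is either a checkpoint directory path (persistent
    checkpoint) or an in-memory payload (memory checkpoint); `metrics`
    carries the metrics reported alongside it (formerly `result`).
    """
    dir_or_data: Union[str, Dict[str, Any], None] = None
    metrics: Dict[str, Any] = field(default_factory=dict)

# A persistent checkpoint tracked by its directory:
cp = TrackedCheckpointSketch(
    dir_or_data="/tmp/ckpt_000005",
    metrics={"mean_accuracy": 0.91},
)
print(cp.dir_or_data)  # /tmp/ckpt_000005
```

Keeping both storage modes behind one attribute is what lets Tune and Train share a single checkpoint-manager implementation.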
2 changes: 1 addition & 1 deletion doc/source/tune/examples/tune-pytorch-cifar.ipynb
@@ -298,7 +298,7 @@
" device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\"\n",
" best_trained_model.to(device)\n",
"\n",
" checkpoint_path = os.path.join(best_trial.checkpoint.value, \"checkpoint\")\n",
" checkpoint_path = os.path.join(best_trial.checkpoint.dir_or_data, \"checkpoint\")\n",
"\n",
" model_state, optimizer_state = torch.load(checkpoint_path)\n",
" best_trained_model.load_state_dict(model_state)\n",
4 changes: 2 additions & 2 deletions doc/source/tune/examples/tune-serve-integration-mnist.ipynb
@@ -439,7 +439,7 @@
" best_trial = analysis.get_best_trial(\"mean_accuracy\", \"max\", \"last\")\n",
" best_accuracy = best_trial.metric_analysis[\"mean_accuracy\"][\"last\"]\n",
" best_trial_config = best_trial.config\n",
" best_checkpoint = best_trial.checkpoint.value\n",
" best_checkpoint = best_trial.checkpoint.dir_or_data\n",
"\n",
" return best_accuracy, best_trial_config, best_checkpoint, num_examples"
]
@@ -517,7 +517,7 @@
" best_trial = analysis.get_best_trial(\"mean_accuracy\", \"max\", \"last\")\n",
" best_accuracy = best_trial.metric_analysis[\"mean_accuracy\"][\"last\"]\n",
" best_trial_config = best_trial.config\n",
" best_checkpoint = best_trial.checkpoint.value\n",
" best_checkpoint = best_trial.checkpoint.dir_or_data\n",
"\n",
" return best_accuracy, best_trial_config, best_checkpoint, num_examples"
]
6 changes: 3 additions & 3 deletions python/ray/train/tests/test_tune.py
@@ -117,7 +117,7 @@ def train_func():
TestTrainable = trainer.to_tune_trainable(train_func)

[trial] = tune.run(TestTrainable).trials
- checkpoint_path = trial.checkpoint.value
+ checkpoint_path = trial.checkpoint.dir_or_data
assert os.path.exists(checkpoint_path)
checkpoint = Checkpoint.from_directory(checkpoint_path).to_dict()
assert checkpoint["hello"] == "world"
@@ -138,7 +138,7 @@ def train_func(config):
TestTrainable = trainer.to_tune_trainable(train_func)

[trial] = tune.run(TestTrainable, config={"max_iter": 5}).trials
- checkpoint_path = trial.checkpoint.value
+ checkpoint_path = trial.checkpoint.dir_or_data
checkpoint = Checkpoint.from_directory(checkpoint_path).to_dict()
assert checkpoint["iter"] == 4
analysis = tune.run(TestTrainable, config={"max_iter": 10}, restore=checkpoint_path)
@@ -164,7 +164,7 @@ def train_func():
TestTrainable = trainer.to_tune_trainable(train_func)

analysis = tune.run(TestTrainable, max_failures=3)
- checkpoint_path = analysis.trials[0].checkpoint.value
+ checkpoint_path = analysis.trials[0].checkpoint.dir_or_data
checkpoint = Checkpoint.from_directory(checkpoint_path).to_dict()
assert checkpoint["iter"] == 3

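The tests above persist a checkpoint to a directory and restore it as a dict via `Checkpoint.from_directory(...).to_dict()`. A self-contained stdlib sketch of that directory round-trip, under assumed names (`save_checkpoint_sketch`/`load_checkpoint_sketch` are illustrative, not Ray APIs):

```python
import json
import os
import tempfile
from typing import Any, Dict

def save_checkpoint_sketch(data: Dict[str, Any], directory: str) -> str:
    """Persist a dict checkpoint as a file inside a directory."""
    with open(os.path.join(directory, "checkpoint.json"), "w") as f:
        json.dump(data, f)
    return directory

def load_checkpoint_sketch(directory: str) -> Dict[str, Any]:
    """Restore the dict form of a directory checkpoint."""
    with open(os.path.join(directory, "checkpoint.json")) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    save_checkpoint_sketch({"hello": "world"}, d)
    restored = load_checkpoint_sketch(d)
    print(restored["hello"])  # world
```

The directory (not the file inside it) is the unit the manager tracks, which matches the "Checkpoints are now directories, not files" commit in this PR.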
3 changes: 2 additions & 1 deletion python/ray/tune/analysis/experiment_analysis.py
@@ -440,7 +440,8 @@ def get_trial_checkpoints_paths(
# Support metrics given as paths, e.g.
# "info/learner/default_policy/policy_loss".
return [
- (c.value, unflattened_lookup(metric, c.result)) for c in checkpoints
+ (c.dir_or_data, unflattened_lookup(metric, c.metrics))
+ for c in checkpoints
]
else:
raise ValueError("trial should be a string or a Trial instance.")
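The hunk above resolves metrics given as slash-delimited paths (e.g. `"info/learner/default_policy/policy_loss"`) against the checkpoint's nested `metrics` dict via `unflattened_lookup`. A stdlib sketch of how such a lookup can work (`lookup_sketch` is a hypothetical stand-in, not Ray's implementation):

```python
from functools import reduce
from typing import Any, Dict

def lookup_sketch(path: str, nested: Dict[str, Any], delimiter: str = "/") -> Any:
    """Resolve a delimiter-separated key path in a nested dict."""
    # Walk one level deeper per path component.
    return reduce(lambda d, key: d[key], path.split(delimiter), nested)

metrics = {"info": {"learner": {"default_policy": {"policy_loss": 0.42}}}}
print(lookup_sketch("info/learner/default_policy/policy_loss", metrics))  # 0.42
```

Path-style keys let callers address deeply nested reported metrics without flattening the result dict first.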
4 changes: 2 additions & 2 deletions python/ray/tune/callback.py
@@ -2,8 +2,8 @@
from abc import ABCMeta
import warnings

- from ray.tune.checkpoint_manager import _TuneCheckpoint
from ray.util.annotations import PublicAPI, DeveloperAPI
+ from ray.util.ml_utils.checkpoint_manager import _TrackedCheckpoint

if TYPE_CHECKING:
from ray.tune.trial import Trial
@@ -245,7 +245,7 @@ def on_checkpoint(
iteration: int,
trials: List["Trial"],
trial: "Trial",
- checkpoint: _TuneCheckpoint,
+ checkpoint: _TrackedCheckpoint,
**info,
):
"""Called after a trial saved a checkpoint with Tune.
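After this change the `on_checkpoint` callback hook receives the consolidated `_TrackedCheckpoint` type instead of Tune's private `_TuneCheckpoint`. A self-contained stdlib sketch of a callback consuming such objects; `CheckpointLogger` is hypothetical, and `SimpleNamespace` stands in for a `_TrackedCheckpoint`-like object:

```python
from types import SimpleNamespace
from typing import Any, List

class CheckpointLogger:
    """Hypothetical callback sketch: records each checkpoint's dir_or_data."""

    def __init__(self) -> None:
        self.seen: List[Any] = []

    def on_checkpoint(self, iteration: int, checkpoint: Any, **info: Any) -> None:
        # `checkpoint` only needs a `dir_or_data` attribute here, which
        # both persistent and memory checkpoints expose after the rename.
        self.seen.append(getattr(checkpoint, "dir_or_data", None))

cb = CheckpointLogger()
cb.on_checkpoint(iteration=1, checkpoint=SimpleNamespace(dir_or_data="/tmp/ckpt_000001"))
print(cb.seen)  # ['/tmp/ckpt_000001']
```

Because callbacks only depend on the shared attribute names, the same hook works whether the checkpoint originated in Tune or Train.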