[Tune] Fix checkpoint directory assignment for new checkpoints created after restoring a function trainable #31231

justinvyu · 2022-12-20T09:28:54Z

This PR fixes checkpoint directory creation for restored function trainables to use the restored iteration instead of starting over from checkpoint_000000.

Why are these changes needed?

The _StatusReporter that handles checkpoint directory creation for function trainables keeps track of an _iter counter that closely follows the Trainable's training_iteration and is used to name checkpoint directories (e.g., checkpoint_000000).

Upon restoring a trial, this counter is not restored and starts again from 0, so a new checkpoint can overwrite an old one at the checkpoint_000000 path.

The linked issue below has more details, but the basic failure case is:

1. 1st checkpoint comes in, saved under checkpoint_000000
2. Experiment is interrupted and gets restored
3. 2nd checkpoint comes in, but still saved under checkpoint_000000
4. 1st checkpoint is now overwritten
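
The failure case above can be reproduced in miniature. This is only a sketch: `NaiveReporter` and `checkpoint_dir_name` are hypothetical stand-ins, not Ray Tune classes, and the six-digit zero padding is assumed from the checkpoint_000000 name used in this description.

```python
import os
import tempfile


def checkpoint_dir_name(index: int) -> str:
    # Assumed naming scheme: six-digit zero padding, e.g. checkpoint_000000.
    return f"checkpoint_{index:06d}"


class NaiveReporter:
    """Hypothetical reporter that keeps its own counter (not a Ray class)."""

    def __init__(self, logdir: str):
        self.logdir = logdir
        self._iter = 0  # never restored, so every new reporter starts at 0

    def save(self, payload: str) -> str:
        path = os.path.join(self.logdir, checkpoint_dir_name(self._iter))
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "data.txt"), "w") as f:
            f.write(payload)
        self._iter += 1
        return path


logdir = tempfile.mkdtemp()

# 1st checkpoint comes in.
first = NaiveReporter(logdir).save("first checkpoint")

# Experiment is interrupted; restoring creates a fresh reporter.
second = NaiveReporter(logdir).save("second checkpoint")

# Both saves landed in the same directory, so the first payload is gone.
with open(os.path.join(first, "data.txt")) as f:
    surviving = f.read()
```

Here `first` and `second` are the same path, and `surviving` holds only the second payload.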

Solution

Don't keep track of an _iter separately in the session - use the trainable's current training_iteration instead.
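
A minimal sketch of this idea, assuming the session is handed a callable that returns the trainable's current iteration. The parameter name `training_iteration_func` mirrors the signature in the diff excerpt from the review; the `Reporter` and `Trainable` classes here are simplified, hypothetical stand-ins rather than the actual Ray implementation.

```python
from typing import Callable


class Reporter:
    """Hypothetical reporter that asks the trainable for its iteration."""

    def __init__(self, training_iteration_func: Callable[[], int]):
        # No private counter: the trainable is the single source of truth.
        self._training_iteration = training_iteration_func

    def next_checkpoint_dir(self) -> str:
        # Assumed naming scheme: six-digit zero padding.
        return f"checkpoint_{self._training_iteration():06d}"


class Trainable:
    """Hypothetical trainable whose iteration survives a restore."""

    def __init__(self, restored_iteration: int = 0):
        self.iteration = restored_iteration
        self.reporter = Reporter(lambda: self.iteration)

    def step(self) -> str:
        # The directory is named before the iteration is incremented,
        # which keeps function trainable checkpoints 0-indexed.
        name = self.reporter.next_checkpoint_dir()
        self.iteration += 1
        return name


fresh = Trainable()
restored = Trainable(restored_iteration=1)
```

With this arrangement `fresh.step()` yields checkpoint_000000, while `restored.step()` yields checkpoint_000001 instead of reusing the existing directory.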

Open Questions

Class trainables and function trainables index checkpoints differently. Class trainables are 1-indexed, with checkpoint numbers matching the training_iteration, so the first checkpoint is saved as checkpoint_000001. Function trainables are 0-indexed. This PR could make the two consistent, and I think the class trainable indexing makes more sense than always being off by one with respect to the iteration number.

Decision: Changing the indexing would break backwards compatibility when restoring experiments that have 0-indexed checkpoints, so it is left for a future PR.
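
To make the indexing difference concrete, here is a small illustration. Both helper functions are hypothetical and assume a checkpoint is saved on every iteration with six-digit zero padding; they only restate the naming described above.

```python
def class_trainable_dir(training_iteration: int) -> str:
    # 1-indexed: the directory number matches the reported iteration.
    return f"checkpoint_{training_iteration:06d}"


def function_trainable_dir(training_iteration: int) -> str:
    # 0-indexed: the directory number trails the reported iteration by one.
    return f"checkpoint_{training_iteration - 1:06d}"


# Directory names for the first three reported iterations.
class_dirs = [class_trainable_dir(i) for i in (1, 2, 3)]
function_dirs = [function_trainable_dir(i) for i in (1, 2, 3)]
```

For the first reported iteration, the class trainable writes checkpoint_000001 while the function trainable writes checkpoint_000000.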

Related issue number

Closes #29947

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>

krfricke

Looks good, only one clarification

krfricke · 2022-12-21T13:40:02Z

python/ray/tune/trainable/function_trainable.py

+        result_queue: queue.Queue,
+        continue_semaphore: threading.Semaphore,
+        end_event: threading.Event,
+        training_iteration_func: Callable[[], int],
+        experiment_name: Optional[str] = None,
+        trial_name: Optional[str] = None,
+        trial_id: Optional[str] = None,
+        logdir: Optional[str] = None,
+        trial_resources: Optional[Union[Resources, PlacementGroupFactory]] = None,


…31423) This PR is a follow-up to #31231 to save checkpoints to the correctly indexed directory upon restore. The "latest checkpoint ID" that's used to generate the next checkpoint directory (`checkpoint_0000<latest_checkpoint_id>`) is off by one when restoring an AIR trainer. Signed-off-by: Justin Yu <[email protected]>

justinvyu added 4 commits December 19, 2022 18:54

Tie StatusReporter checkpoint iter to trainable iteration

c078943

Signed-off-by: Justin Yu <[email protected]>

Revert back to checkpoints being 0-indexed

df2c34a

Signed-off-by: Justin Yu <[email protected]>

Add unit test for checkpointing after resume

7d5ee61

Signed-off-by: Justin Yu <[email protected]>

Fix marker for session report vs. tune report

766e574

Signed-off-by: Justin Yu <[email protected]>

justinvyu mentioned this pull request Dec 20, 2022

[Tune] Reporting a new checkpoint with metadata from a loaded checkpoint can cause issues #31248

Closed

justinvyu marked this pull request as ready for review December 20, 2022 22:01

justinvyu assigned Yard1 and krfricke Dec 20, 2022

justinvyu requested review from krfricke and Yard1 December 20, 2022 22:02

krfricke approved these changes Dec 21, 2022

justinvyu added 2 commits December 21, 2022 09:52

Clarify naming of air session has_reported marker

97d8aa4

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into rest…

1e1a14d

…ored_checkpoint_idx

krfricke merged commit 77b94ab into ray-project:master Dec 22, 2022

justinvyu mentioned this pull request Jan 4, 2023

[Train] Fix off-by-one AIR Trainer checkpoint ID indexing on restore #31423

Merged

7 tasks

justinvyu deleted the restored_checkpoint_idx branch April 10, 2023 23:56
