[air] pyarrow.fs persistence (9/n): ray.train.Checkpoint restore: Manual restore #38128
Merged: ericl merged 144 commits into ray-project:master from justinvyu:air/persistence/restore_new_checkpoint on Aug 10, 2023
abb1307
Pipe storage context to Trainable (used now for Trainable syncing)
justinvyu f6ff90a
Don't use the storage context in the trial/trainable
justinvyu 562369f
Disable all trainable syncing in new codepath
justinvyu 95a3d20
Pipe storage context to Train workers (not actually used yet)
justinvyu 484e67f
Fix race condition for setting checkpoint_uri
justinvyu 2148669
Fix cyclical import
justinvyu 8c856b8
Add simple trainer test
justinvyu 78c525f
Add legacy prefix to train session checkpoint uri
justinvyu e97f471
Add new checkpoint class
justinvyu 64945be
New train session report implementation using new checkpoint
justinvyu c6480c9
Simplify checkpoint propagation from user code (in worker) -> trainer…
justinvyu c681ccb
New tune session.report
justinvyu 795bafe
Save direction works with new checkpoint API
justinvyu 8a084bc
Update test with e2e trainer test
justinvyu 725d802
Make callback supporting new checkpoint a todo for now
justinvyu 877acb9
Remove unnecessary comment
justinvyu ee4ccbd
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 88042b3
Separate out the new set checkpoint id from the old set checkpoint uri
justinvyu a5eeab2
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu a6cd9dc
Update id -> index
justinvyu 01f34bb
Address comments on error to raise with old ckpt type
justinvyu 65e7a27
Move checkpoint upload logic to a helper fn of storage ctx
justinvyu f2a4c36
Drop a checkpoint marker after uploading
justinvyu 49ee126
Add a simplified checkpoint manager
justinvyu ffa0dd4
Fixes to checkpoint manager
justinvyu 15553f7
Add unit test for simplified checkpoint manager
justinvyu 00cc9d7
Full test coverage
justinvyu cb5990e
Add a simplified checkpoint manager
justinvyu 2db9aae
Fixes to checkpoint manager
justinvyu a2067b7
Add unit test for simplified checkpoint manager
justinvyu f1216f2
Full test coverage
justinvyu d4243e6
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 6699d81
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 9b9ff34
Simplify even more
justinvyu 83aecd9
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 913af10
Patch fix for circular imports
justinvyu 6b5d34e
Use new checkpoint manager in Tune ckpt book-keeping
justinvyu 24f441a
Update result to return a train.Checkpoint to the user
justinvyu 504ed54
Update e2e test to try multiple ckpt configs for trainer test
justinvyu 1992161
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu b9eb88f
Fix lint for trial.py
justinvyu a6115b3
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 7cc74d9
Rename _TrackedCheckpoint -> _TrainingResult
justinvyu 6a0e1fb
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 4662789
Merge branch 'air/persistence/simplified_ckpt_manager' into air/persi…
justinvyu 8da0477
Fixes after merging latest ckpt manager changes
justinvyu 255b149
Remove prints / convert to logger.debug
justinvyu 0971aca
Don't set training iteration as the default checkpoint_score_attr
justinvyu 6e7a873
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 6f6a341
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu d9804b0
Fix test to reflect working dir change
justinvyu 318158f
Don't upload a .is_checkpoint marker
justinvyu a54664c
Add back cwd check
justinvyu c4263ec
Update the dir trees + better naming for ckpt shards and artifacts
justinvyu 0cd7e47
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 3a6eba6
A different fix for the circular dep
justinvyu b65e9fe
Update checkpoint -> _checkpoint imports
justinvyu b89bd1c
fix lint
justinvyu cada06a
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 3b784d7
Revert all changes to ckpt manager
justinvyu 49c1ead
Don't set checkpoint user metadata
justinvyu 7177940
Remove remaining print
justinvyu ae8a9ec
Add trial_path property to storage ctx
justinvyu c1c8441
Use storage context for all experiment/trial path properties
justinvyu 5d2ca07
Don't skip trainer test cases for custom_fs
justinvyu 1fcfb3f
Split some utilities into helper methods + test for ResultGrid paths
justinvyu 5e2a933
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 61fdadf
Prepend legacy to old path attributes in trial
justinvyu d38cd87
Remove todo
justinvyu 3ba944a
Bump the test size
justinvyu 5f71608
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 52d6c14
Merge branch 'air/persistence/new_checkpoint' into air/persistence/fi…
justinvyu 9c16120
Clean up experiment path handling
justinvyu 76468d9
Fix for base trainer
justinvyu b17c17e
Fix for base trainer pt 2
justinvyu 30e3328
Add in missing legacy property
justinvyu f25ad39
Prepend legacy to old path attributes in experiment
justinvyu e1846ec
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu d11ede8
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu c99d30c
too much space
justinvyu 9d11d2d
remove unused var
justinvyu 950b991
Fix lint
justinvyu ed86255
restore mostly works
justinvyu de4b924
hacky way of getting checkpoint folders to increment correctly
justinvyu e060476
Fix for xgboost trainer
justinvyu bd5c846
Fix race as_directory / download file lock race condition
justinvyu e51fb17
Update test with auto-recovery fault tolerance
justinvyu 6368075
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu eaa26c5
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 0c3c5c8
compute storage_prefix
justinvyu 217af77
Remove '_path' properties from storage
justinvyu 8e2330c
Move exp dir name helper to storage ctx
justinvyu 7502cca
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu f3f22fd
Fix bugs causing broken CI
justinvyu 2c9adf4
Merge branch 'air/persistence/fix_custom_fs_path_expansion' into air/…
justinvyu dc1c7c3
Fix syncing needed logic to handle storage path == local path case
justinvyu a262cb3
working for manual Trainer.restore
justinvyu e59b408
Add manual restore to e2e test
justinvyu c6c3dfe
Fix renamed attribute in mock test class
justinvyu 4e56bd0
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 314e8bd
Merge branch 'air/persistence/fix_custom_fs_path_expansion' into air/…
justinvyu 36464af
fix storage attr setting to only happen if ff enabled
justinvyu 6e73f6e
cleanup on errors in as_directory
justinvyu c7de72a
Support + test resume_from_checkpoint
justinvyu db016da
Fix result grid bug when no checkpoints saved
justinvyu bcbcec9
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu d7497e1
fix merge conflict remainder
justinvyu aeb89ba
Recover trainable metadata from last_result rather than .tune_metadata
justinvyu 9556371
Fix restore info log
justinvyu e897eaa
Keep current checkpoint index synchronized on the driver
justinvyu 3eef417
Remove checkpoint dirname parsing
justinvyu 0e52384
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 89631ab
Update todo comment
justinvyu d4e20f2
Fix lint
justinvyu 877c4ae
Merge branch 'air/persistence/restore_new_checkpoint_autoft' into air…
justinvyu 194dc37
Merge branch 'air/persistence/restore_new_checkpoint_autoft' into air…
justinvyu 5c50dd4
Some small imports cleanup
justinvyu d367e8d
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu d329e1b
Fix e2e test for storage_path=None case
justinvyu e382e29
Remove unused code
justinvyu 550575e
Merge branch 'air/persistence/restore_new_checkpoint_rfc' into air/pe…
justinvyu 86367ef
Guard new codepath correctly
justinvyu 3317e3b
Separate out fs resolution into a helper
justinvyu 6fb9064
Add custom filesystem arg on restore
justinvyu f388dcc
Don't skip the custom fs test case for restore
justinvyu d881c0d
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu ade1078
clean up some imports
justinvyu 8b78dee
Fix the test fixtures
justinvyu 075c6a6
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu cf963fd
Remove done todo
justinvyu 1d87dde
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu ca5f5bf
Fix optional can_restore argument
justinvyu 75e947c
Remove duplicate in test
justinvyu 5c1c282
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu a13d88d
read file directly from fs for trainer restore
justinvyu 7e468c6
check for existence rather than list
justinvyu 13c224f
Update tuner restore
justinvyu 1085666
Mark experiment_checkpoint_dir as legacy
justinvyu 0957d68
Revert changes to sync down logic in trainer
justinvyu 81c307f
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu 1716fa7
Fix lint
justinvyu 5b6be43
minor fixes
justinvyu 413ed38
Merge branch 'master' of https://github.com/ray-project/ray into air/…
justinvyu ca0a3a0
Remove backwards compatibility test
justinvyu
This was just a passthrough - no need for this class to implement it.
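A minimal sketch of what "passthrough" means here, assuming a plain Python class hierarchy. The names `BaseSyncer`, `sync_up`, `WithPassthrough`, and `WithoutPassthrough` are hypothetical and are not taken from this PR. An override that only forwards to the parent behaves the same as having no override, so it can be dropped:

```python
class BaseSyncer:
    """Hypothetical base class that owns the real implementation."""

    def sync_up(self, local_dir: str, remote_dir: str) -> str:
        return f"synced {local_dir} -> {remote_dir}"


class WithPassthrough(BaseSyncer):
    """Subclass with a redundant override that only forwards to the parent."""

    def sync_up(self, local_dir: str, remote_dir: str) -> str:
        # Nothing is added or changed here; this is the passthrough.
        return super().sync_up(local_dir, remote_dir)


class WithoutPassthrough(BaseSyncer):
    """Subclass that inherits sync_up directly, with no override."""


# Both subclasses produce the same result for the same arguments.
result_with = WithPassthrough().sync_up("/tmp/run", "s3://bucket/run")
result_without = WithoutPassthrough().sync_up("/tmp/run", "s3://bucket/run")
```

Removing the override leaves one fewer method to keep in step with the base class signature, which is presumably the motivation for the change under review.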