[train+tune] Local directory refactor (2/n): Separate driver artifacts and trial working directories #43403
Signed-off-by: Justin Yu <[email protected]>
Nice to see the tests become green!
As this is a large PR that changes a number of behaviors and tests, can you summarize the following in the PR description?
- Behavior changes
  - changes in the way of driver syncing (directly writing to the storage path)
  - how the trial and driver artifact directories are separated
  - changes in the default storage path (ray_storage_uri)
  - ...
- The compromises we've made
- The bypassed unit tests
  - sync_artifacts
  - ...
- The TODOs
  - docs and FAQs to update
Also, before we merge the PR, it'd be good to run a few release tests to ensure it also works in a multi-node setting.
# Timestamp is used to create a unique session directory for the current
# training job. This is used to avoid conflicts when multiple training jobs
# run with the same name in the same cluster.
# This is set ONCE at the creation of the storage context, on the driver.
Can you help me remember how we resolved the consistency issue with the timestamp? Did we resolve it by writing files into the driver directory and using background syncing?
See the first bullet point in the PR description under "Some undesirable workarounds made in this PR."
# NOTE: The restored run should reuse the same driver staging directory.
self._storage._timestamp = trials[0].storage._timestamp
Workaround to make sure that a restored run uses the same timestamped staging dir.
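A minimal sketch of the idea behind this workaround, not the actual Ray implementation. The class `StagingContext`, the function `restore`, and the directory layout are hypothetical names chosen for illustration; the only behavior taken from the discussion is that the timestamp is set once at creation and a restored run reuses it.

```python
import os
from datetime import datetime
from typing import Optional


class StagingContext:
    """Hypothetical stand-in for a storage context (not Ray's StorageContext)."""

    def __init__(
        self,
        temp_dir: str,
        experiment_name: str,
        timestamp: Optional[str] = None,
    ):
        self.temp_dir = temp_dir
        self.experiment_name = experiment_name
        # Set ONCE at creation (on the driver). A restored run passes in the
        # original value instead of generating a new one.
        self.timestamp = timestamp or datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

    @property
    def driver_staging_path(self) -> str:
        # Assumed layout, for illustration only:
        # <temp_dir>/artifacts/<timestamp>/<experiment_name>/driver_artifacts
        return os.path.join(
            self.temp_dir,
            "artifacts",
            self.timestamp,
            self.experiment_name,
            "driver_artifacts",
        )


def restore(original: StagingContext) -> StagingContext:
    """Build a context for a restored run that reuses the original timestamp."""
    return StagingContext(
        original.temp_dir,
        original.experiment_name,
        timestamp=original.timestamp,
    )
```

Under these assumptions, two runs that share a name but start at different times get different staging directories, while a restored run resolves to the same directory as the run it was restored from.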
I removed the |
Change summary
- … local_dir concept as a driver staging directory and moves its default location from ~/ray_results to a subfolder in the ray temp directory (/tmp/ray/session_*/artifacts).
- … ray.init(_temp_dir=...).
- … storage.trial_working_directory and will only be synced if sync_artifacts=True.
- … storage_path resolution.
- storage_path is now resolved immediately on RunConfig initialization, rather than resolving downstream during tune.run. It gets set to the Ray storage URI if that's set up -- otherwise, it defaults to ~/ray_results.
- ~/ray_results was already the default location of storage_path. The difference now is that ~/ray_results will no longer be populated if the storage_path is set to something else.
- … syncer=None special case when local_dir == storage_path.
- … local_dir, (2) they get uploaded to storage_path in a background task (this is "driver syncing").
- … SyncConfig(sync_timeout), which is 5 minutes by default. Now that the local directory only contains driver artifacts, the upload is not so expensive, making it okay to upload synchronously right after saving.
- … storage_path.
- … num_to_keep now that uploading driver files is no longer gated by the 5 minute sync_period default. That workaround was intended to increase driver checkpointing frequency, which is also achieved by this change.
- … RunConfig(name) don't have conflicting staging directories.
- … tuner.pkl, trainer.pkl) directly to storage #43369 (comment)
- StorageContext gets a property of the current timestamp upon creation (which happens on the driver once).
- … /tmp/ray/session_* folder is only available when ray init has been called. I included the ray start fixture in many unit tests to get around this.
- mock_storage_context patches out this directory to a tempdir.
- … Trial and Experiment path properties once and for all.
- … RAY_AIR_LOCAL_CACHE_DIR everywhere.
- ray.train.get_context().get_local_dir doesn't really make sense anymore.
Context / motivation for this change
What are the main user problems to solve?
- … ~/ray_results to the ray session dir that already gets populated and is easily accessible / configurable.
- … lightning_logs, wandb, transformers output_dir) in the working directory can be synced unintentionally #40634
- … RunConfig(name) and run multiple experiments, each consecutive experiment will upload ALL “trial directories” so far from the local ~/ray_results.
- … ~/ray_results/<name> folder.
Related issue number
Closes #40634
Closes #40009
Closes #38522
#42630
Checks
- … git commit -s) in this PR.
- … scripts/format.sh to lint the changes in this PR.
- … method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
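For illustration, the storage_path resolution order described in the change summary (an explicitly set path, else the Ray storage URI if one is set up, else ~/ray_results) could be sketched roughly as follows. The function `resolve_storage_path` and its parameters are hypothetical and are not part of Ray's API; the precedence of an explicitly passed path over the storage URI is an assumption.

```python
import os
from typing import Optional


def resolve_storage_path(
    storage_path: Optional[str] = None,
    ray_storage_uri: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the resolution order in the summary."""
    # Assumption: an explicitly configured storage_path always wins.
    if storage_path is not None:
        return storage_path
    # Otherwise fall back to the Ray storage URI, if one is set up.
    if ray_storage_uri is not None:
        return ray_storage_uri
    # Final default, as stated in the change summary.
    return os.path.expanduser("~/ray_results")
```

Resolving eagerly like this, at configuration time rather than later during the run, means every downstream component sees the same final path.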