[train+tune] Local directory refactor (2/n): Separate driver artifacts and trial working directories #43403

Merged

Conversation

@justinvyu (Contributor) commented Feb 23, 2024

Change summary

  • Behavior changes:
    • Rebrands the local_dir concept as a driver staging directory and moves its default location from ~/ray_results to a subfolder in the Ray temp directory (/tmp/ray/session_*/artifacts).
      • Delegates the customization of this folder to ray.init(_temp_dir=...) (see the configuration sketch after this list).
      • Increases the visibility of this staging directory so that users can find and delete it if they want to. Ray already logs things to this directory, so it's a natural directory for users to interact with.
    • Provides trial actors (Tune) and distributed training workers (Train) an empty working directory that will only be populated by worker artifacts.
      • This working directory is the storage.trial_working_directory and will only be synced if sync_artifacts=True.
      • This is the same behavior as before, with the added benefit that driver syncing no longer unintentionally uploads worker artifacts.
    • Simplifies the default storage_path resolution.
      • storage_path is now resolved immediately on RunConfig initialization, rather than being resolved downstream during tune.run. It gets set to the Ray storage URI if that's set up -- otherwise, it defaults to ~/ray_results (see the resolution sketch after this list).
      • Prior to this PR, ~/ray_results was already the default location of storage_path. The difference now is that ~/ray_results will no longer be populated if storage_path is set to something else.
    • Syncing from the staging directory to the storage path is now the only codepath.
      • There is no more syncer=None special case when local_dir == storage_path.
      • This means that for single-node runs, there is some unnecessary file copying, but this is an acceptable trade-off for simplifying the codepaths.
    • Experiment checkpoint saving and uploading now happens synchronously.
      • Previously, this happened in 2 steps: (1) the files would get saved to local_dir, (2) they would get uploaded to storage_path in a background task (this is "driver syncing").
      • The frequency of the 2nd step was gated by SyncConfig(sync_period), which is 5 minutes by default. Now that the local directory only contains driver artifacts, the upload is not so expensive, making it okay to upload synchronously right after saving.
      • The benefit of this is better consistency of the experiment state in storage_path.
      • I also removed the forced syncing behavior of num_to_keep now that uploading driver files is no longer gated by the 5-minute sync_period default. That workaround was intended to increase driver checkpointing frequency, which this change also achieves.
      • See 1 and 2 for context on the motivation for this change.
  • Some undesirable workarounds made in this PR:
    • Adding a unique timestamp to the path of the driver staging directory so that different experiments with the same RunConfig(name) don't have conflicting staging directories.
    • The /tmp/ray/session_* folder is only available once ray.init() has been called. I included the Ray start fixture in many unit tests to get around this; mock_storage_context patches this directory out to a tempdir.
  • Remaining TODOs for follow-ups:
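
To make the new surface area concrete, here is a minimal sketch of how these pieces fit together from a user's perspective after this change. The temp dir path, experiment name, metric, and `train_fn` are illustrative only, and the snippet assumes the Ray 2.x `ray.train` / `ray.tune` APIs:

```python
import os
import ray
from ray import train, tune

# The driver staging directory now lives under the Ray temp directory
# (/tmp/ray/session_*/artifacts by default); its location is customized
# via ray.init(_temp_dir=...) rather than a Tune-specific local_dir.
ray.init(_temp_dir="/tmp/custom_ray_tmp")

def train_fn(config):
    # Trial actors start in an empty working directory. Anything written
    # here is a worker artifact and is only uploaded to storage_path
    # when SyncConfig(sync_artifacts=True).
    with open("artifact.txt", "w") as f:
        f.write("some trial output")
    train.report({"score": 1.0})

tuner = tune.Tuner(
    train_fn,
    run_config=train.RunConfig(
        name="my_experiment",
        # If omitted, this would resolve to the Ray storage URI (when one
        # is configured) or fall back to ~/ray_results.
        storage_path=os.path.expanduser("~/ray_results"),
        sync_config=train.SyncConfig(sync_artifacts=True),
    ),
)
tuner.fit()
```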
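And a rough pseudocode sketch of the simplified storage_path resolution order described above. This is not the actual implementation; get_ray_storage_uri is a stand-in defined here purely for illustration:

```python
import os
from typing import Optional

def get_ray_storage_uri() -> Optional[str]:
    # Illustrative stand-in for "is a Ray storage URI configured for this
    # cluster?" -- read from an environment variable only for this sketch.
    return os.environ.get("ILLUSTRATIVE_RAY_STORAGE_URI")

def resolve_storage_path(storage_path: Optional[str]) -> str:
    # Resolution now happens eagerly, at RunConfig initialization time:
    # explicit user setting > configured Ray storage URI > ~/ray_results.
    if storage_path is not None:
        return storage_path
    ray_storage_uri = get_ray_storage_uri()
    if ray_storage_uri:
        return ray_storage_uri
    return os.path.expanduser("~/ray_results")
```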

Context / motivation for this change

What are the main user problems to solve?

Related issue number

Closes #40634
Closes #40009
Closes #38522

#42630

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>
@woshiyyya (Member) left a comment

Nice to see the tests become green!

As this is a large PR that changes a bunch of behaviors and tests, can you summarize the following things in the PR description?

  • Behavior changes
    • changes to the way driver syncing works (writing directly to the storage path)
    • how the trial and driver artifact directories are separated
    • changes to the default storage path (ray_storage_uri)
    • ...
  • The compromises we've made
    • The bypassed unit tests
    • sync_artifacts
    • ...
  • The TODOs
    • docs and FAQs to update

Also, before we merge the PR, it'd be good to run a few release tests to ensure it also works in a multi-node setting.

# Timestamp is used to create a unique session directory for the current
# training job. This is used to avoid conflicts when multiple training jobs
# run with the same name in the same cluster.
# This is set ONCE at the creation of the storage context, on the driver.
@woshiyyya (Member) Mar 5, 2024

Can you help me remember how we resolved the consistency issue with the timestamp? Did we resolve it by writing files into the driver directory and using background syncing?

@justinvyu (Contributor, Author)

See the first bullet point in the PR description under "Some undesirable workarounds made in this PR."

Comment on lines +459 to +460
# NOTE: The restored run should reuse the same driver staging directory.
self._storage._timestamp = trials[0].storage._timestamp
@justinvyu (Contributor, Author)

Workaround to make sure that a restored run uses the same timestamped staging dir.

@justinvyu (Contributor, Author) commented Mar 5, 2024

train_multinode_persistence release test passed here: https://buildkite.com/ray-project/release/builds/10131#018e0d55-eb73-42aa-abe9-e3ba603a61f5

I removed the tune_cloud_durable_upload test because it's basically a duplicate of the test above, but also relies on "local directory" implementation details. I need to add back a tune multi-node persistence test as a follow-up.
