
[train+tune] Local directory refactor (2/n): Separate driver artifacts and trial working directories #43403

Merged

126 commits
c157ad9
add util for getting ray train session tmp dir
justinvyu Feb 22, 2024
7891989
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 22, 2024
ad3b136
remove storage local path and introduce driver staging + working dirs
justinvyu Feb 22, 2024
d8b4800
update trial chdir to use trial_working_dir
justinvyu Feb 22, 2024
614fefa
rename experiment_local_path -> experiment_local_staging_path
justinvyu Feb 22, 2024
08d9892
rename trial_local_path -> trial_local_staging_path
justinvyu Feb 22, 2024
dd1f202
fix incorrect worker artifact sync dir
justinvyu Feb 22, 2024
aab939e
update syncer = None codepaths
justinvyu Feb 22, 2024
f981b6a
fix test_storage
justinvyu Feb 22, 2024
7d45920
fix cwd assert to use resolved path in test
justinvyu Feb 22, 2024
41ef191
storage_path default = ~/ray_results
justinvyu Feb 22, 2024
bf323fa
upload trainer pkl directly
justinvyu Feb 22, 2024
dee3249
upload tuner pkl directly
justinvyu Feb 22, 2024
bdd58b3
revert storage path default
justinvyu Feb 22, 2024
0bd1a58
fix optional storage path dependencies for now
justinvyu Feb 22, 2024
24a9fc0
remove todo
justinvyu Feb 22, 2024
90be9ca
small correction...
justinvyu Feb 23, 2024
197a29e
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
c18384a
remove ipdb
justinvyu Feb 23, 2024
4978c45
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
f55ace2
remove some hacks in test
justinvyu Feb 23, 2024
75ef6bd
upload exp state (with trial states) directly to cloud instead of wai…
justinvyu Feb 23, 2024
cab5e12
use converted trainable in tuner entrypoint
justinvyu Feb 23, 2024
09e0273
use non-optional run config
justinvyu Feb 23, 2024
c0d0ba0
remove local restoration test
justinvyu Feb 23, 2024
011ac92
keep base trainer and tuner exp dir name resolution consistent
justinvyu Feb 23, 2024
c3c03ac
add test case for restoration with default RunConfig(name)
justinvyu Feb 23, 2024
25cccb5
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 23, 2024
0a0e37c
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
2577b91
storage path = ~/ray_results by default
justinvyu Feb 23, 2024
46cbf02
override storage path with ray storage if set
justinvyu Feb 23, 2024
343224b
Fix lint
justinvyu Feb 23, 2024
b367413
centralize on storage context for path handling in the tuner/trainer …
justinvyu Feb 23, 2024
417a7da
fix errors caused by syncing being enabled to the same dir
justinvyu Feb 23, 2024
6b10a13
key concepts small fix
justinvyu Feb 23, 2024
c218937
separate exp folders for doc code
justinvyu Feb 23, 2024
805757b
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 23, 2024
231d637
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 23, 2024
063d3b2
no need to copy sync_config anymore
justinvyu Feb 23, 2024
f46dd84
different way to do default
justinvyu Feb 23, 2024
21dcd4c
fix test_new_persistence
justinvyu Feb 23, 2024
178d8ca
fix test_tuner_restore
justinvyu Feb 23, 2024
c1357e4
fix lint
justinvyu Feb 23, 2024
55e069a
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 26, 2024
39eaf81
fix trainer._save test usage
justinvyu Feb 26, 2024
75b80a6
use unique exp names for test
justinvyu Feb 27, 2024
876aad2
fix run config validation test
justinvyu Feb 27, 2024
85c2acd
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 27, 2024
1c4dfed
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 27, 2024
d509ff0
set storage path by default in tune.run path as well
justinvyu Feb 27, 2024
7d92e00
no need to test storage_path=None in unit test
justinvyu Feb 27, 2024
42d1378
fix circular import for default storage path const
justinvyu Feb 27, 2024
af68fe1
don't allow accessing ray train session dir outside of ray session
justinvyu Feb 27, 2024
676d5a3
fix test_session
justinvyu Feb 27, 2024
091f3bc
make session dir helper a private fn
justinvyu Feb 27, 2024
a411863
fix storage docstring example
justinvyu Feb 27, 2024
dafc828
[remove RAY_AIR_LOCAL_CACHE_DIR] test_storage
justinvyu Feb 27, 2024
fa61411
[remove RAY_AIR_LOCAL_CACHE_DIR] test_actor_reuse + test_result
justinvyu Feb 27, 2024
96c9a67
[remove RAY_AIR_LOCAL_CACHE_DIR] legacy local_dir test
justinvyu Feb 27, 2024
825cdb6
[remove RAY_AIR_LOCAL_CACHE_DIR] test_tune
justinvyu Feb 27, 2024
3727e3a
[remove RAY_AIR_LOCAL_CACHE_DIR] test_exp_analysis + test_api
justinvyu Feb 27, 2024
fce8a0a
add test for customizing train session dir
justinvyu Feb 27, 2024
b3610de
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 27, 2024
df25a79
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 27, 2024
43d53a0
remove tuner try catch that relies on get_exp_ckpt_dir
justinvyu Feb 27, 2024
533c807
add comment about storage context at the top entrypoint layers
justinvyu Feb 27, 2024
185f57c
fix bug where new storage filesystem is not used on restoration (the …
justinvyu Feb 27, 2024
8f63111
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 27, 2024
baeea52
pass in param space again on restore
justinvyu Feb 27, 2024
6f5277a
Merge branch 'master' of https://github.com/ray-project/ray into uplo…
justinvyu Feb 28, 2024
cdeaa44
fix test_errors
justinvyu Feb 28, 2024
d309990
add timestamp to ray train session dir so that each ray train job get…
justinvyu Feb 28, 2024
e4cf791
dump file and upload entire driver staging dir on exp checkpointing
justinvyu Feb 28, 2024
8fca047
fetch errors from storage and fallback to local staging
justinvyu Feb 28, 2024
559191d
fix trial working dir for class trainables
justinvyu Feb 28, 2024
c96c440
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 28, 2024
4324b50
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 28, 2024
bdd4864
fix can_restore test + remove unused monkeypatch
justinvyu Feb 28, 2024
13aabf4
fix lint
justinvyu Feb 28, 2024
38cfb0f
move find newest exp ckpt logic to exp state manager file
justinvyu Feb 29, 2024
0cfaa90
read exp state directly from storage + restore controller state befor…
justinvyu Feb 29, 2024
439b409
fix test_tuner
justinvyu Feb 29, 2024
560adbc
remove exp state manager resume
justinvyu Feb 29, 2024
1e58f46
remove env var usage from test_tune
justinvyu Feb 29, 2024
8c113b9
remove local_dir legacy test
justinvyu Feb 29, 2024
2ebad79
Merge branch 'upload_pkl_directly' into separate_driver_and_trial_art…
justinvyu Feb 29, 2024
4f03d7f
only fetch error file if the trial error file has been set
justinvyu Feb 29, 2024
f2c0f38
remove arbitrary length check on repr
justinvyu Feb 29, 2024
9bbede6
fix mlflow test
justinvyu Feb 29, 2024
dbf58aa
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
30bdc39
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
1573096
fix test_training_iterator
justinvyu Feb 29, 2024
c3076d9
fix test_tuner_resume_errored_only
justinvyu Feb 29, 2024
a51358f
ray init in test_function_api
justinvyu Feb 29, 2024
abd117f
fix test_api_ckpt_integration
justinvyu Feb 29, 2024
e7d8f9d
make driver staging dir during exp checkpoint if not created yet
justinvyu Feb 29, 2024
03d7ef7
fix a bunch of tests
justinvyu Feb 29, 2024
6cc5c75
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Feb 29, 2024
8a3430e
add test for customizing train session dir
justinvyu Feb 27, 2024
85ec91c
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
b1551c5
fix some tests in test_api
justinvyu Mar 1, 2024
6b98726
Merge branch 'cleanup_air_cache_dir' into separate_driver_and_trial_a…
justinvyu Mar 1, 2024
6007618
fix test_actor_caching
justinvyu Mar 1, 2024
182d069
fix test_actor_reuse
justinvyu Mar 1, 2024
3cb5fc3
patch call to get ray train session dir in mock storage context
justinvyu Mar 1, 2024
af8bf05
remove unneeded ray.init
justinvyu Mar 1, 2024
815cc0b
fix test_run_experiment
justinvyu Mar 1, 2024
a4a51f4
fix test_var
justinvyu Mar 1, 2024
a1543e0
fix pytest.skip -> pytest.mark.skip
justinvyu Mar 1, 2024
af3fd67
skip some tests
justinvyu Mar 1, 2024
69dcd35
revert test_training_iterator
justinvyu Mar 1, 2024
23277eb
fix tutorial
justinvyu Mar 1, 2024
24a779a
fix test_trial + remove delete_syncer option in utility
justinvyu Mar 1, 2024
87aa000
only pull error from storage
justinvyu Mar 1, 2024
5154421
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
a6fde03
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 1, 2024
e0125fa
fix merge error
justinvyu Mar 1, 2024
9260fa9
move ray storage handling to RunConfig
justinvyu Mar 1, 2024
bd2f36a
improve docstrings + mark storage context as developer api
justinvyu Mar 1, 2024
417bd43
improve storage path docstring
justinvyu Mar 1, 2024
3e59baa
add storage_fs docstring
justinvyu Mar 1, 2024
3b0b96a
rename
justinvyu Mar 1, 2024
ad0a63f
fix logdir -> trial working dir
justinvyu Mar 1, 2024
64126ba
fix lint
justinvyu Mar 1, 2024
e3991e7
Merge branch 'master' of https://github.com/ray-project/ray into sepa…
justinvyu Mar 5, 2024
83733dc
remove tune cloud durable
justinvyu Mar 5, 2024
23 changes: 18 additions & 5 deletions python/ray/air/config.py
@@ -16,6 +16,7 @@

import pyarrow.fs

from ray._private.storage import _get_storage_uri
from ray._private.thirdparty.tabulate.tabulate import tabulate
from ray.util.annotations import PublicAPI, Deprecated
from ray.widgets import Template, make_table_html_repr
@@ -581,10 +582,14 @@ class RunConfig:
Args:
name: Name of the trial or experiment. If not provided, will be deduced
from the Trainable.
storage_path: [Beta] Path to store results at. Can be a local directory or
a destination on cloud storage. If Ray storage is set up,
defaults to the storage location. Otherwise, this defaults to
the local ``~/ray_results`` directory.
storage_path: [Beta] Path where all results and checkpoints are persisted.
Can be a local directory or a destination on cloud storage.
For multi-node training/tuning runs, this must be set to a
shared storage location (e.g., S3, NFS).
This defaults to the local ``~/ray_results`` directory.
storage_filesystem: [Beta] A custom filesystem to use for storage.
If this is provided, `storage_path` should be a path with its
prefix stripped (e.g., `s3://bucket/path` -> `bucket/path`).
failure_config: Failure mode configuration.
checkpoint_config: Checkpointing configuration.
sync_config: Configuration object for syncing. See train.SyncConfig.
@@ -638,9 +643,17 @@ def __post_init__(self):
from ray.tune.experimental.output import AirVerbosity, get_air_verbosity

if self.storage_path is None:

self.storage_path = DEFAULT_STORAGE_PATH

# If no remote path is set, try to get Ray Storage URI
ray_storage_uri: Optional[str] = _get_storage_uri()
if ray_storage_uri is not None:
logger.info(
"Using configured Ray Storage URI as the `storage_path`: "
f"{ray_storage_uri}"
)
self.storage_path = ray_storage_uri

if not self.failure_config:
self.failure_config = FailureConfig()

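The `__post_init__` change above moves the Ray Storage URI fallback out of `StorageContext` and into `RunConfig`: an unset `storage_path` first receives the `~/ray_results` default, then is overridden by the cluster's Ray Storage URI if one is configured. A minimal sketch of that resolution order, with a hypothetical `get_ray_storage_uri` standing in for the private `_get_storage_uri` helper:

```python
import os
from typing import Optional

DEFAULT_STORAGE_PATH = os.path.expanduser("~/ray_results")


def get_ray_storage_uri() -> Optional[str]:
    # Hypothetical stand-in for ray._private.storage._get_storage_uri:
    # returns the cluster-wide Ray Storage URI if one was configured,
    # else None. Hard-coded here so the sketch is self-contained.
    return None


def resolve_storage_path(storage_path: Optional[str]) -> str:
    # Mirrors the fallback order in this PR's RunConfig.__post_init__:
    # 1. an explicitly passed storage_path always wins;
    # 2. otherwise the configured Ray Storage URI, if any;
    # 3. otherwise the local ~/ray_results default.
    if storage_path is not None:
        return storage_path
    ray_storage_uri = get_ray_storage_uri()
    if ray_storage_uri is not None:
        return ray_storage_uri
    return DEFAULT_STORAGE_PATH
```

This is a sketch of the control flow only; the real `RunConfig` also logs when the Ray Storage URI is picked up.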
73 changes: 38 additions & 35 deletions python/ray/train/_internal/storage.py
@@ -28,10 +28,10 @@
from pathlib import Path
from typing import TYPE_CHECKING, Callable, Dict, List, Optional, Tuple, Type, Union

from ray._private.storage import _get_storage_uri
from ray.air._internal.filelock import TempFileLock
from ray.train._internal.syncer import SyncConfig, Syncer, _BackgroundSyncer
from ray.train.constants import _get_ray_train_session_dir
from ray.util.annotations import DeveloperAPI

if TYPE_CHECKING:
from ray.train._checkpoint import Checkpoint
@@ -345,16 +345,22 @@ def _delete_command(self, uri: str) -> Tuple[Callable, Dict]:
return _delete_fs_path, dict(fs=self.storage_filesystem, fs_path=fs_path)


@DeveloperAPI
class StorageContext:
"""Shared context that holds all paths and storage utilities, passed along from
the driver to workers.
"""Shared context that holds the source of truth for all paths and
storage utilities, passed along from the driver to workers.

There are 2 types of paths:
This object defines a few types of paths:
1. *_fs_path: A path on the `storage_filesystem`. This is a regular path
which has been prefix-stripped by pyarrow.fs.FileSystem.from_uri and
can be joined with `Path(...).as_posix()`.
2. *_local_staging_path: The temporary path on the local filesystem where results
are saved to before persisting them to storage.
2. *_driver_staging_path: The temporary staging directory on the local filesystem
where driver artifacts are saved to before persisting them to storage.
3. trial_working_directory: The local filesystem path that the remote
actors' working directories are moved to by default.
This is separated from the driver staging path so that driver syncing
does not implicitly upload the trial working directory, for trials on the
driver node.

Example with storage_path="mock:///bucket/path?param=1":

@@ -370,12 +376,12 @@ class StorageContext:
<pyarrow._fs._MockFileSystem object...
>>> storage.experiment_fs_path
'bucket/path/exp_name'
>>> storage.experiment_local_staging_path # doctest: +ELLIPSIS
>>> storage.experiment_driver_staging_path # doctest: +ELLIPSIS
'/tmp/ray/session_.../artifacts/.../exp_name/driver_artifacts'
>>> storage.trial_dir_name = "trial_dir"
>>> storage.trial_fs_path
'bucket/path/exp_name/trial_dir'
>>> storage.trial_local_staging_path # doctest: +ELLIPSIS
>>> storage.trial_driver_staging_path # doctest: +ELLIPSIS
'/tmp/ray/session_.../artifacts/.../exp_name/driver_artifacts/trial_dir'
>>> storage.trial_working_directory # doctest: +ELLIPSIS
'/tmp/ray/session_.../artifacts/.../exp_name/working_dirs/trial_dir'
@@ -406,6 +412,10 @@ class StorageContext:
Path(storage.trial_fs_path, "subdir").as_posix(),
destination_filesystem=storage.filesystem
)

.. warning::
This is an experimental developer API and is subject to change
without notice between versions.
"""

def __init__(
@@ -421,15 +431,6 @@ def __init__(

self.custom_fs_provided = storage_filesystem is not None

# If no remote path is set, try to get Ray Storage URI
ray_storage_uri: Optional[str] = _get_storage_uri()
if ray_storage_uri is not None:
logger.info(
"Using configured Ray Storage URI as the `storage_path`: "
f"{ray_storage_uri}"
)
storage_path = ray_storage_uri

# Invariant: (`storage_filesystem`, `storage_path`) is the location where
# *all* results can be accessed.
self.experiment_dir_name = experiment_dir_name
@@ -451,6 +452,10 @@ def __init__(
self._create_validation_file()
self._check_validation_file()

# Timestamp is used to create a unique session directory for the current
# training job. This is used to avoid conflicts when multiple training jobs
# run with the same name in the same cluster.
# This is set ONCE at the creation of the storage context, on the driver.
@woshiyyya (Member) commented on Mar 5, 2024:

Can you help me remember how we resolved the consistency issue with the timestamp? Did we resolve it by writing files into the driver directory and using background syncing?

justinvyu (Contributor, Author) replied:

See the first bullet point in the PR description under "Some undesirable workarounds made in this PR."

self._timestamp = date_str()

def __str__(self):
@@ -591,8 +596,15 @@ def experiment_fs_path(self) -> str:
"""
return Path(self.storage_fs_path, self.experiment_dir_name).as_posix()

def _get_session_path(self) -> str:
"""The Ray Train/Tune session local directory used to stage files
before persisting to the storage filesystem."""
return Path(
_get_ray_train_session_dir(), self._timestamp, self.experiment_dir_name
).as_posix()

@property
def experiment_local_staging_path(self) -> str:
def experiment_driver_staging_path(self) -> str:
"""The local filesystem path of the experiment directory on the driver node.

The driver is the node where `Trainer.fit`/`Tuner.fit` is being called.
@@ -602,20 +614,15 @@ def experiment_local_staging_path(self) -> str:
<experiment_dir_name>/driver_artifacts`

This should be used as the temporary staging location for files *on the driver*
before syncing them to `(storage_filesystem, storage_path)`.
before syncing them to `experiment_fs_path`.
For example, the search algorithm should dump its state to this directory.
See `trial_staging_path` for files that should be written to trial folders.
See `trial_driver_staging_path` for writing trial-specific artifacts.

The directory is synced to
`{storage_path}/{experiment_dir_name}` periodically.
See `_ExperimentCheckpointManager.checkpoint` for where that happens.
"""
return Path(
_get_ray_train_session_dir(),
self._timestamp,
self.experiment_dir_name,
"driver_artifacts",
).as_posix()
return Path(self._get_session_path(), "driver_artifacts").as_posix()

@property
def trial_fs_path(self) -> str:
@@ -630,7 +637,7 @@ def trial_fs_path(self) -> str:
return Path(self.experiment_fs_path, self.trial_dir_name).as_posix()

@property
def trial_local_staging_path(self) -> str:
def trial_driver_staging_path(self) -> str:
"""The local filesystem path of the trial directory on the driver.

The driver is the node where `Trainer.fit`/`Tuner.fit` is being called.
@@ -640,17 +647,17 @@ def trial_local_staging_path(self) -> str:
<experiment_dir_name>/driver_artifacts/<trial_dir_name>`

This should be used as the temporary location for files on the driver
before syncing them to `(storage_filesystem, storage_path)`.
before persisting them to `trial_fs_path`.

For example, callbacks (e.g., JsonLoggerCallback) should write trial-specific
logfiles within this directory.
"""
if self.trial_dir_name is None:
raise RuntimeError(
"Should not access `trial_local_staging_path` "
"Should not access `trial_driver_staging_path` "
"without setting `trial_dir_name`"
)
return Path(self.experiment_local_staging_path, self.trial_dir_name).as_posix()
return Path(self.experiment_driver_staging_path, self.trial_dir_name).as_posix()

@property
def trial_working_directory(self) -> str:
@@ -672,11 +679,7 @@ def trial_working_directory(self) -> str:
"setting `trial_dir_name`"
)
return Path(
_get_ray_train_session_dir(),
self._timestamp,
self.experiment_dir_name,
"working_dirs",
self.trial_dir_name,
self._get_session_path(), "working_dirs", self.trial_dir_name
).as_posix()

@property
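The `StorageContext` changes above consolidate the session-local layout: driver staging artifacts and trial working directories now live side by side under one per-job session path (`<session dir>/<timestamp>/<exp_name>/{driver_artifacts,working_dirs}`). The path construction can be sketched with plain `pathlib` joins; the concrete `session_dir` and timestamp values below are illustrative, not the real Ray session path:

```python
from pathlib import Path


def build_paths(session_dir: str, timestamp: str, exp_name: str, trial_dir_name: str) -> dict:
    # Mirrors the layout in this PR: everything hangs off one session path,
    # so driver syncing of driver_artifacts never implicitly uploads the
    # sibling working_dirs tree.
    session_path = Path(session_dir, timestamp, exp_name)
    return {
        "experiment_driver_staging_path": (session_path / "driver_artifacts").as_posix(),
        "trial_driver_staging_path": (
            session_path / "driver_artifacts" / trial_dir_name
        ).as_posix(),
        "trial_working_directory": (
            session_path / "working_dirs" / trial_dir_name
        ).as_posix(),
    }


paths = build_paths(
    "/tmp/ray/session_xyz/artifacts", "2024-03-05_12-00-00", "exp_name", "trial_dir"
)
```

Using `Path(...).as_posix()` matches the convention in the diff, where all `*_fs_path` and staging paths are joined and normalized the same way.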
6 changes: 1 addition & 5 deletions python/ray/train/tests/test_result.py
@@ -65,11 +65,7 @@ def build_dummy_tuner(configs):

@pytest.mark.parametrize("storage", ["local", "remote"])
@pytest.mark.parametrize("mode", ["trainer", "tuner"])
def test_result_restore(
ray_start_4_cpus, monkeypatch, tmpdir, mock_s3_bucket_uri, storage, mode
):
monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", str(tmpdir / "ray_results"))

def test_result_restore(ray_start_4_cpus, tmpdir, mock_s3_bucket_uri, storage, mode):
NUM_ITERATIONS = 5
NUM_CHECKPOINTS = 3
if storage == "local":
20 changes: 0 additions & 20 deletions python/ray/train/tests/test_storage.py
@@ -34,13 +34,6 @@ def storage(request, tmp_path) -> StorageContext:
)


@pytest.fixture(autouse=True)
def local_path(tmp_path, monkeypatch):
local_dir = str(tmp_path / "ray_results")
monkeypatch.setenv("RAY_AIR_LOCAL_CACHE_DIR", local_dir)
yield local_dir


@pytest.fixture(autouse=True, scope="module")
def ray_init():
# NOTE: This is needed to set the `/tmp/ray/session_*` directory.
@@ -171,14 +164,6 @@ def test_persist_artifacts(storage: StorageContext):
trial_working_dir.joinpath("1.txt").touch()

storage.persist_artifacts()

if not storage.syncer:
# No syncing is needed -- pass early if storage_path == storage_local_path
assert _list_at_fs_path(storage.storage_filesystem, storage.trial_fs_path) == [
"1.txt"
]
return

storage.syncer.wait()

assert sorted(
Expand Down Expand Up @@ -206,11 +191,6 @@ def test_persist_artifacts(storage: StorageContext):

def test_persist_artifacts_failures(storage: StorageContext):
"""Tests `StorageContext.persist_artifacts` edge cases (empty directory)."""
if not storage.syncer:
# Should be a no-op if storage_path == storage_local_path (no syncing needed)
storage.persist_artifacts()
return

# Uploading before the trial directory has been created should fail
with pytest.raises(FileNotFoundError):
storage.persist_artifacts()