[Train][Observability] Track Train Run Info with TrainStateActor #44585
Conversation
python/ray/train/_internal/stats.py
Outdated
def __init__(self) -> None:
    self.stats_actor = get_or_launch_stats_actor()

def register_train_run(
Trying to make sure all the pydantic-related code is encapsulated in TrainRunStatsManager, so that OSS users who are not using ray[default] will not get an error.
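A minimal sketch of that encapsulation idea (the schema import path is an assumption for illustration, not necessarily this PR's layout): the pydantic-backed schema is only imported when state tracking is actually used, so importing the trainer itself never pulls in pydantic.

class TrainRunStatsManager:
    """Sketch: keep all pydantic usage behind this class."""

    def __init__(self) -> None:
        self.stats_actor = get_or_launch_stats_actor()  # helper defined earlier in this file

    def register_train_run(self, **run_info) -> None:
        # Deferred import: pydantic (shipped via ray[default]) is only needed here.
        try:
            from ray.train._internal.state.schema import TrainRunInfo  # assumed path
        except ImportError:
            return  # ray[default] / pydantic not installed; skip state tracking
        self.stats_actor.register_train_run.remote(TrainRunInfo(**run_info))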
@@ -58,6 +59,7 @@ def setup(self, config):
    logdir=self._storage.trial_driver_staging_path,
    driver_ip=None,
    experiment_name=self._storage.experiment_dir_name,
    run_id=uuid.uuid4().hex,
Create a unique ID that differentiates each run (a small illustration follows this list):
- Cannot use the trial ID because trainer.restore will reuse it.
- Cannot use the job ID because there could be multiple train runs in one job.
- Cannot use trial ID + job ID because one can restore a run multiple times in one job.
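Purely as an illustration (not code from this PR): two attempts of the same restored trial share the trial ID and possibly the job ID, but each attempt gets a distinct run_id.

import uuid

run_id_first_attempt = uuid.uuid4().hex
run_id_after_restore = uuid.uuid4().hex  # restoring generates a fresh ID
assert run_id_first_attempt != run_id_after_restore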
Thanks, looking good!
python/ray/train/_internal/stats.py
Outdated
    trial_name: str,
    trainer_actor_id: str,
    datasets: Dict[str, Dataset],
    worker_group: WorkerGroup,
I do think it feels a bit weird to pass the WorkerGroup here, but I'm not sure if there is another cleaner way to organize it.
The consideration here is: when we do elastic/fault-tolerant training, we can avoid using an old WorkerGroup if we pass it through the function arguments.
    This manager class is created on the train controller layer for each run.
    """

    def __init__(self, worker_group: WorkerGroup) -> None:
I think it is actually better to pass it in register_train_run as before, so we don't keep unnecessary state.
What about having an API to update the worker group? We can re-register the worker group in the TrainRunStateManager when it's updated (e.g. for elastic training).
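A rough sketch of that idea (method names here are hypothetical, not part of this PR): the manager keeps no long-lived worker-group state and simply re-reports whenever the controller hands it a new group.

class TrainRunStateManager:
    def __init__(self, state_actor) -> None:
        self.state_actor = state_actor

    def register_train_run(self, run_id: str, worker_group) -> None:
        # Initial registration: snapshot the worker metadata at this point in time.
        self._report_worker_group(run_id, worker_group)

    def update_worker_group(self, run_id: str, worker_group) -> None:
        # Hypothetical hook for elastic training: re-register whenever the
        # controller rebuilds the worker group.
        self._report_worker_group(run_id, worker_group)

    def _report_worker_group(self, run_id: str, worker_group) -> None:
        # Placeholder: build the run/worker info from worker_group and push it
        # to the detached state actor.
        ...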
description="The key of the dataset dict specified in Ray Train Trainer." | ||
) | ||
plan_name: str = Field(description="The name of the internal dataset plan.") | ||
plan_uuid: str = Field(description="The uuid of the internal dataset plan.") |
- Let's name the above two "dataset_name" and "dataset_uuid"; we don't want to expose the concept of a plan.
- Also, I think we can prefix the train-level dataset name to the data-level dataset name, so it's easier to identify them on the data dashboard. For example:
dataset_key = "train"
dataset._set_name(dataset_key + "_" + dataset._name)
OK, I'll update it with dataset_name and dataset_uuid. For setting the prefix, we should probably do it when we initialize the trainer; I'll post another PR to do this.
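For reference, a minimal sketch of the renamed schema fields (other fields of TrainDatasetInfo are omitted; the descriptions are paraphrased):

from pydantic import BaseModel, Field

class TrainDatasetInfo(BaseModel):
    # Renamed from plan_name / plan_uuid so the internal "plan" concept
    # is not exposed to users.
    dataset_name: str = Field(description="The name of the dataset.")
    dataset_uuid: str = Field(description="The uuid of the dataset.")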
    namespace=TRAIN_STATE_ACTOR_NAMESPACE,
    get_if_exists=True,
    lifetime="detached",
    resources={"node:__internal_head__": 0.001},
Instead of forcing it onto the head node, it'd be better to force it onto the current node, because we may launch a job driver on a worker node. For example:
scheduling_strategy = NodeAffinitySchedulingStrategy(
    ray.get_runtime_context().get_node_id(),
    soft=False,
)
Oh, actually the StateActor tracks all Train runs; we don't want the dashboard to break because of a worker node crash. For example, suppose we have two train runs:
- run A on node 1
- run B on node 2
If we launch the StateActor on node 1 and node 1 dies, we will not be able to track run B.
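For comparison, a sketch of the two placements discussed here (the actor body is a stand-in; the head-node resource matches the diff above):

import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

@ray.remote(num_cpus=0)
class TrainStateActor:  # stand-in for the real actor in this PR
    pass

# Suggested above: pin to whichever node runs the driver. The actor then
# shares the fate of that (possibly worker) node.
driver_node_actor = TrainStateActor.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        ray.get_runtime_context().get_node_id(), soft=False
    )
).remote()

# Chosen here: keep the actor on the head node so that losing a worker node
# (e.g. node 1 hosting run A) does not also take down tracking for run B.
head_node_actor = TrainStateActor.options(
    resources={"node:__internal_head__": 0.001}
).remote()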
Good work! I just left some nits.
python/ray/train/tests/test_state.py
Outdated
os.environ["RAY_TRAIN_ENABLE_STATE_TRACKING"] = "0" | ||
e = BackendExecutor( | ||
backend_config=TestConfig(), num_workers=4, resources_per_worker={"GPU": 1} | ||
) | ||
e.start() |
Nit: we can just create a WorkerGroup directly instead of using BackendExecutor.
Changed to WorkerGroup. I was originally trying to also test the session initialization in this test, but since we already covered that in the end-to-end test, let's go with WorkerGroup instead.
    return state_actor


def get_state_actor():
def get_state_actor() -> Optional[ActorHandle[TrainStateActor]]:
Nice catch. But it seems that ActorHandle is not subscriptable since it's not a generic class. I've changed it to def get_state_actor() -> Optional[ActorHandle]:
python/ray/train/tests/test_state.py
Outdated
def test_state_manager(ray_start_gpu_cluster):
    os.environ["RAY_TRAIN_ENABLE_STATE_TRACKING"] = "0"
Nit: we don't need this one after removing BackendExecutor, but if we do set an env var we should use the monkeypatch.setenv(...) fixture so that it is only set within the test; otherwise it will spill over to the next one. We should also change the next test's env var setting to use monkeypatch.
Ah, got it! Good to see that after removing BackendExecutor, we don't need to set the env var anymore.
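For reference, a small sketch of the monkeypatch pattern mentioned above (the test name and body are placeholders):

def test_state_tracking_e2e(monkeypatch, ray_start_gpu_cluster):
    # monkeypatch restores the environment after the test, so the setting
    # cannot spill over into the next test.
    monkeypatch.setenv("RAY_TRAIN_ENABLE_STATE_TRACKING", "1")
    ...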
Why are these changes needed?
This PR adds a StateActor to collect Train run metadata for the Train Dashboard. The main components include:
- StatsActor: a detached actor that tracks the state of all train runs.
- TrainRunStatsManager: a manager class created on the train controller layer (BackendExecutor here) of each train run.
- Data schemas (TrainRunInfo, TrainWorkerInfo, TrainDatasetInfo): the pydantic models describing a run, its workers, and its datasets.

How to launch the StateActor?
We decided to launch the StateActor in the driver instead of the trainable (controller). Below are the reasons behind it:
Resource Limitation
Ray Tune sets placement_group_capture_child_tasks=True, which restricts the total resources a trainable can use (see ray/python/ray/air/execution/resources/placement_group.py, line 35 at 3606da8).
In Ray Train, the workers and the controller already use up all the resources specified in the ScalingConfig, so no additional actors can be launched inside TrainTrainable. Therefore, we cannot launch the TrainStateActor in the controller, since it requires "node:__internal_head__": 0.001.
Force Cleanup
Ray Tune will force-clean all the actors and sub-actors launched inside the trainable. Therefore, if we launched the detached actor in the trainable, it would be purged after trainer.fit finished.
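Putting this together, a sketch of the driver-side launch (the actor body and constant values are stand-ins for illustration; the options mirror the diff earlier in this thread):

import ray

TRAIN_STATE_ACTOR_NAME = "train_state_actor"         # assumed value
TRAIN_STATE_ACTOR_NAMESPACE = "_train_state_actor"   # assumed value

@ray.remote(num_cpus=0)
class TrainStateActor:  # stand-in for the real actor
    def __init__(self):
        self._runs = {}

    def register_train_run(self, run_info: dict):
        self._runs[run_info["run_id"]] = run_info

# Created from the driver, outside any Tune placement group, so it is neither
# constrained by placement_group_capture_child_tasks nor purged when the
# trainable is cleaned up after trainer.fit().
state_actor = TrainStateActor.options(
    name=TRAIN_STATE_ACTOR_NAME,
    namespace=TRAIN_STATE_ACTOR_NAMESPACE,
    get_if_exists=True,
    lifetime="detached",
    resources={"node:__internal_head__": 0.001},
).remote()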
Related issue number
Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.