
[Train][Observability] Track Train Run Info with TrainStateActor #44585

Merged: 29 commits merged into ray-project:master on Apr 26, 2024

Conversation

@woshiyyya (Member) commented Apr 9, 2024

Why are these changes needed?

This PR adds a TrainStateActor to collect Train Run metadata for the Train Dashboard.

The main components include:

  • TrainStateActor:
    • A singleton, detached actor launched on the head node.
    • It receives metrics and metadata from all train runs.
    • The Train dashboard periodically retrieves information from this actor.
    • Currently it stores only static metadata, but it can be extended to support real-time metrics in the future.
  • TrainRunStatsManager:
    • A management class instantiated on the controller layer (BackendExecutor here) of each train run.
    • It aggregates information from each worker and collects data about the trainer.
  • Schemas (TrainRunInfo, TrainWorkerInfo, TrainDatasetInfo):
    • Pydantic schemas defined for the training run (see the sketch after this list).
    • These schemas are immutable; they are constructed and reported only once before training starts.
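A minimal sketch of what these immutable schemas could look like. The class names match the PR, but the specific fields and the frozen pydantic config are illustrative assumptions, not the exact definitions merged here:

from typing import List

from pydantic import BaseModel, Field


class _ImmutableSchema(BaseModel):
    class Config:
        # pydantic v1-style config: freeze instances so the reported
        # metadata cannot be mutated after it is constructed.
        allow_mutation = False


class TrainDatasetInfo(_ImmutableSchema):
    """Static metadata about one dataset passed to the Trainer (fields illustrative)."""

    name: str = Field(description="Key of the dataset dict passed to the Trainer.")
    dataset_uuid: str = Field(description="UUID of the underlying Ray Dataset.")


class TrainWorkerInfo(_ImmutableSchema):
    """Static metadata about one training worker (fields illustrative)."""

    actor_id: str = Field(description="Actor ID of the worker.")
    world_rank: int = Field(description="World rank of the worker.")
    node_ip: str = Field(description="IP address of the node the worker runs on.")


class TrainRunInfo(_ImmutableSchema):
    """Static metadata describing one train run, reported once before training starts."""

    id: str = Field(description="Unique run ID (uuid4 hex).")
    name: str = Field(description="Name of the train run.")
    workers: List[TrainWorkerInfo] = Field(description="Metadata of all workers.")
    datasets: List[TrainDatasetInfo] = Field(description="Metadata of all datasets.")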

How to launch the TrainStateActor?

We decided to launch the TrainStateActor in the driver instead of the trainable (controller), for the following reasons:

Resource Limitation
Ray Tune sets placement_group_capture_child_tasks=True, which restricts the total resources a trainable can use:

placement_group_capture_child_tasks=True,

In Ray Train, the workers and the controller already use up all of the resources specified in ScalingConfig, so no additional actors can be launched inside the TrainTrainable. Therefore, we cannot launch the TrainStateActor in the controller, since it requires "node:__internal_head__": 0.001.

Force Cleanup
Ray Tune force-cleans all actors and sub-actors launched inside the trainable. Therefore, if we launched the detached actor in the trainable, it would be purged after trainer.fit finishes.
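As a reference, here is a hedged sketch of the driver-side get-or-create pattern described above. The actor options mirror the snippets later in this thread; the helper name, the actor's methods, and the constants are illustrative assumptions rather than the exact code in this PR:

import ray
from ray.actor import ActorHandle

# Illustrative constants; the PR defines similar ones under ray.train._internal.
TRAIN_STATE_ACTOR_NAME = "train_state_actor"
TRAIN_STATE_ACTOR_NAMESPACE = "_train_state_actor"


@ray.remote(num_cpus=0)
class TrainStateActor:
    def __init__(self):
        # run_id -> static run metadata (TrainRunInfo).
        self._train_runs = {}

    def register_train_run(self, run_info) -> None:
        self._train_runs[run_info.id] = run_info

    def get_all_train_runs(self):
        return self._train_runs


def get_or_launch_state_actor() -> ActorHandle:
    # Launched from the driver so that Tune's placement-group capture and its
    # forced cleanup of trainable sub-actors cannot affect it.
    return TrainStateActor.options(
        name=TRAIN_STATE_ACTOR_NAME,
        namespace=TRAIN_STATE_ACTOR_NAMESPACE,
        get_if_exists=True,
        lifetime="detached",
        resources={"node:__internal_head__": 0.001},
    ).remote()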

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: woshiyyya <[email protected]>
@woshiyyya woshiyyya marked this pull request as draft April 11, 2024 01:02
def __init__(self) -> None:
self.stats_actor = get_or_launch_stats_actor()

def register_train_run(
@woshiyyya (Member, Author) commented Apr 16, 2024:
Trying to make sure all the pydantic-related code is encapsulated in TrainRunStatsManager, so that OSS users who are not using ray[default] will not get an error.
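For illustration, one way to achieve that encapsulation is to import the pydantic schemas lazily inside the manager, reusing the get_or_launch_stats_actor helper from the diff above. This is a sketch under that assumption; the module path, method signature, and fallback behavior are illustrative, not the exact code in this PR:

class TrainRunStatsManager:
    """Keeps all pydantic usage internal to this class."""

    def __init__(self) -> None:
        self.stats_actor = get_or_launch_stats_actor()

    def register_train_run(self, run_id: str, run_name: str) -> None:
        try:
            # Import lazily so that installs without the dashboard extras
            # (i.e. without ray[default] / pydantic) do not fail at import time.
            from ray.train._internal.schema import TrainRunInfo  # illustrative path
        except ImportError:
            # Silently skip state tracking when the dependencies are missing.
            return

        run_info = TrainRunInfo(id=run_id, name=run_name, workers=[], datasets=[])
        self.stats_actor.register_train_run.remote(run_info)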

@@ -58,6 +59,7 @@ def setup(self, config):
logdir=self._storage.trial_driver_staging_path,
driver_ip=None,
experiment_name=self._storage.experiment_dir_name,
run_id=uuid.uuid4().hex,
@woshiyyya (Member, Author) commented Apr 16, 2024:
Create a unique ID that differentiates each run:

  • We cannot use the trial ID, because trainer.restore will reuse the trial ID.
  • We cannot use the job ID, because there can be multiple train runs in one job.
  • We cannot use trial ID + job ID, because one run can be restored multiple times within one job.

@woshiyyya woshiyyya marked this pull request as ready for review April 16, 2024 00:51
@woshiyyya woshiyyya changed the title [Draft] Add Ray Train StateActor [Train][Observability] Train Dashboard Backend Apr 16, 2024
Signed-off-by: yunxuanx <[email protected]>
@justinvyu (Contributor) left a comment:
Thanks, looking good!

trial_name: str,
trainer_actor_id: str,
datasets: Dict[str, Dataset],
worker_group: WorkerGroup,
A reviewer (Contributor) commented:
I do think it feels a bit weird to pass the WorkerGroup here, but not sure if there is another cleaner way to organize it.

@woshiyyya (Member, Author) replied Apr 22, 2024:
The consideration here is: when we do elastic or fault-tolerant training, we can avoid using a stale worker group if we pass it in through the function arguments.
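A rough sketch of that pattern: the manager receives the current WorkerGroup on every call, so an elastic restart simply calls register_train_run again with the fresh group. The WorkerGroup.execute usage and the field gathering below are assumptions based on Ray Train internals, not the exact code in this PR:

from typing import Dict

import ray
from ray.data import Dataset
from ray.train._internal.worker_group import WorkerGroup


class TrainRunStateManager:
    def __init__(self, state_actor) -> None:
        self.state_actor = state_actor

    def register_train_run(
        self,
        run_id: str,
        run_name: str,
        trial_name: str,
        trainer_actor_id: str,
        datasets: Dict[str, Dataset],
        worker_group: WorkerGroup,
    ) -> None:
        # The WorkerGroup is passed per call instead of being cached on the
        # manager, so a restart with a new group never reads stale handles.
        worker_actor_ids = worker_group.execute(
            lambda: ray.get_runtime_context().get_actor_id()
        )
        # Build the immutable TrainRunInfo from the collected metadata and
        # report it once to the state actor (construction omitted here).
        ...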

@raulchen raulchen self-assigned this Apr 17, 2024
woshiyyya and others added 3 commits April 17, 2024 14:28
Co-authored-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
Signed-off-by: Yunxuan Xiao <[email protected]>
Signed-off-by: yunxuanx <[email protected]>
@woshiyyya woshiyyya changed the title [Train][Observability] Train Dashboard Backend [Train][Observability] Track Train Run Info with TrainStateActor Apr 17, 2024
Signed-off-by: yunxuanx <[email protected]>
This manager class is created on the train controller layer for each run.
"""

def __init__(self, worker_group: WorkerGroup) -> None:
A reviewer (Contributor) commented:

I think it is actually better to pass it in register_train_run as before so we don't keep unnecessary state.

@woshiyyya (Member, Author) replied Apr 22, 2024:

What about having an API to update the worker group? We can re-register the worker group in the TrainRunStateManager when it's updated (e.g., for elastic training).

description="The key of the dataset dict specified in Ray Train Trainer."
)
plan_name: str = Field(description="The name of the internal dataset plan.")
plan_uuid: str = Field(description="The uuid of the internal dataset plan.")
A reviewer (Contributor) commented:

  • Let's name the above two "dataset_name" and "dataset_uuid"; we don't want to expose the concept of a plan.
  • Also, I think we can prefix the Train-level dataset name onto the Data-level dataset name, so it's easier to identify them on the Data dashboard:
dataset_key = "train"
dataset._set_name(dataset_key + "_" + dataset._name)

@woshiyyya (Member, Author) replied:

OK. I'll update it with dataset_name and dataset_uuid.

For setting the prefix, we should probably do it when we initialize the trainer. I'll post another PR to do this.

namespace=TRAIN_STATE_ACTOR_NAMESPACE,
get_if_exists=True,
lifetime="detached",
resources={"node:__internal_head__": 0.001},
A reviewer (Contributor) commented:

Instead of forcing it onto the head node, it'd be better to force it onto the current node, because we may launch a job driver on a worker node:

        scheduling_strategy = NodeAffinitySchedulingStrategy(
            ray.get_runtime_context().get_node_id(),
            soft=False,
        )

@woshiyyya (Member, Author) replied Apr 22, 2024:

Oh, actually the StateActor tracks all Train Runs; we don't want the dashboard to break because of a worker node crash.

For example, suppose we have two train runs:

  • run A on node 1
  • run B on node 2

If we launch the StateActor on node 1 and node 1 dies, we will not be able to track run B.

Signed-off-by: yunxuanx <[email protected]>
@justinvyu (Contributor) left a comment:

Good work! I just left some nits.

Comment on lines 177 to 181
os.environ["RAY_TRAIN_ENABLE_STATE_TRACKING"] = "0"
e = BackendExecutor(
backend_config=TestConfig(), num_workers=4, resources_per_worker={"GPU": 1}
)
e.start()
A reviewer (Contributor) commented:

Nit: we can just create a WorkerGroup directly instead of using BackendExecutor.

@woshiyyya (Member, Author) replied:

Changed to WorkerGroup.

I was originally trying to also test the session initialization in this test, but since we already cover that in the end-to-end test, let's go with WorkerGroup instead.

return state_actor


def get_state_actor():
A reviewer (Contributor) commented:

def get_state_actor() -> Optional[ActorHandle[TrainStateActor]]:

@woshiyyya (Member, Author) replied:

Nice catch. But it seems that ActorHandle is not subscriptable, since it's not a generic class. I've changed it to def get_state_actor() -> Optional[ActorHandle]:



def test_state_manager(ray_start_gpu_cluster):
os.environ["RAY_TRAIN_ENABLE_STATE_TRACKING"] = "0"
A reviewer (Contributor) commented:

Nit: We don't need this one after removing BackendExecutor, but we should use the monkeypatch.setenv(...) fixture so that the env var is only set within the test.

Otherwise it will spill over to the next one. We should also change the next test's env var setting to use monkeypatch.
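For reference, a minimal example of the suggested fixture usage; the test name and body are illustrative, and ray_start_gpu_cluster is the fixture already used in this test file:

def test_state_tracking_enabled(monkeypatch, ray_start_gpu_cluster):
    # monkeypatch restores the environment after the test, so this setting
    # cannot spill over into the next test in the module.
    monkeypatch.setenv("RAY_TRAIN_ENABLE_STATE_TRACKING", "1")
    ...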

@woshiyyya (Member, Author) replied:

Ah, got it! Good to see that after removing BackendExecutor, we don't need to set the env var anymore.

Signed-off-by: yunxuanx <[email protected]>
@justinvyu justinvyu merged commit 18122ff into ray-project:master Apr 26, 2024
5 checks passed