Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Train][Observability] Track Train Run Info with
TrainStateActor
#44585[Train][Observability] Track Train Run Info with
TrainStateActor
#44585Changes from 12 commits
9997c2c
5141441
3f48f50
4a7657e
d9e69cb
dfe6e9d
7c87556
ce168b2
da9e812
3c868e2
5a86c59
e044776
7359531
c1ca02a
bab8c0b
ead3f62
7a27dc9
787dc09
8fad340
a9e47f7
86b40f9
b92d05f
5713d65
ac9b42b
22a0606
218713c
843d748
c627925
486ac2c
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the plan name if we have the UUID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to observability team, the
dataset_tag
is they primary key they use in Dashboard.Also, Data team are using
dataset_tag = f"{plan_name}_{plan_uuid}
as a unique id for a dataset.ray/python/ray/data/_internal/plan.py
Line 510 in 98e51b5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separating
plan_name
andplan_uuid
also makes it easy to support stream_split_op in the future.ray/python/ray/data/_internal/iterator/stream_split_iterator.py
Lines 119 to 124 in 98e51b5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we ask Ray Data team for a better way to get the tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consideration for elastic training: workers might get added and removed from failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. We can expose an API in
TrainStatsActor
to update the TrainRunInfo during training.For schema, we can add a
dead_workers
entry to track those evicted workers.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading over this, I realized there are no stats, and this points to a bunch of
Train...Info
objects. I wonder if there is a word that we can use instead ofStats
/Info
that captures/unifies these two.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, currently there's only static metadata stored in this actor, and in the future there will be stats.
What about
State
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trying to make sure all the pydantic related code is encapsulated in
TrainRunStatsManager
. So that OSS users who are not usingray[default]
will not get an error.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it feels a bit weird to pass the
WorkerGroup
here, but not sure if there is another cleaner way to organize it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The consideration here is: when we do elastic/fault-tolerant training, we can avoid using an old workergroup if we pass it through the function arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a unique ID that differentiate each run.
trainer.restore
will reuse trial id.