-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][state] Task events backend - gcs update path and interface [2/n] #30979
[core][state] Task events backend - gcs update path and interface [2/n] #30979
Conversation
Signed-off-by: rickyyx <[email protected]>
…-backend-gcs-update
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me have additional look later.
/// Returns the io_service. | ||
/// | ||
/// \return Reference to its io_service. | ||
instrumented_io_context &GetIoContext() { return io_service_; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const instrumented_io_context &GetIoContext() const { return io_service_; }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm - this needs to be an non-const reference since the caller takes a non-const reference.
} | ||
} | ||
|
||
absl::optional<rpc::TaskEvents> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider returning bool? I think the dropped counter can be tracked within storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the drop count could actually be tracked outside of the storage, such that the storage could be better decoupled w/o knowing the details of the events (be it status updates or profile events). But open to move the tracking in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice test! Consider adding (1) DebugString() to expose in debug state, and also (2) prometheus gauges for the count/bytes added / currently stored in the task stats storage.
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but consider splitting into separate metrics for reported, stored, dropped.
Signed-off-by: rickyyx <[email protected]>
…-backend-gcs-update
…n] (ray-project#30979) Previous PRs: [core][state] Task events backend: config and interface definitions [0/n] ray-project#30829: Interface and protobuf definitions. [core][state] Task events backend - split drop count on worker ray-project#30953: Splitting of drop count for various events type on worker. [core][state] Task events backend - worker task event buffer implementation [1/n] ray-project#30867: TaskEventBuffer implementation In this PR: Added GcsTaskManager that stores the task events on the GCS side. The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service. Implementation for the update path. Interface for the read path. Next PRs: Implementation for the update path of GcsTaskManager Porting of profiling events Porting of state api task. Signed-off-by: Weichen Xu <[email protected]>
**Previous PRs:** - #30829: - #30953: - #30867: - #30979: - #30934 - #31207 **This PR:** With the change, - `list_tasks` now will return tasks with attempt number as an additional column. - `get_task` might return multiple task attempt entries if there are retries. There is also some plumbing in the test and in core (esp in the test logic) given the changes. Major changes in the PR are: - Add limit support to `GcsTaskManager` - Change the state aggregator to get tasks from GCS.
…n] (#30979) Previous PRs: [core][state] Task events backend: config and interface definitions [0/n] #30829: Interface and protobuf definitions. [core][state] Task events backend - split drop count on worker #30953: Splitting of drop count for various events type on worker. [core][state] Task events backend - worker task event buffer implementation [1/n] #30867: TaskEventBuffer implementation In this PR: Added GcsTaskManager that stores the task events on the GCS side. The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service. Implementation for the update path. Interface for the read path. Next PRs: Implementation for the update path of GcsTaskManager Porting of profiling events Porting of state api task.
**Previous PRs:** - #30829: - #30953: - #30867: - #30979: - #30934 - #31207 **This PR:** With the change, - `list_tasks` now will return tasks with attempt number as an additional column. - `get_task` might return multiple task attempt entries if there are retries. There is also some plumbing in the test and in core (esp in the test logic) given the changes. Major changes in the PR are: - Add limit support to `GcsTaskManager` - Change the state aggregator to get tasks from GCS.
…n] (ray-project#30979) Previous PRs: [core][state] Task events backend: config and interface definitions [0/n] ray-project#30829: Interface and protobuf definitions. [core][state] Task events backend - split drop count on worker ray-project#30953: Splitting of drop count for various events type on worker. [core][state] Task events backend - worker task event buffer implementation [1/n] ray-project#30867: TaskEventBuffer implementation In this PR: Added GcsTaskManager that stores the task events on the GCS side. The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service. Implementation for the update path. Interface for the read path. Next PRs: Implementation for the update path of GcsTaskManager Porting of profiling events Porting of state api task. Signed-off-by: tmynn <[email protected]>
…ect#31247) **Previous PRs:** - ray-project#30829: - ray-project#30953: - ray-project#30867: - ray-project#30979: - ray-project#30934 - ray-project#31207 **In This PR:** - Remove old code for timeline/profiling backend. Signed-off-by: tmynn <[email protected]>
**Previous PRs:** - ray-project#30829: - ray-project#30953: - ray-project#30867: - ray-project#30979: - ray-project#30934 - ray-project#31207 **This PR:** With the change, - `list_tasks` now will return tasks with attempt number as an additional column. - `get_task` might return multiple task attempt entries if there are retries. There is also some plumbing in the test and in core (esp in the test logic) given the changes. Major changes in the PR are: - Add limit support to `GcsTaskManager` - Change the state aggregator to get tasks from GCS. Signed-off-by: tmynn <[email protected]>
Why are these changes needed?
Previous PRs:
TaskEventBuffer
implementationIn this PR:
GcsTaskManager
that stores the task events on the GCS side.GcsTsakManager
has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service.Next PRs:
GcsTaskManager
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.