[core][state] Task events backend - split drop count on worker #30953
Signed-off-by: rickyyx <[email protected]>
We don't have profile events yet, right?
Right - it will be part of porting the profile events PR later. Basically changing how the current
The code looks fine, but what is the motivation for splitting it out? Seems like we'd probably be dropping both if dropping any, not sure if there's a meaningful distinction to draw.
Also curious about ^. Could be useful if we have different buffers for profiling events and state change events, but iiuc that's not our plan?
I guess the motivation is mainly that we will likely have independent queries for profile events and for task status. So when users query for task status changes, we might want to tell them the number of task status changes dropped rather than the total number (including profile events). Does this make sense?
Yeah, this could be possible if we want to enforce a limit on each type of event in the future.
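To make the distinction discussed above concrete, here is a minimal Python sketch of a bounded buffer that attributes each drop to the kind of event that was evicted. It is not the worker's C++ implementation: `EventKind`, `BoundedEventBuffer`, and the evict-oldest policy are all assumptions chosen for illustration.

```python
from collections import deque
from enum import Enum


class EventKind(Enum):
    # Hypothetical event kinds mirroring the two discussed in the thread.
    STATUS = "status"
    PROFILE = "profile"


class BoundedEventBuffer:
    """Hypothetical bounded buffer that counts drops per event kind."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._events = deque()
        self.dropped = {kind: 0 for kind in EventKind}

    def add(self, kind, payload):
        if len(self._events) >= self._capacity:
            # Buffer is full: evict the oldest event and charge the drop
            # to the kind of the evicted event (an assumed policy).
            old_kind, _ = self._events.popleft()
            self.dropped[old_kind] += 1
        self._events.append((kind, payload))

    def flush(self):
        """Return buffered events and per-kind drop counts, then reset both."""
        events = list(self._events)
        drops = dict(self.dropped)
        self._events.clear()
        self.dropped = {kind: 0 for kind in EventKind}
        return events, drops
```

With separate counters, a query for task status changes can report only `drops[EventKind.STATUS]` instead of a combined total that also includes profile events.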
Cpp test failures
Looks like the test failure is due to this #30573 (comment)
…n] (#30979)

Previous PRs:
- [core][state] Task events backend: config and interface definitions [0/n] #30829: Interface and protobuf definitions.
- [core][state] Task events backend - split drop count on worker #30953: Splitting of the drop count by event type on the worker.
- [core][state] Task events backend - worker task event buffer implementation [1/n] #30867: TaskEventBuffer implementation.

In this PR:
- Added GcsTaskManager, which stores the task events on the GCS side. The GcsTaskManager has its own io service and io thread, separate from the main rpc thread/io_context. Handling of rpcs is posted to its own internal io_service.
- Implementation for the update path.
- Interface for the read path.

Next PRs:
- Implementation for the update path of GcsTaskManager
- Porting of profiling events
- Porting of state api task.
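The threading model described in that commit message (a component with its own io service and io thread, to which rpc handling is posted) can be sketched roughly as follows. This is a Python analogy under stated assumptions, not the GcsTaskManager code: `DedicatedExecutor`, `TaskEventStore`, and `handle_add_events` are hypothetical names, and a `queue.Queue` stands in for the io service.

```python
import queue
import threading


class DedicatedExecutor:
    """Runs posted callables on one private thread (stand-in for an io service)."""

    def __init__(self):
        self._tasks = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            fn = self._tasks.get()
            if fn is None:  # Sentinel posted by stop().
                break
            fn()

    def post(self, fn):
        """Schedule fn to run on the private thread, in FIFO order."""
        self._tasks.put(fn)

    def stop(self):
        """Run everything already posted, then shut the thread down."""
        self._tasks.put(None)
        self._thread.join()


class TaskEventStore:
    """Hypothetical store whose state is only touched on its own thread."""

    def __init__(self):
        self._executor = DedicatedExecutor()
        self._events = []

    def handle_add_events(self, events, on_done):
        # The caller (e.g. an rpc thread) returns immediately; the actual
        # work and the completion callback run on the private thread.
        def work():
            self._events.extend(events)
            on_done(len(self._events))

        self._executor.post(work)

    def stop(self):
        self._executor.stop()
```

Because every mutation of `_events` happens on the single private thread, the store needs no lock, and slow event handling does not block the thread that received the request.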
Signed-off-by: Weichen Xu <[email protected]>
**Previous PRs:**
- #30829:
- #30953:
- #30867:
- #30979:
- #30934
- #31207

**This PR:**

With the change,
- `list_tasks` now returns tasks with the attempt number as an additional column.
- `get_task` might return multiple task attempt entries if there are retries.

There is also some plumbing in the test and in core (esp in the test logic) given the changes.

Major changes in the PR are:
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
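The behavior described in that commit message (one row per task attempt, several rows for a retried task, and a limit on listings) can be illustrated with plain Python data. This sketch does not call the Ray state API; the record fields `task_id` and `attempt_number` and the helper names are assumptions made for the example.

```python
def list_task_attempts(records, limit):
    """Return at most `limit` rows; each row carries its attempt number."""
    return records[:limit]


def get_task_attempts(records, task_id):
    """Return every attempt recorded for one task, oldest attempt first."""
    matches = [r for r in records if r["task_id"] == task_id]
    return sorted(matches, key=lambda r: r["attempt_number"])


# Hypothetical records: task "t1" was retried once, task "t2" was not.
records = [
    {"task_id": "t1", "attempt_number": 1},
    {"task_id": "t2", "attempt_number": 0},
    {"task_id": "t1", "attempt_number": 0},
]
```

Here `get_task_attempts(records, "t1")` yields two entries, one per attempt, which is why a lookup by task ID can no longer assume a single result.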
Signed-off-by: tmynn <[email protected]>
…ect#31247)

**Previous PRs:**
- ray-project#30829:
- ray-project#30953:
- ray-project#30867:
- ray-project#30979:
- ray-project#30934
- ray-project#31207

**In This PR:**
- Remove old code for timeline/profiling backend.

Signed-off-by: tmynn <[email protected]>
Signed-off-by: rickyyx [email protected]
Why are these changes needed?
`rpc::TaskEventData` has non-deterministic order due to an intermediate merge with `flat_hash_map`.

Related issue number

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
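On the ordering issue noted under "Why are these changes needed?": when events pass through a hash map, a test cannot rely on the order in which they come out. One common remedy, sketched here in Python with hypothetical field names (`task_id`, `attempt`), is to normalize both sides before comparing. This illustrates the general technique and is not taken from the PR's tests.

```python
def normalize(events):
    """Sort events by a stable key so comparisons ignore arrival order."""
    return sorted(events, key=lambda e: (e["task_id"], e["attempt"]))


def same_events(actual, expected):
    """True if both lists hold the same events, regardless of order."""
    return normalize(actual) == normalize(expected)
```

Sorting on a key that uniquely identifies each event keeps the check strict about content (including duplicates) while tolerating any iteration order.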