Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Task events backend - split drop count on worker #30953

Merged
merged 1 commit into from
Dec 8, 2022

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Dec 7, 2022

Signed-off-by: rickyyx [email protected]

Why are these changes needed?

  • We were reporting dropped task event count as a single value. When queried for a specific kind of events, e.g. profile events, the number of dropped events for status updates should not interfere the profile events results.
  • Fix testing: task events in exported rpc::TaskEventData' has non-deterministic order due to an intermediate merge with flat_hash_map.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <[email protected]>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have profile events yet right?

@rickyyx
Copy link
Contributor Author

rickyyx commented Dec 7, 2022

we don't have profile events yet right?

Right - it will be part of porting the profile events pr later. Basically changing how the current Profiler works.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks fine, but what is the motivation for splitting it out? Seems like we'd probably be dropping both if dropping any, not sure if there's a meaningful distinction to draw.

@rkooo567
Copy link
Contributor

rkooo567 commented Dec 8, 2022

also curious about ^. Could be useful if we have different buffers for profiling events and state change events, but iiuc that's not our plan?

@rickyyx
Copy link
Contributor Author

rickyyx commented Dec 8, 2022

The code looks fine, but what is the motivation for splitting it out? Seems like we'd probably be dropping both if dropping any, not sure if there's a meaningful distinction to draw.

I guess the motivation is mainly that assuming we will have independent queries for profile events, and for task status. So we might want to let users know the number of task status changes dropped rather than the total number (including profile events) when they query for task status changes. If this makes sense?

Could be useful if we have different buffers for profiling events and state change events, but iiuc that's not our plan?

Yeah, this could be possible if we want to enforce limit on each type of events in the future.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cpp test failures

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 8, 2022
@rickyyx rickyyx added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Dec 8, 2022
@rickyyx
Copy link
Contributor Author

rickyyx commented Dec 8, 2022

Looks like test failure is due to this #30573 (comment)

@rickyyx rickyyx removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 8, 2022
@ericl ericl merged commit 259b5cf into ray-project:master Dec 8, 2022
rkooo567 pushed a commit that referenced this pull request Dec 14, 2022
…n] (#30979)

Previous PRs:

[core][state] Task events backend: config and interface definitions [0/n]  #30829: Interface and protobuf definitions.
[core][state] Task events backend - split drop count on worker #30953: Splitting of drop count for various events type on worker.
[core][state] Task events backend - worker task event buffer implementation [1/n] #30867: TaskEventBuffer implementation
In this PR:

Added GcsTaskManager that stores the task events on the GCS side.
The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service.
Implementation for the update path.
Interface for the read path.
Next PRs:

Implementation for the update path of GcsTaskManager
Porting of profiling events
Porting of state api task.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
…n] (ray-project#30979)

Previous PRs:

[core][state] Task events backend: config and interface definitions [0/n]  ray-project#30829: Interface and protobuf definitions.
[core][state] Task events backend - split drop count on worker ray-project#30953: Splitting of drop count for various events type on worker.
[core][state] Task events backend - worker task event buffer implementation [1/n] ray-project#30867: TaskEventBuffer implementation
In this PR:

Added GcsTaskManager that stores the task events on the GCS side.
The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service.
Implementation for the update path.
Interface for the read path.
Next PRs:

Implementation for the update path of GcsTaskManager
Porting of profiling events
Porting of state api task.

Signed-off-by: Weichen Xu <[email protected]>
ericl pushed a commit that referenced this pull request Dec 21, 2022
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207

**In This PR:** 
- Remove old code for timeline/profiling backend.
ericl pushed a commit that referenced this pull request Dec 23, 2022
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
…n] (#30979)

Previous PRs:

[core][state] Task events backend: config and interface definitions [0/n]  #30829: Interface and protobuf definitions.
[core][state] Task events backend - split drop count on worker #30953: Splitting of drop count for various events type on worker.
[core][state] Task events backend - worker task event buffer implementation [1/n] #30867: TaskEventBuffer implementation
In this PR:

Added GcsTaskManager that stores the task events on the GCS side.
The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service.
Implementation for the update path.
Interface for the read path.
Next PRs:

Implementation for the update path of GcsTaskManager
Porting of profiling events
Porting of state api task.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207

**In This PR:** 
- Remove old code for timeline/profiling backend.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
…n] (ray-project#30979)

Previous PRs:

[core][state] Task events backend: config and interface definitions [0/n]  ray-project#30829: Interface and protobuf definitions.
[core][state] Task events backend - split drop count on worker ray-project#30953: Splitting of drop count for various events type on worker.
[core][state] Task events backend - worker task event buffer implementation [1/n] ray-project#30867: TaskEventBuffer implementation
In this PR:

Added GcsTaskManager that stores the task events on the GCS side.
The GcsTsakManager has its own io service and io thread that's separated from the main rpc thread/io_context. Handling of rpcs will be posted to its own internal io_service.
Implementation for the update path.
Interface for the read path.
Next PRs:

Implementation for the update path of GcsTaskManager
Porting of profiling events
Porting of state api task.

Signed-off-by: tmynn <[email protected]>
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
**Previous PRs:**
 - ray-project#30829: 
 - ray-project#30953: 
 - ray-project#30867: 
 - ray-project#30979: 
 - ray-project#30934
 - ray-project#31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.

Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants