Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Task events backend - port profile events and turn on task backend [4/n] #31207

Merged
merged 58 commits into from
Dec 20, 2022

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Dec 19, 2022

Why are these changes needed?

Previous PRs:

In this PR:

  • New task backend on by default.
  • Changed the profiling event's generation in profiling.cc
    • Removed Profiler to use TaskEventsBuffer directly when emitting ProfileEvent
  • Changed front-end (ray.timeline) to get profiling events from TaskEvents in GCS.

In Future PRs:

  1. Remove legacy code for profiling events
  2. State API porting.

Related issue number

Checks

RAY_PROFILING=1 pytest python/ray/tests/test_advanced.py::test_profiling_api -vs

image

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
@rickyyx rickyyx changed the title [core][state] Task events backend - port profile events [4/n] [core][state] Task events backend - port profile events and turn on task backend [4/n] Dec 20, 2022
Signed-off-by: rickyyx <[email protected]>
@rickyyx
Copy link
Contributor Author

rickyyx commented Dec 20, 2022

Some selected benchmark results from the run: https://buildkite.com/ray-project/release-tests-pr/builds/24015#_

tldr: A few microbenchmark metrics have some regression from up to 15%, while most of the e2e tests see a negligible impact.

Metric Master PR
microbenchmark
single_client_tasks_sync 1252 1238
single_client_tasks_async 11180 10648
multi_client_tasks_async 33785 28934
1_1_actor_calls_sync 2155 1881
1_1_actor_calls_async 7061 6926
1_n_actor_calls_async 10689 10822
n_n_actor_calls_async 34513 30988
n_n_async_actor_calls_async 28600 25367
stress_many_tasks
stage_0_time 8.67 7.70
stage_1_avg_iteration_time 22.27 22.87
stage_2_avg_iteration_time 55.53 51.76
stage_3_creation_time 0.052 0.053
many_tasks
tasks_per_second 27.11 26.67
used_cpus_by_deadline 1759 1753
many_actors
actors_per_second 292.82 435.88
single_node
args_time 17.02 17.06
returns_time 6.11 6.38
get_time 25.07 25.47
queued_time 198 205
large_object_time 259 251
many_nodes
tasks_per_second 9.38 9.15

@rickyyx rickyyx added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Dec 20, 2022
@rickyyx
Copy link
Contributor Author

rickyyx commented Dec 20, 2022

Test failures unrelated (some python dependency issue + memory_monitor_test that's still flaky)

@ericl ericl merged commit 53f68cd into ray-project:master Dec 20, 2022
@ericl
Copy link
Contributor

ericl commented Dec 20, 2022

LGTM

@rickyyx rickyyx assigned rickyyx and unassigned ericl and rkooo567 Dec 20, 2022
ericl pushed a commit that referenced this pull request Dec 21, 2022
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207

**In This PR:** 
- Remove old code for timeline/profiling backend.
ericl pushed a commit that referenced this pull request Dec 23, 2022
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207

**In This PR:** 
- Remove old code for timeline/profiling backend.
AmeerHajAli pushed a commit that referenced this pull request Jan 12, 2023
**Previous PRs:**
 - #30829: 
 - #30953: 
 - #30867: 
 - #30979: 
 - #30934
 - #31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
**Previous PRs:**
 - ray-project#30829: 
 - ray-project#30953: 
 - ray-project#30867: 
 - ray-project#30979: 
 - ray-project#30934
 - ray-project#31207
**This PR:** 

With the change, 
- `list_tasks` now will return tasks with attempt number as an additional column. 
- `get_task` might return multiple task attempt entries if there are retries. 


There is also some plumbing in the test and in core (esp  in the test logic) given the changes. Major changes in the PR are: 
- Add limit support to `GcsTaskManager`
- Change the state aggregator to get tasks from GCS.

Signed-off-by: tmynn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants