
[core][state] Microbench regression with task backend investigation #31546

Closed
rickyyx opened this issue Jan 9, 2023 · 5 comments
Assignees: rickyyx
Labels: observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), P1 (Issue that should be fixed within a few weeks), Ray 2.4

Comments

rickyyx commented Jan 9, 2023

What happened + What you expected to happen

[image: microbenchmark regression results]

Investigate further whether we can resolve the regression or identify the root cause (which function call / callsite is responsible).

Versions / Dependencies

master

Reproduction script

NA

Issue Severity

None

@rickyyx added the P0 (Issues that should be fixed in short order) and core (Issues that should be addressed in Ray Core) labels on Jan 9, 2023
@rickyyx added this to the Ray State Observability milestone on Jan 9, 2023
@rickyyx self-assigned this on Jan 9, 2023

rickyyx commented Jan 12, 2023

With the current microbenchmark setup, we start as many workers as possible, so the multiple threads on a single CoreWorker process (driver/worker) actually contend for physical CPUs.

With fewer num_cpus made available to the Ray cluster, the regression is much lower (<5%); the remaining overhead comes from the added work at task submission (the owner task is usually CPU-bound).

Also, given the special nature of the microbenchmark (a large number of no-op tasks), the owner is actually the bottleneck, which differs from many of the more realistic workloads, e.g. many_tasks / stress_test_many_tasks and many_actors / stress_test_many_actors.
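
For illustration only, here is a minimal sketch of this kind of no-op task microbenchmark, showing how capping num_cpus leaves spare cores for the owner. The num_cpus value and task count are arbitrary, and this is not the actual release benchmark script:

```python
import time

import ray

# Illustrative: giving the cluster fewer CPUs than the machine has leaves
# spare cores for the driver's (owner's) submission and task-event threads,
# so they contend less with the worker processes.
ray.init(num_cpus=4)


@ray.remote
def noop():
    # No-op task: the owner-side submission path dominates the cost.
    pass


NUM_TASKS = 10_000  # arbitrary task count
start = time.perf_counter()
ray.get([noop.remote() for _ in range(NUM_TASKS)])
print(f"{NUM_TASKS / (time.perf_counter() - start):.0f} no-op tasks/s")
```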

Could we close this, since the root cause is identified (thread contention among workers) and the impact on more realistic workloads is low?

cc @scv119 @rkooo567

@rkooo567

This sgtm. What do you think, @scv119?

@rkooo567

Actually, I am curious whether it would get better if we restricted the number of task events we can report per second. Maybe we can experiment with a 1000 tasks/s max batch after we merge the batching PR?
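
(Not the actual Ray task-event code path, just a sketch of the rate-limiting idea: cap how many task events get flushed per second and defer the rest to the next batch. The class name and the 1000 events/s budget here are illustrative only.)

```python
import time
from collections import deque


class TaskEventRateLimiter:
    """Illustrative sliding-window cap on task events reported per second."""

    def __init__(self, max_events_per_sec: int = 1000):
        self.max_events_per_sec = max_events_per_sec
        self._sent_timestamps = deque()

    def try_report(self, event) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the 1-second window.
        while self._sent_timestamps and now - self._sent_timestamps[0] > 1.0:
            self._sent_timestamps.popleft()
        if len(self._sent_timestamps) >= self.max_events_per_sec:
            # Over budget: keep the event buffered (or drop it) and retry on
            # the next flush, instead of adding more owner-side work now.
            return False
        self._sent_timestamps.append(now)
        # ... actually flush `event` to the aggregator/GCS here ...
        return True
```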

@richardliaw added the release-blocker (P0 Issue that blocks the release) label on Jan 26, 2023
@rickyyx added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) and release-blocker labels on Jan 26, 2023

rickyyx commented Jan 26, 2023

Not a release blocker - mainly needs some experimentation and further validation.

@scv119 added the observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) label on Feb 16, 2023
@rickyyx added the Ray 2.4 label and removed the core (Issues that should be addressed in Ray Core) label on Feb 22, 2023
@rickyyx closed this as completed on Apr 3, 2023