
[core][state] Microbench regression with task backend investigation #31546

Closed
rickyyx opened this issue Jan 9, 2023 · 5 comments
Assignees: rickyyx
Labels: observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), P1 (Issue that should be fixed within a few weeks), Ray 2.4

Comments

rickyyx commented Jan 9, 2023

What happened + What you expected to happen

[image: microbenchmark regression results]

Investigate further whether we can resolve the regression or identify the root cause (which function call / callsite is responsible).

Versions / Dependencies

master

Reproduction script

NA

Issue Severity

None

@rickyyx added the P0 (Issues that should be fixed in short order) and core (Issues that should be addressed in Ray Core) labels on Jan 9, 2023
@rickyyx added this to the Ray State Observability milestone on Jan 9, 2023
@rickyyx self-assigned this on Jan 9, 2023

rickyyx commented Jan 12, 2023

With the current microbenchmark setup, we start as many workers as possible, so the multiple threads on a single CoreWorker process (driver/worker) actually contend for physical CPUs.

With fewer num_cpus made available to the Ray cluster, the regression is much lower (<5%); the remaining overhead comes from the added work at task submission (the owner task is usually CPU-bound).

Also, given the special nature of the microbenchmark (a large number of no-op tasks), the owner is actually the bottleneck, which differs from many of the more realistic workloads, e.g. many_tasks / stress_test_many_tasks and many_actors / stress_test_many_actors.
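
For illustration only, here is a minimal sketch of this kind of no-op task microbenchmark, showing how capping num_cpus leaves spare cores for the owner. The num_cpus value and task count are arbitrary, and this is not the actual release benchmark script:

```python
import time

import ray

# Illustrative: giving the cluster fewer CPUs than the machine has leaves
# spare cores for the driver's (owner's) submission and task-event threads,
# so they contend less with the worker processes.
ray.init(num_cpus=4)


@ray.remote
def noop():
    # No-op task: the owner-side submission path dominates the cost.
    pass


NUM_TASKS = 10_000  # arbitrary task count
start = time.perf_counter()
ray.get([noop.remote() for _ in range(NUM_TASKS)])
print(f"{NUM_TASKS / (time.perf_counter() - start):.0f} no-op tasks/s")
```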

Could we close this, since the root cause is identified (thread contention among workers) and the impact on more realistic workloads is low?

cc @scv119 @rkooo567

@rkooo567

This sgtm. What do you think, @scv119?

@rkooo567

Actually, I am curious whether it would get better if we restricted the number of task events we can report per second. Maybe we can experiment with a 1000 tasks/s max batch after we merge the batching PR?
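
(Not the actual Ray task-event code path, just a sketch of the rate-limiting idea: cap how many task events get flushed per second and defer the rest to the next batch. The class name and the 1000 events/s budget here are illustrative only.)

```python
import time
from collections import deque


class TaskEventRateLimiter:
    """Illustrative sliding-window cap on task events reported per second."""

    def __init__(self, max_events_per_sec: int = 1000):
        self.max_events_per_sec = max_events_per_sec
        self._sent_timestamps = deque()

    def try_report(self, event) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the 1-second window.
        while self._sent_timestamps and now - self._sent_timestamps[0] > 1.0:
            self._sent_timestamps.popleft()
        if len(self._sent_timestamps) >= self.max_events_per_sec:
            # Over budget: keep the event buffered (or drop it) and retry on
            # the next flush, instead of adding more owner-side work now.
            return False
        self._sent_timestamps.append(now)
        # ... actually flush `event` to the aggregator/GCS here ...
        return True
```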

@richardliaw added the release-blocker (P0 Issue that blocks the release) label on Jan 26, 2023
@rickyyx added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) and release-blocker labels on Jan 26, 2023

rickyyx commented Jan 26, 2023

Not a release blocker - mainly needs some experimentation and further validation.

@scv119 added the observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) label on Feb 16, 2023
@rickyyx added the Ray 2.4 label and removed the core (Issues that should be addressed in Ray Core) label on Feb 22, 2023
@rickyyx closed this as completed on Apr 3, 2023