Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][state] Task Backend - reduce lock contention on debug stats / metric recording on counters. #32355

Merged
merged 3 commits into from
Feb 9, 2023

Conversation

rickyyx
Copy link
Contributor

@rickyyx rickyyx commented Feb 9, 2023

Signed-off-by: rickyyx [email protected]

Why are these changes needed?

When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data.

Global stats: 196276 total (5 active)
Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
Execution time:  mean = 295.864 us, total = 58.071 s
Event stats:
....
        GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
        GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
....

This PR

  1. introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places.
  2. With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)

Related issue number

Closes #32202

Checks

https://buildkite.com/ray-project/release-tests-pr/builds/27896#_

image

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: rickyyx <[email protected]>
@cadedaniel
Copy link
Member

Can we get this reviewed and merged ASAP? This is on the critical path for a release test run and handoff to product.

@rickyyx rickyyx marked this pull request as ready for review February 9, 2023 17:56
@rickyyx rickyyx requested a review from a team as a code owner February 9, 2023 17:56
@rickyyx
Copy link
Contributor Author

rickyyx commented Feb 9, 2023

Fixing the tsan test.

@rickyyx
Copy link
Contributor Author

rickyyx commented Feb 9, 2023

Together with changes from #32251 it is still rather slow on the stress_test_many_tasks test if I increased the buffer size to 1M and keep the batch size as 10k.

Trying a batch size of 1k to see if it helps.

Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good now, but for the maintainability perspective, I think we should make the whole class thread-safe (IMO partially thread-safe class like this is the worst for maintainability). E.g., if someone touches states within Stop, OnJobFinished, or the storage class, it can cause random issues that are difficult to debug.

So let's merge it first and make another PR to make the whole class thread-safe.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Feb 9, 2023
@rkooo567
Copy link
Contributor

rkooo567 commented Feb 9, 2023

ping me when tests pass

@rickyyx
Copy link
Contributor Author

rickyyx commented Feb 9, 2023

I guess making the whole class thread-safe is not hard. I could add a bunch of mutexes there.

But yeah, will open in another PR for this.

@rkooo567 rkooo567 merged commit 2bbe8c1 into ray-project:master Feb 9, 2023
rickyyx added a commit to rickyyx/ray that referenced this pull request Feb 9, 2023
…metric recording on counters. (ray-project#32355)

Signed-off-by: rickyyx <[email protected]>

When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data.

Global stats: 196276 total (5 active)
Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
Execution time:  mean = 295.864 us, total = 58.071 s
Event stats:
....
        GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
        GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
....

This PR

introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places.
With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)
cadedaniel pushed a commit that referenced this pull request Feb 9, 2023
…metric recording on counters. (#32355) (#32384)

Signed-off-by: rickyyx <[email protected]>

When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data.

Global stats: 196276 total (5 active)
Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
Execution time:  mean = 295.864 us, total = 58.071 s
Event stats:
....
        GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
        GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
....

This PR

introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places.
With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…metric recording on counters. (ray-project#32355)

Signed-off-by: rickyyx <[email protected]>

When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data.

Global stats: 196276 total (5 active)
Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s
Execution time:  mean = 295.864 us, total = 58.071 s
Event stats:
....
        GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s
        GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s
....

This PR

introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places.
With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)

Signed-off-by: Edward Oakes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core/tune] Performance regression in tune_scalability_bookkeeping_overhead
4 participants