-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core][state] Task Backend - reduce lock contention on debug stats / metric recording on counters. #32355
Conversation
Signed-off-by: rickyyx <[email protected]>
Can we get this reviewed and merged ASAP? This is on the critical path for a release test run and handoff to product. |
Fixing the tsan test. |
Together with changes from #32251 it is still rather slow on the Trying a batch size of 1k to see if it helps. |
Signed-off-by: rickyyx <[email protected]>
Signed-off-by: rickyyx <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good now, but for the maintainability perspective, I think we should make the whole class thread-safe (IMO partially thread-safe class like this is the worst for maintainability). E.g., if someone touches states within Stop
, OnJobFinished
, or the storage class, it can cause random issues that are difficult to debug.
So let's merge it first and make another PR to make the whole class thread-safe.
ping me when tests pass |
I guess making the whole class thread-safe is not hard. I could add a bunch of mutexes there. But yeah, will open in another PR for this. |
…metric recording on counters. (ray-project#32355) Signed-off-by: rickyyx <[email protected]> When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data. Global stats: 196276 total (5 active) Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s Execution time: mean = 295.864 us, total = 58.071 s Event stats: .... GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s .... This PR introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places. With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)
…metric recording on counters. (#32355) (#32384) Signed-off-by: rickyyx <[email protected]> When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data. Global stats: 196276 total (5 active) Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s Execution time: mean = 295.864 us, total = 58.071 s Event stats: .... GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s .... This PR introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places. With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread)
…metric recording on counters. (ray-project#32355) Signed-off-by: rickyyx <[email protected]> When GcsTaskManager is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data. Global stats: 196276 total (5 active) Queueing time: mean = 5.255 ms, max = 4.545 s, min = -0.000 s, total = 1031.389 s Execution time: mean = 295.864 us, total = 58.071 s Event stats: .... GCSServer.deadline_timer.debug_state_dump - 85 total (1 active), CPU time: mean = 521.750 ms, total = 44.349 s GCSServer.deadline_timer.debug_state_event_stats_print - 15 total (1 active, 1 running), CPU time: mean = 404.255 ms, total = 6.064 s .... This PR introduced a thread-safe wrapper on CounterMap, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places. With access to counters thread-safe now, we could also remove the mutex locks on the GcsTaskManagerStorage since it's now thread-safe (only accessed from its dedicated io thread) Signed-off-by: Edward Oakes <[email protected]>
Signed-off-by: rickyyx [email protected]
Why are these changes needed?
When
GcsTaskManager
is busy processing task events, it is not supposed to slow down the GCS. However, we previously have mutexes protecting some of the counter states. So the main io service/thread will get blocked when trying to acquire locks to print debug states + record metrics + add telemetry data.This PR
CounterMap
, such that modifying and reading various debug counters will have minimal lock contentions. Also merged the count by task type for telemetry into the counter map. This way, we will not need to acquire locks at various places.GcsTaskManagerStorage
since it's now thread-safe (only accessed from its dedicated io thread)Related issue number
Closes #32202
Checks
https://buildkite.com/ray-project/release-tests-pr/builds/27896#_
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.