
[CELEBORN-1634] implement queue time/processing time metrics for rpc framework #2784

Open · wants to merge 9 commits into main

Conversation

@ErikFang (Contributor) commented on Oct 7, 2024

What changes were proposed in this pull request?

Implement queue time and processing time metrics for the RPC framework.

Why are the changes needed?

To identify RPC processing bottlenecks.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested locally.
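
For context, here is a minimal sketch of how queue time and processing time can be recorded around a message dispatcher, assuming Dropwizard/Codahale histograms (which the count/min/mean/percentile output further down suggests). The class name RpcMetricsTracker does appear in the logs in this thread, but the record method, its signature, and the nanosecond bookkeeping below are illustrative assumptions rather than the actual patch.

```scala
import com.codahale.metrics.{Histogram, MetricRegistry}

// Sketch only: tracks how long each RPC message waits in the inbox (queue time)
// and how long its handler runs (processing time), per endpoint.
class RpcMetricsTracker(endpointName: String) {
  private val registry = new MetricRegistry()
  private val queueTime: Histogram = registry.histogram(s"${endpointName}_QueueTime")
  private val processTime: Histogram = registry.histogram(s"${endpointName}_ProcessTime")

  // enqueueTimeNs is stamped (System.nanoTime) when the message is posted to the inbox;
  // handle is the actual RPC handler invocation.
  def record(enqueueTimeNs: Long)(handle: => Unit): Unit = {
    val dequeueTimeNs = System.nanoTime()
    queueTime.update(dequeueTimeNs - enqueueTimeNs)        // time spent waiting in the queue
    handle
    processTime.update(System.nanoTime() - dequeueTimeNs)  // time spent processing the message
  }
}

// Hypothetical usage on a dispatcher thread:
//   val tracker = new RpcMetricsTracker("LifecycleManagerEndpoint")
//   tracker.record(message.enqueueTimeNs) { endpoint.receive(message) }
```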

@ErikFang ErikFang changed the title [CELEBORN-1634] implement queue time/processing time metrics for rpc framework [WIP][CELEBORN-1634] implement queue time/processing time metrics for rpc framework Oct 7, 2024
@ErikFang ErikFang changed the title [WIP][CELEBORN-1634] implement queue time/processing time metrics for rpc framework [CELEBORN-1634] implement queue time/processing time metrics for rpc framework Nov 6, 2024
@ErikFang (Contributor, Author) commented:

All comments have been addressed.

@ErikFang (Contributor, Author) commented:

24/11/12 19:43:05,899 INFO [ScalaTest-run-running-CelebornFetchFailureSuite] RpcMetricsTracker: RPC statistics for LifecycleManagerEndpoint
current queue size = 1
max queue length = 4
histogram for LifecycleManagerEndpoint RPC metrics: LifecycleManagerEndpoint_QueueTime
count: 24
min: 19792
mean: 147154.58333333334
p50: 135312.5
p75: 181541.5
p95: 334562.75
p99: 338792.0
max: 338792
histogram for LifecycleManagerEndpoint RPC metrics: PbRegisterShuffle
count: 3
min: 149326083
mean: 2.6625618E8
p50: 1.64685791E8
p75: 4.84756666E8
p95: 4.84756666E8
p99: 4.84756666E8
max: 484756666
histogram for LifecycleManagerEndpoint RPC metrics: LifecycleManagerEndpoint_ProcessTime
count: 24
min: 139041
mean: 5.2411515583333336E7
p50: 7733084.0
p75: 5.992785425E7
p95: 4.0866102025E8
p99: 4.84756666E8
max: 484756666
histogram for LifecycleManagerEndpoint RPC metrics: class org.apache.celeborn.common.protocol.message.ControlMessages$MapperEnd
count: 6
min: 249000
mean: 3067833.6666666665
p50: 1171312.5
p75: 7690146.5
p95: 7818959.0
p99: 7818959.0
max: 7818959
histogram for LifecycleManagerEndpoint RPC metrics: PbReportShuffleFetchFailure
count: 2
min: 11153666
mean: 1.12940205E7
p50: 1.12940205E7
p75: 1.1434375E7
p95: 1.1434375E7
p99: 1.1434375E7
max: 11434375
histogram for LifecycleManagerEndpoint RPC metrics: class org.apache.celeborn.common.protocol.message.ControlMessages$StageEnd
count: 3
min: 69858875
mean: 1.0791569433333333E8
p50: 7.3514125E7
p75: 1.80374083E8
p95: 1.80374083E8
p99: 1.80374083E8
max: 180374083
histogram for LifecycleManagerEndpoint RPC metrics: class org.apache.celeborn.common.protocol.message.ControlMessages$GetReducerFileGroup
count: 3
min: 929750
mean: 8965833.666666666
p50: 3748042.0
p75: 2.2219709E7
p95: 2.2219709E7
p99: 2.2219709E7
max: 22219709
histogram for LifecycleManagerEndpoint RPC metrics: PbGetShuffleId
count: 7
min: 139041
mean: 9638315.285714285
p50: 3027333.0
p75: 2.5389458E7
p95: 3.0134792E7
p99: 3.0134792E7
max: 30134792

24/11/12 19:43:05,907 INFO [celeborn-dispatcher-9] RpcMetricsTracker: RPC statistics for endpoint-verifier
current queue size = 1
max queue length = 4
histogram for endpoint-verifier RPC metrics: endpoint-verifier_ProcessTime
count: 1
min: 1202625
mean: 1202625.0
p50: 1202625.0
p75: 1202625.0
p95: 1202625.0
p99: 1202625.0
max: 1202625
histogram for endpoint-verifier RPC metrics: endpoint-verifier_QueueTime
count: 1
min: 3584
mean: 3584.0
p50: 3584.0
p75: 3584.0
p95: 3584.0
p99: 3584.0
max: 3584
histogram for endpoint-verifier RPC metrics: CheckExistence
count: 1
min: 1202625
mean: 1202625.0
p50: 1202625.0
p75: 1202625.0
p95: 1202625.0
p99: 1202625.0
max: 1202625

This is typical output from a LifecycleManager.

Originally, this feature was developed to monitor master RPC performance; we also found it very useful for identifying LifecycleManager RPC bottlenecks.
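
The numbers above appear to be nanosecond durations (for example, the 484756666 ProcessTime max is roughly 0.48 s). As a rough illustration of how such a per-endpoint report could be assembled from Dropwizard histogram snapshots, here is a hedged sketch; the dumpRpcStatistics helper and its exact formatting are assumptions, not necessarily what RpcMetricsTracker does in this PR.

```scala
import com.codahale.metrics.MetricRegistry
import scala.collection.JavaConverters._

object RpcStatsDump {
  // Walk every histogram registered for an endpoint and report count/min/mean/percentiles/max,
  // mirroring the shape of the log output above.
  def dumpRpcStatistics(endpointName: String, registry: MetricRegistry): String = {
    val sb = new StringBuilder(s"RPC statistics for $endpointName\n")
    registry.getHistograms.asScala.foreach { case (name, h) =>
      val s = h.getSnapshot
      sb.append(s"histogram for $endpointName RPC metrics: $name\n")
        .append(s"  count: ${h.getCount}\n")
        .append(s"  min: ${s.getMin}\n")
        .append(s"  mean: ${s.getMean}\n")
        .append(s"  p50: ${s.getMedian}\n")
        .append(s"  p75: ${s.get75thPercentile}\n")
        .append(s"  p95: ${s.get95thPercentile}\n")
        .append(s"  p99: ${s.get99thPercentile}\n")
        .append(s"  max: ${s.getMax}\n")
    }
    sb.toString
  }
}
```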

@cfmcgrady (Contributor) commented:

> we also found it very useful for identifying LifecycleManager RPC bottlenecks

A very nice and helpful feature; through this PR we also identified a performance bottleneck with MapperEnd RPC requests (issue #2905).
