
feat(sn): add metrics for Append and Replicate RPCs #806

Merged 1 commit into main on Jun 15, 2024
Conversation

@ijsong (Member) commented on Jun 6, 2024

What this PR does

This pull request defines metrics for measuring RPCs such as Append and
Replicate. It introduces four metrics:

- `log_rpc.server.duration` measures the time spent processing inbound RPC calls
  in microseconds. It is very similar to the `rpc.server.duration` metric defined
  by OpenTelemetry, but ours also measures the processing time triggered by each
  call on a gRPC stream.
- `log_rpc.server.log_entry.size` measures the size of appended log entries. It
  is similar to the `rpc.server.request.size` metric, but ours measures the size
  of each log entry included in the appended batch.
- `log_rpc.server.batch.size` measures the size of appended log entry batches.
- `log_rpc.server.log_entries_per_batch` measures the number of log entries per
  appended batch.

These metrics are histograms, allowing us to compute percentiles and build heat
maps. Users can leverage them to analyze RPC durations, the distribution of log
entry sizes, and batch lengths, and thereby find configurations that optimize
storage node performance.
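Since these are histogram metrics, a percentile such as p90 can be estimated
from bucket counts by interpolating within the matching bucket. A minimal
self-contained sketch; the bucket boundaries and counts below are hypothetical,
not the storage node's actual configuration:

```go
package main

import "fmt"

// estimatePercentile estimates the p-th percentile (0 < p < 1) from
// per-bucket counts with upper bounds `bounds`, using linear
// interpolation inside the bucket that contains the target rank.
func estimatePercentile(bounds []float64, counts []uint64, p float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	target := p * float64(total)
	var cum uint64
	lower := 0.0
	for i, c := range counts {
		if c > 0 && float64(cum+c) >= target {
			// Interpolate within the bucket (lower, bounds[i]].
			frac := (target - float64(cum)) / float64(c)
			return lower + frac*(bounds[i]-lower)
		}
		cum += c
		lower = bounds[i]
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Hypothetical log_rpc.server.duration buckets, in microseconds.
	bounds := []float64{100, 500, 1000, 5000}
	counts := []uint64{50, 30, 15, 5}
	fmt.Printf("p90 ≈ %.0f us\n", estimatePercentile(bounds, counts, 0.9))
}
```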

Which issue(s) this PR resolves

Updates #795


codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 88.99083% with 12 lines in your changes missing coverage. Please review.

Please upload report for BASE (dockerfile_grpc_health_probe@5aae6ec). Learn more about missing BASE report.

Files Patch % Lines
internal/storagenode/telemetry/metrics.go 88.63% 5 Missing and 5 partials ⚠️
internal/storagenode/logstream/append.go 75.00% 2 Missing ⚠️
Additional details and impacted files
@@                       Coverage Diff                       @@
##             dockerfile_grpc_health_probe     #806   +/-   ##
===============================================================
  Coverage                                ?   68.46%           
===============================================================
  Files                                   ?      182           
  Lines                                   ?    17899           
  Branches                                ?        0           
===============================================================
  Hits                                    ?    12255           
  Misses                                  ?     4859           
  Partials                                ?      785           


ijsong added a commit that referenced this pull request Jun 8, 2024
Improve unmarshaling performance by reusing buffers for ReplicateRequest in the
backup replica.

The protobuf message `github.com/kakao/varlog/proto/snpb.(ReplicateRequest)` has
two slice fields—LLSN (`[]uint64`) and Data (`[][]byte`). The backup replica
receives replicated log entries from the primary replica via the gRPC service
`github.com/kakao/varlog/proto/snpb.(ReplicatorServer).Replicate`, which sends
`ReplicateRequest` messages.

Upon receiving a `ReplicateRequest`, the backup replica unmarshals the message,
which involves growing slices for fields such as LLSN and Data. This growth
causes copy overhead whenever the slice capacities need to expand.

To address this, we introduce a new method, `ResetReuse`, for reusing slices
instead of resetting them completely. The `ResetReuse` method shrinks the slice
lengths while preserving their capacities, thus avoiding the overhead of
reallocating memory.

Example implementation:

```go
type Message struct {
    Buffer []byte
    // Other fields
}

func (m *Message) Reset() {
    *m = Message{}
}

func (m *Message) ResetReuse() {
    s := m.Buffer[:0]
    *m = Message{}
    m.Buffer = s
}
```
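A small runnable demonstration of the difference (it re-declares the `Message`
type above so the snippet is self-contained): `ResetReuse` keeps the backing
array's capacity for the next unmarshal, while `Reset` releases it.

```go
package main

import "fmt"

type Message struct {
	Buffer []byte
	// Other fields
}

func (m *Message) Reset() {
	*m = Message{}
}

func (m *Message) ResetReuse() {
	s := m.Buffer[:0]
	*m = Message{}
	m.Buffer = s
}

func main() {
	m := &Message{Buffer: make([]byte, 0, 1024)}
	m.Buffer = append(m.Buffer, make([]byte, 512)...)

	m.ResetReuse()
	fmt.Println(len(m.Buffer), cap(m.Buffer)) // 0 1024: capacity preserved

	m.Buffer = append(m.Buffer, make([]byte, 512)...)
	m.Reset()
	fmt.Println(len(m.Buffer), cap(m.Buffer)) // 0 0: backing array released
}
```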

Risks:

This approach has potential downsides. Since the heap space consumed by the
slices is not reclaimed, the storage node's memory consumption may increase.
Currently, there is no mechanism to shrink the heap usage.

Additionally, this PR changes generated code, which a later run of the protobuf
compiler could revert, contrary to our intention. To catch that mistake, this PR
includes a unit test (`github.com/kakao/varlog/proto/snpb.TestReplicateRequest`)
that verifies the buffers backing the slices are reused.

Resolves: #795
See also: #806
@ijsong force-pushed the dockerfile_grpc_health_probe branch from ccee2ed to 5aae6ec on June 13, 2024
Base automatically changed from dockerfile_grpc_health_probe to main on June 15, 2024.
@ijsong merged commit 1925653 into main on Jun 15, 2024; 10 checks passed.
@ijsong deleted the rpc_metrics branch on June 15, 2024.