
feat(sn): add metrics for Append and Replicate RPCs #806

Merged 1 commit into main on Jun 15, 2024
Conversation

@ijsong (Member) commented on Jun 6, 2024

What this PR does

This pull request defines metrics for measuring RPCs such as Append and
Replicate. It introduces four metrics:

- `log_rpc.server.duration` measures the time spent processing inbound RPC calls
  in microseconds. It is very similar to the `rpc.server.duration` metric defined
  by OpenTelemetry, but ours also measures the processing time triggered by each
  call on a gRPC stream.
- `log_rpc.server.log_entry.size` measures the size of appended log entries. It
  is similar to the `rpc.server.request.size` metric, but ours measures the size
  of each log entry included in the appended batch.
- `log_rpc.server.batch.size` measures the size of appended log entry batches.
- `log_rpc.server.log_entries_per_batch` measures the number of log entries per
  appended batch.

These metrics are histograms, allowing us to compute percentiles and build heat
maps. Users can leverage them to analyze RPC durations, the distribution of log
entry sizes, and batch lengths, and thereby find configurations that optimize
storage node performance.
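Since these are histogram metrics, a percentile such as p90 can be estimated
from bucket counts by interpolating within the matching bucket. A minimal
self-contained sketch; the bucket boundaries and counts below are hypothetical,
not the storage node's actual configuration:

```go
package main

import "fmt"

// estimatePercentile estimates the p-th percentile (0 < p < 1) from
// per-bucket counts with upper bounds `bounds`, using linear
// interpolation inside the bucket that contains the target rank.
func estimatePercentile(bounds []float64, counts []uint64, p float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	target := p * float64(total)
	var cum uint64
	lower := 0.0
	for i, c := range counts {
		if c > 0 && float64(cum+c) >= target {
			// Interpolate within the bucket (lower, bounds[i]].
			frac := (target - float64(cum)) / float64(c)
			return lower + frac*(bounds[i]-lower)
		}
		cum += c
		lower = bounds[i]
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Hypothetical log_rpc.server.duration buckets, in microseconds.
	bounds := []float64{100, 500, 1000, 5000}
	counts := []uint64{50, 30, 15, 5}
	fmt.Printf("p90 ≈ %.0f us\n", estimatePercentile(bounds, counts, 0.9))
}
```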

Which issue(s) this PR resolves

Updates #795


codecov bot commented Jun 6, 2024

Codecov Report

Attention: Patch coverage is 88.99083% with 12 lines in your changes missing coverage. Please review.

Please upload report for BASE (dockerfile_grpc_health_probe@5aae6ec). Learn more about missing BASE report.

Files Patch % Lines
internal/storagenode/telemetry/metrics.go 88.63% 5 Missing and 5 partials ⚠️
internal/storagenode/logstream/append.go 75.00% 2 Missing ⚠️
Additional details and impacted files
@@                       Coverage Diff                       @@
##             dockerfile_grpc_health_probe     #806   +/-   ##
===============================================================
  Coverage                                ?   68.46%           
===============================================================
  Files                                   ?      182           
  Lines                                   ?    17899           
  Branches                                ?        0           
===============================================================
  Hits                                    ?    12255           
  Misses                                  ?     4859           
  Partials                                ?      785           


ijsong added a commit that referenced this pull request Jun 8, 2024
Improve unmarshaling performance by reusing buffers for ReplicateRequest in the
backup replica.

The protobuf message `github.com/kakao/varlog/proto/snpb.(ReplicateRequest)` has
two slice fields—LLSN (`[]uint64`) and Data (`[][]byte`). The backup replica
receives replicated log entries from the primary replica via the gRPC service
`github.com/kakao/varlog/proto/snpb.(ReplicatorServer).Replicate`, which sends
`ReplicateRequest` messages.

Upon receiving a `ReplicateRequest`, the backup replica unmarshals the message,
which involves growing slices for fields such as LLSN and Data. This growth
causes copy overhead whenever the slice capacities need to expand.

To address this, we introduce a new method, `ResetReuse`, for reusing slices
instead of resetting them completely. The `ResetReuse` method shrinks the slice
lengths while preserving their capacities, thus avoiding the overhead of
reallocating memory.

Example implementation:

```go
type Message struct {
    Buffer []byte
    // Other fields
}

func (m *Message) Reset() {
    *m = Message{}
}

func (m *Message) ResetReuse() {
    s := m.Buffer[:0]
    *m = Message{}
    m.Buffer = s
}
```
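A small runnable demonstration of the difference (it re-declares the `Message`
type above so the snippet is self-contained): `ResetReuse` keeps the backing
array's capacity for the next unmarshal, while `Reset` releases it.

```go
package main

import "fmt"

type Message struct {
	Buffer []byte
	// Other fields
}

func (m *Message) Reset() {
	*m = Message{}
}

func (m *Message) ResetReuse() {
	s := m.Buffer[:0]
	*m = Message{}
	m.Buffer = s
}

func main() {
	m := &Message{Buffer: make([]byte, 0, 1024)}
	m.Buffer = append(m.Buffer, make([]byte, 512)...)

	m.ResetReuse()
	fmt.Println(len(m.Buffer), cap(m.Buffer)) // 0 1024: capacity preserved

	m.Buffer = append(m.Buffer, make([]byte, 512)...)
	m.Reset()
	fmt.Println(len(m.Buffer), cap(m.Buffer)) // 0 0: backing array released
}
```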

Risks:

This approach has potential downsides. Since the heap space consumed by the
slices is not reclaimed, the storage node's memory consumption may increase.
Currently, there is no mechanism to shrink the heap usage.

Additionally, this PR changes generated code, which a later run of the protobuf
compiler could revert, contrary to our intention. To catch that mistake, this PR
includes a unit test (`github.com/kakao/varlog/proto/snpb.TestReplicateRequest`)
that verifies the buffers backing the slices are reused.

Resolves: #795
See also: #806
@ijsong force-pushed the dockerfile_grpc_health_probe branch from ccee2ed to 5aae6ec on June 13, 2024
Base automatically changed from dockerfile_grpc_health_probe to main on June 15, 2024.
@ijsong merged commit 1925653 into main on Jun 15, 2024; 10 checks passed.
@ijsong deleted the rpc_metrics branch on June 15, 2024.