Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dotnet): export opentelemetry metrics #398

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

rhamzeh
Copy link
Member

@rhamzeh rhamzeh commented Jul 23, 2024

Description

Adds initial OpenTelemetry metrics to the .NET SDK

References

Generates: openfga/dotnet-sdk#69

Review Checklist

  • I have clicked on "allow edits by maintainers".
  • I have added documentation for new/changed functionality in this PR or in a PR to openfga.dev [Provide a link to any relevant PRs in the references section above]
  • The correct base branch is being used, if not main
  • I have added tests to validate that the change in functionality is working as expected

@rhamzeh rhamzeh marked this pull request as ready for review July 30, 2024 13:08
@rhamzeh rhamzeh requested a review from a team as a code owner July 30, 2024 13:08
Comment on lines 36 to 37
| `http.client.request.duration` | `int` | The total request time for FGA requests |
| `http.server.request.duration` | `int` | The amount of time the FGA server took to internally process nd evaluate the request |
Copy link

@Hawxy Hawxy Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth mentioning that these shouldn't be tags. Custom metrics ingest for most SaaS distributed tracing vendors charge based on the unique combinations of the metrics tags, so a latency distribution of 50-150ms would be result in "100 custom metrics" being charged, times the number of unique combinations of the other tags.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for heads up @Hawxy!

We removed these tags for now, but we're thinking of converting them to a bucket, something like:

0-5ms
5-15ms
15-50ms
50-150ms
150ms-500ms
500ms+

But your question raises a few other problems for the other attributes/tags

  • fga-client.user will be a problem for anyone with more than a couple of users
  • Model ID, Store ID, URL (Full) will be a problem for anyone who is working across many stores/models

We are thinking of making them Opt-in, do you have feedback on how this is normally approached?
We can add config on our side to Opt-in to them. But we were wondering if OTEL has standard tooling/config to opt-in/out of specific tags.

https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/docs/metrics/customizing-the-sdk/README.md#select-specific-tags seems to indicate that end users can choose to drop specific tags they don't need

Copy link

@Hawxy Hawxy Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we're thinking of converting them to a bucket, something like:

The question to ask is "are users going to slice metrics by these tags", which is the whole point of the tags to begin with. Keep in mind that these aren't tracing tags where you'd typically go ham with a pile of metadata to bake into individual traces. Does it make sense for a user to filter a metric graph of fga-client.request.duration by http.client.request.duration? I don't really think so, thus it's not a valuable tag to include. Is slicing by http.request.resend_count something people are going to do?

We are thinking of making them Opt-in, do you have feedback on how this is normally approached?

Some projects add a configuration option added to enable verbose tagging/metrics. Delegating to OTEL filtering is fine, albeit a bit clunky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about this, we're thinking of adding "fine-grained" config for the attributes/tags.

Something along the lines of:

var configuration = new ClientConfiguration() {
    ApiUrl = "http://localhost:8080",
    StoreId = "...",
    Credentials = new Credentials() { ... },
    Telemetry = new OpenFgaTelemetryConfig {
        Metrics: {
            [TelemetryHistograms.RequestDuration] = {
                Attributes: [Attributes.AttributeRequestMethod, Attributes.AttributeRequestStoreId]
            },
            [TelemetryCounters.TokenExchangeCountKey] = {
                Attributes: [Attributes.AttributeRequestModelId, Attributes.AttributeRequestClientId]
            },
        }
    }
};
var fgaClient = new OpenFgaClient(configuration);

If not set, we would enable a base set of metrics with minimal attributes, if configured, we follow whatever is configured. We will couple that with warnings in the OTEL config documentation around which attributes could be cost-prohibitive.

This allows folks to not enable this by accident (they'd have to manually opt-in), while giving them the ability to be able to have visibility on things like:

  • Whether a client id is sending a disproportionate amount of calls or request tokens (could be an indication that it was misconfigured - eg.g they are initializing the SDK multiple times causing a credential request per call)
  • Whether their new model is causing significantly more latency than the old one
  • Whether slow requests are due to retries (tracing helps here, but usually traces are sampled and people might miss this)
  • The ratio of success vs. bad requests vs rate limits by model id, store id or client id so folks can understand whether a particular client is being called incorrectly or a particular model is problematic
  • Understanding whether they have still old clients running that they need to upgrade and how that goes with the errors they are getting (through the user agent)

How does that sound to you?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that sounds pretty good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hawxy FYI - this has just been released in v0.5.1 - we're backporting those to the SDK Generator soon

Docs: https://github.com/openfga/dotnet-sdk/blob/main/OpenTelemetry.md

@rhamzeh rhamzeh force-pushed the feat/dotnet-telemetry-metrics branch from ec4051c to 0fcb161 Compare July 30, 2024 15:29
Considering that we already have histograms for them, and the const impact of
having attributes with high variance, we're dropping the following two
attributes:
- `http.client.request.duration`
- `http.server.request.duration`
@rhamzeh rhamzeh force-pushed the feat/dotnet-telemetry-metrics branch from 0fcb161 to cae5eb9 Compare July 30, 2024 18:55
Copy link
Member

@evansims evansims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants