
Aligning metrics with OTel semantic conventions #4482

Merged

merged 27 commits into release/8.0 on Oct 23, 2023

Conversation

@xakep139 (Contributor) commented Sep 27, 2023:

Fixes #4432

The final names are:

  1. Meter Microsoft.Extensions.Diagnostics.ResourceMonitoring:
  • Observable Gauge process.cpu.utilization, unit 1, values: 0.0 - 1.0
  • Observable Gauge dotnet.process.memory.virtual.utilization, unit 1, values: 0.0 - 1.0
  • Observable UpDownCounter system.network.connections, unit {connection}, tags/attributes:
    • network.transport = tcp
    • network.type = [ipv4, ipv6]
    • system.network.state = [close, close_wait, closing, delete, established, fin_wait_1, fin_wait_2, last_ack, listen, syn_recv, syn_sent, time_wait]
  2. Meter Microsoft.AspNetCore.HeaderParsing:
  • Counter aspnetcore.header_parsing.parse_errors, tags/attributes:
    • error.type
    • aspnetcore.header_parsing.header.name
  • Counter aspnetcore.header_parsing.cache_accesses, tags/attributes:
    • aspnetcore.header_parsing.header.name
    • aspnetcore.header_parsing.cache_access.type
  3. Meter Microsoft.Extensions.Diagnostics.HealthChecks:
  • Counter dotnet.health_check.reports, tags/attributes:
    • dotnet.health_check.status of type HealthStatus, [Degraded, Healthy, Unhealthy]
  • Counter dotnet.health_check.unhealthy_checks, tags/attributes:
    • dotnet.health_check.name
    • dotnet.health_check.status of type HealthStatus, [Degraded, Healthy, Unhealthy]
  4. Resilience tags for enrichment:
  • error.type
  • request.dependency.name - to correlate with HttpClient metrics
  • request.name - to correlate with HttpClient metrics
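
For context, instruments with the shapes listed above are typically declared through System.Diagnostics.Metrics. A minimal sketch, with placeholder samplers (the actual library reads OS counters and differs in detail):

using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Placeholder samplers; the real library computes these from OS counters.
static double GetCpuUtilization() => 0.42;
static long GetEstablishedTcpV4Connections() => 17;

var meter = new Meter("Microsoft.Extensions.Diagnostics.ResourceMonitoring");

// Observable gauge reporting CPU utilization in the range [0, 1], unit "1".
meter.CreateObservableGauge(
    "process.cpu.utilization",
    () => new Measurement<double>(GetCpuUtilization()),
    unit: "1");

// Observable up-down counter carrying the OTel network attributes listed above.
meter.CreateObservableUpDownCounter(
    "system.network.connections",
    () => new Measurement<long>(
        GetEstablishedTcpV4Connections(),
        new KeyValuePair<string, object?>("network.transport", "tcp"),
        new KeyValuePair<string, object?>("network.type", "ipv4"),
        new KeyValuePair<string, object?>("system.network.state", "established")),
    unit: "{connection}");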

@xakep139 xakep139 self-assigned this Sep 27, 2023
@xakep139 xakep139 linked an issue Sep 27, 2023 that may be closed by this pull request
@martintmk (Contributor) commented:

Can you also align the resilience tags?

https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.Resilience/Resilience/Internal/ResilienceTagNames.cs

@xakep139 (Contributor, Author) replied:

Surely I will, thanks for pointing that out.

@xakep139 xakep139 changed the base branch from main to release/8.0 October 4, 2023 13:36
@noahfalk (Member) commented Oct 5, 2023:

@lmolkova - are you interested in taking a look at these?

/// </remarks>
/// <seealso cref="System.Diagnostics.Metrics.Instrument"/>
public static class ResourceUtilizationCounters
{
    /// <summary>
-   /// Gets the CPU consumption of the running application in percentages.
+   /// Gets the CPU consumption of the running application in range <c>[0, 1]</c>.
@reyang commented Oct 5, 2023:
I have a step-back question regarding all utilizations measured as percentages.

Related discussions and PRs:

FYI @noahfalk - IIRC for garbage collector metrics we decided to use absolute values (e.g. total time spent in GC since the beginning of the process lifecycle) and to encourage users to derive utilization based on their actual needs (so the user can choose whatever sliding window and sampling frequency they want).

@noahfalk (Member) replied:

As I recall, this component has a sampling mechanism and sliding window built into it, and the app developer can configure some of those parameters or accept defaults. So when this component reports an instantaneous utilization measurement of 73%, it really means that over the last X minutes some aggregation of Y CPU usage samples had an average CPU utilization of 73%, for the app developer's chosen X and Y. It's not the pattern I'd expect most metrics to follow, as there is extra implementation complexity and it can't accurately be re-aggregated to compute other sliding windows later. However, because the app developer configures a sampling window of known fixed size, it doesn't suffer from the issues that those other metrics had, where they inferred a sample window from the polling frequency or from some unpredictable process behavior.

A while back I encouraged the engineers who were working on this to simplify and not maintain their own sliding window sampling reservoir, but I feel like what is there now is still well defined and likely useful as long as the sampling window is configured well up-front. I didn't object to this even though I expect metrics of this style will be an outlier in the long run. Having something like this shouldn't block us from later creating an alternate metric that emits absolute measures of CPU and memory usage with more aggregation flexibility.

@reyang do you feel like that addresses your concerns, or are you still worried here?
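
For concreteness, a minimal sketch of the fixed-window averaging idea described above (not the actual ResourceMonitoring implementation):

using System;
using System.Collections.Generic;

sealed class SlidingWindowAverage
{
    private readonly TimeSpan _window;
    private readonly Queue<(DateTimeOffset Time, double Value)> _samples = new();

    public SlidingWindowAverage(TimeSpan window) => _window = window;

    // Called on each sampling tick with an instantaneous CPU usage sample.
    public void Record(double value) =>
        _samples.Enqueue((DateTimeOffset.UtcNow, value));

    // Called when the gauge is observed: drop expired samples, average the rest.
    public double Observe()
    {
        var cutoff = DateTimeOffset.UtcNow - _window;
        while (_samples.Count > 0 && _samples.Peek().Time < cutoff)
        {
            _samples.Dequeue();
        }

        if (_samples.Count == 0)
        {
            return 0.0;
        }

        double sum = 0;
        foreach (var (_, value) in _samples)
        {
            sum += value;
        }

        return sum / _samples.Count;
    }
}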

@reyang replied Oct 5, 2023:

I'm not too concerned about having alternative metrics that customers can conveniently use - as long as there is clarity:

  1. What exactly does this metric mean? Refer to https://github.com/dotnet/extensions/pull/4482/files#r1347628433.
  2. What's the recommendation if there are multiple ways to get similar things - e.g. pros/cons, pitfalls, best practices. For example, https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.

@reyang continued:

  2. What's the recommendation if there are multiple ways to get similar things - e.g. pros/cons, pitfalls, best practices. For example, https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Process.

@noahfalk given OpenTelemetry.Instrumentation.Process is still in Preview, I think we should make a decision about the direction - e.g. all CPU/memory metrics will be exposed by Microsoft.Extensions.Diagnostics.ResourceMonitoring, and we will deprecate OpenTelemetry.Instrumentation.Process and point customers to Microsoft.Extensions.Diagnostics.ResourceMonitoring. Meanwhile we'll make sure Microsoft.Extensions.Diagnostics.ResourceMonitoring works for all supported versions of the runtime + operating systems. Does this make sense, and do you feel we can make a call before .NET 8 GA?

@Yun-Ting FYI since you've worked on OpenTelemetry.Instrumentation.Process.

@noahfalk (Member) replied:

My thought would be:
In .NET 8 - Point devs to these ResourceMonitoring metrics. Either deprecate OTel.Instrumentation.Process or adjust guidance so that we suggest it only as a fallback for devs who need the metric in a specific form not already handled by ResourceMonitoring.

In the future (.NET 9?) - Implement the core CPU + memory metrics currently in OTel.Instrumentation.Process inside System.Diagnostics.DiagnosticSource. I think metrics like process.cpu.time or process.memory.virtual are so commonly desired that we shouldn't be asking devs to add extra library dependencies or new background polling services (serviceCollection.AddResourceMonitoring()) in order to enable them. At that point ResourceMonitoring would serve two roles going forward: (1) support for networking metrics, which are a little more involved to configure, and (2) back-compat for projects already using the other process metrics defined here and wanting to continue doing so.

What does everyone think? 
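
For reference, a rough sketch of the wiring mentioned above; AddResourceMonitoring appears in this thread, while the OpenTelemetry registration pattern is an assumption based on the OpenTelemetry.Extensions.Hosting package:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using OpenTelemetry.Metrics;

var builder = Host.CreateApplicationBuilder(args);

// The background polling service discussed above.
builder.Services.AddResourceMonitoring();

// Subscribe an OTel meter provider to the library's meter.
builder.Services.AddOpenTelemetry().WithMetrics(metrics =>
    metrics.AddMeter("Microsoft.Extensions.Diagnostics.ResourceMonitoring"));

builder.Build().Run();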

@reyang replied:

My thought would be: In .NET 8 - Point devs to these ResourceMonitoring metrics.

Just confirming - point devs that are using any supported version of .NET (core/framework), not just limited to folks who use .NET 8?

@noahfalk (Member) replied Oct 6, 2023:

Yeah, I meant that as of November 2023 (or whenever this assembly releases as stable), all .NET devs should be able to use ResourceMonitoring to satisfy their need for CPU/virtual-memory metrics.

NOTE: After I wrote that, I saw that this library has a weirdly inconsistent approach to how it measures memory. I'm presuming that gets resolved; if it doesn't, I wouldn't feel comfortable recommending it.

src/Libraries/Microsoft.AspNetCore.HeaderParsing/Metric.cs (review thread outdated; resolved)
    public static partial HealthCheckReportCounter CreateHealthCheckReportCounter(Meter meter);

-   [Counter("name", "status", Name = @"R9\\HealthCheck\\UnhealthyHealthCheck")]
+   [Counter("health_check.name", "health.status", Name = "health_check.unhealthy_checks")]
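
For illustration, a hypothetical declaration and call site for a counter generated from such an attribute; the UnhealthyChecksCounter type, method name, and namespace below are assumptions, not code from this PR:

using System.Diagnostics.Metrics;
using Microsoft.Extensions.Diagnostics.Metrics; // assumed home of [Counter]

internal static partial class Metric
{
    // The source generator emits a strongly-typed counter; each positional tag
    // name is expected to become a parameter on the generated Add method.
    [Counter("health_check.name", "health.status", Name = "health_check.unhealthy_checks")]
    public static partial UnhealthyChecksCounter CreateUnhealthyChecksCounter(Meter meter);
}

// Hypothetical call site:
// var counter = Metric.CreateUnhealthyChecksCounter(meter);
// counter.Add(1, "sql-database", "Unhealthy");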
FYI @timmydo - a little while back you were looking for a counter on health checks. Perhaps this functionality matches what you wanted?


-   public const string FailureReason = "failure-reason";
+   public const string FailureReason = "resilience.failure.reason";


@martintmk (Contributor) commented Oct 16, 2023:

One note here. These are just additional tags on top of the ones built into Polly:
https://www.pollydocs.org/advanced/telemetry.html#metrics

One of the tags is exception.type, as defined by:
https://opentelemetry.io/docs/specs/otel/trace/exceptions/

There is an intersection here between exception.type and error.type, and I am wondering how we should handle this.

Just for context:

This is how HttpResponseMessage is translated into FailureResultContext:

And this is how exceptions are translated:

context.Tags.Add(new(ResilienceTagNames.FailureSource, e.Source));

Can we trim the FailureResultContext, or maybe just expose the following tags?

  • error.summary: the status code (such as 500) for HttpResponseMessage, or the exception summary for exceptions.
  • exception.source: applicable only for exceptions (or even don't add anything here)

@lmolkova replied:

Thank you for the context!

To the original comment: if Polly already reports exception.type, is there any reason to report resilience.failure.reason at all, since it's always an exception type?

Can we trim the FailureResultContext, or maybe just expose the following tags?

error.summary: the status code (such as 500) for HttpResponseMessage, or the exception summary for exceptions.
exception.source: applicable only for exceptions (or even don't add anything here)

I'd avoid adding new tags to OTel-defined namespaces - this is against OTel recommendations https://github.com/open-telemetry/opentelemetry-specification/blob/563958cb2bd8529990f19fdce7a5f3643bf63091/specification/common/attribute-naming.md?plain=1#L128

But I believe that if the exception summary has low cardinality, you can put it into the error.type attribute, which is intended to capture all kinds of error codes, with a fallback to the (full) exception type.

We can also introduce a dotnet.exception.source attribute and (hopefully) reuse it in other parts of .NET and in 3rd-party libraries.
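
A minimal sketch of that fallback rule, with a hypothetical helper name:

using System;
using System.Net.Http;

static string GetErrorType(object failure) => failure switch
{
    // Low-cardinality error code for HTTP results, e.g. "500".
    HttpResponseMessage response => ((int)response.StatusCode).ToString(),

    // Fall back to the (full) exception type name.
    Exception ex => ex.GetType().FullName ?? ex.GetType().Name,

    // OTel's conventional catch-all value for error.type.
    _ => "_OTHER",
};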

@martintmk (Contributor) commented Oct 21, 2023:

I agree with having just:

  • error.type - to represent both exception summary and the failure reason for results
  • dotnet.exception.source

And at initial release we won't add tags for the following properties?

  • FailureResultContext.Source
  • FailureResultContext.AdditionalInformation

Maybe we can slim the FailureResultContext down then, or even drop it completely, since Polly v8 supports registering custom enrichers and we can register an HTTP-based enricher with Polly directly, without additional layers?

cc @geeknoid
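
For context, a sketch of the "register an HTTP-based enricher with Polly directly" idea, based on the Polly v8 telemetry docs linked earlier; exact type and member names may differ:

using System.Net.Http;
using Polly;
using Polly.Telemetry;

sealed class HttpMeteringEnricher : MeteringEnricher
{
    public override void Enrich<TResult, TArgs>(in EnrichmentContext<TResult, TArgs> context)
    {
        // Tag failed HTTP responses with a low-cardinality error.type value.
        if (context.TelemetryEvent.Outcome is { } outcome
            && outcome.Result is HttpResponseMessage response
            && !response.IsSuccessStatusCode)
        {
            context.Tags.Add(new("error.type", ((int)response.StatusCode).ToString()));
        }
    }
}

// Registration sketch:
// var telemetry = new TelemetryOptions();
// telemetry.MeteringEnrichers.Add(new HttpMeteringEnricher());
// var pipeline = new ResiliencePipelineBuilder().ConfigureTelemetry(telemetry).Build();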

@martintmk (Contributor) replied:

PR with the changes is up:

#4600

@geeknoid (Member) commented:

@xakep139 Is this ready to go in?

@xakep139 (Contributor, Author) replied:

@xakep139 Is this ready to go in?

Not yet - I need to address the latest feedback that @lmolkova left.

xakep139 and others added 2 commits October 17, 2023 20:32
…ing/Windows/WindowsCounters.cs

Co-authored-by: Liudmila Molkova <[email protected]>
@xakep139 (Contributor, Author) commented:

@lmolkova please take a look at my latest changes; I addressed most of the issues after the discussion with @klauco and @noahfalk.

@lmolkova left a review comment:

The changes look good to me, but I believe we still need to confirm a few things:

  • do we expect the exception summarizer to produce low-cardinality summaries
  • do we still need the failure reason attribute
  • what exactly the CPU utilization metric reports, and whether it needs state attributes (as the OTel one has)


-   public const string DependencyName = "dep-name";
+   public const string DependencyName = "dotnet.resilience.dependency.name";
@martintmk (Contributor) commented:

Suggested change:
-   public const string DependencyName = "dotnet.resilience.dependency.name";
+   public const string DependencyName = "dotnet.dependency.name";

The idea behind adding these tags to resilience metrics was that we can correlate resilience events with the http metrics. In that sense the name of this tag should be the same as the tag used to track RequestName and DependencyName for HTTP scenarios.

Do we know how RequestMetadata.DependencyName and RequestMetadata.RequestName are represented in the OTel world?

@xakep139 (Contributor, Author) replied:

In the current HttpClient metrics there are no such dimensions: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-system-net#instrument-httpclientrequestduration
Nor does OTel specify them: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-metrics.md#metric-httpclientrequestduration
I agree we need to unify these dimensions/attributes if they will represent the same things in HttpClient metering (hopefully we'll add these bits eventually) and in the resilience libraries.

@martintmk (Contributor) replied:

Yup, correct - these are not strictly tied to resilience, so it would be better not to include it in the name, to avoid breaking changes in the future.

Ideally, this would be the same name as for HTTP metrics.

@xakep139 (Contributor, Author) commented Oct 21, 2023:

The changes look good to me, but I believe we still need to confirm a few things:

  • do we expect the exception summarizer to produce low-cardinality summaries
  • do we still need the failure reason attribute
  • what exactly the CPU utilization metric reports, and whether it needs state attributes (as the OTel one has)

@geeknoid can you please clarify the expected cardinality that comes from the exception summarizer?
@martintmk, looking at your comment #4482 (comment), are you saying we can replace dotnet.resilience.failure.source, dotnet.resilience.failure.reason and dotnet.resilience.failure.summary with error.type and dotnet.exception.source? Did I get it right?

Regarding CPU utilization: we agreed with folks (Matej, Noah, and you) that we won't cover it in this PR.

@martintmk (Contributor) replied:

The changes look good to me, but I believe we still need to confirm a few things:

  • do we expect the exception summarizer to produce low-cardinality summaries
  • do we still need the failure reason attribute
  • what exactly the CPU utilization metric reports, and whether it needs state attributes (as the OTel one has)

@geeknoid can you please clarify the expected cardinality that comes from the exception summarizer? @martintmk, looking at your comment #4482 (comment), are you saying we can replace dotnet.resilience.failure.source, dotnet.resilience.failure.reason and dotnet.resilience.failure.summary with error.type and dotnet.exception.source? Did I get it right?

Correct - this is also based on a recommendation by @lmolkova.

@geeknoid (Member) commented:

@xakep139 Yes, the point of the summarizer is specifically to produce low cardinality values devoid of privacy-sensitive information.
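
For reference, a sketch of how such a summary could feed a metric tag, assuming the IExceptionSummarizer abstraction from this repo's Microsoft.Extensions.Diagnostics.ExceptionSummarization library:

using System;
using Microsoft.Extensions.Diagnostics.ExceptionSummarization;

static string ToErrorTypeTag(IExceptionSummarizer summarizer, Exception ex)
{
    // Summaries are drawn from a bounded set of descriptions, so tag cardinality
    // stays low and no privacy-sensitive message content reaches metric backends.
    ExceptionSummary summary = summarizer.Summarize(ex);
    return summary.Description;
}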

@xakep139 (Contributor, Author) commented:

We agreed with @martintmk that I will apply the naming changes he made in #4600 to this PR; specifically, we keep these attributes:

  • request.dependency.name - neither Resilience- nor .NET-specific
  • request.name - neither Resilience- nor .NET-specific
  • error.type

Everything else is removed from Resilience tag names.

@lmolkova commented Oct 23, 2023:

request.dependency.name - neither Resilience nor .NET-specific
request.name - neither Resilience nor .NET-specific

I believe you take the request and dependency terms from Application Insights, correct?
Request or dependency names in this case are span names, and request.dependency.name is an oxymoron (a request is a server span and a dependency is a client span).

There is no direct translation from a span name to a metric tag; I'd say you'd have a metric with the same name as the span.

If this is not an option, then consider dotnet.resilience.operation.name or something else that would be specific to this library.
If you have a good suggestion for a generic attribute name, let's bring it up to the OTel spec.

@geeknoid geeknoid merged commit 02c8641 into release/8.0 Oct 23, 2023
6 checks passed
@geeknoid geeknoid deleted the xakep139/4432-metric-names-convention branch October 23, 2023 17:59
@davidfowl (Member) commented:

Excellent work!

@ghost ghost locked as resolved and limited conversation to collaborators Nov 30, 2023
Successfully merging this pull request may close these issues.

dotnet/extensions metric names to follow some convention