Runtime instrumentation #207

Merged
merged 20 commits into from
Mar 3, 2022
Conversation

@twenzel (Contributor) commented Feb 16, 2022

This PR implements Runtime instrumentation (#204).

It creates instrumentation for all runtime metrics unless configured otherwise.

All information is retrieved directly (similar to RuntimeEventSource).
Metrics are only created for values that can be retrieved via the runtime API.

@twenzel twenzel requested a review from a team February 16, 2022 22:58
public AssembliesInstrumentation(Meter meter)
{
    this.meter = meter;
    this.assemblyCounter = meter.CreateObservableCounter("assembly-count", () => AppDomain.CurrentDomain.GetAssemblies().Length, description: "Number of Assemblies Loaded");
Member

I think the names for the metrics in this PR should be prefixed with the process.runtime.dotnet. namespace. In this case something like process.runtime.dotnet.assembly_count (using an _ rather than - since I think that's more commonly what I've seen in other semantic conventions).

The spec has some general guidance here for runtime metrics: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#runtime-environment-specific-metrics---processruntimeenvironment

Contributor Author

Thank you. Names are now prefixed.
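A minimal sketch of what the prefixed instrument could look like (the `RuntimeMetricNames` class is a hypothetical name for illustration; the prefix follows the `process.runtime.dotnet.` convention discussed above):

```csharp
using System;
using System.Diagnostics.Metrics;

internal static class RuntimeMetricNames
{
    // Namespace prefix per the OpenTelemetry runtime-metrics guidance.
    internal const string MetricPrefix = "process.runtime.dotnet.";
}

internal sealed class AssembliesInstrumentation
{
    private readonly ObservableCounter<int> assemblyCounter;

    public AssembliesInstrumentation(Meter meter)
    {
        // "assembly_count" uses '_' as the in-name separator, '.' for namespacing.
        this.assemblyCounter = meter.CreateObservableCounter(
            $"{RuntimeMetricNames.MetricPrefix}assembly_count",
            () => AppDomain.CurrentDomain.GetAssemblies().Length,
            description: "Number of Assemblies Loaded");
    }
}
```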

this.meter = meter;
this.eventCounterStore = eventCounterStore;

this.eventCounterStore.Subscribe(CpuTimeCounterName, EventCounterType.Mean);
Member

The instrumentation classes should probably implement IDisposable and clean up subscriptions in the event the MeterProvider is shutdown.

Contributor Author

Great hint. Subscription cleanup is now done in Dispose.
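A hedged sketch of that cleanup pattern; `IEventCounterStore`, `EventCounterType`, and the `Unsubscribe` method are illustrative stand-ins for the PR's actual types:

```csharp
using System;
using System.Diagnostics.Metrics;

internal enum EventCounterType { Mean, Sum }

internal interface IEventCounterStore
{
    void Subscribe(string name, EventCounterType type);
    void Unsubscribe(string name);
}

internal sealed class PerformanceInstrumentation : IDisposable
{
    private const string CpuTimeCounterName = "cpu-usage";
    private readonly IEventCounterStore eventCounterStore;

    public PerformanceInstrumentation(Meter meter, IEventCounterStore eventCounterStore)
    {
        this.eventCounterStore = eventCounterStore;
        this.eventCounterStore.Subscribe(CpuTimeCounterName, EventCounterType.Mean);
    }

    // Invoked when the MeterProvider shuts down, so no stale subscriptions remain.
    public void Dispose() => this.eventCounterStore.Unsubscribe(CpuTimeCounterName);
}
```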

try
{
    this.refreshIntervalDictionary = new Dictionary<string, string>();
    this.refreshIntervalDictionary.Add("EventCounterIntervalSec", "1");


@twenzel this is doing exactly the same as #177 as far as "configure the polling schedule" goes - except it's hardcoded to 1 second. This PR makes this specific to runtime metrics, as opposed to any available event counters, including what other libraries/customers might already have in place. @reyang is this what you had in mind for #207 - to restrict to runtime metrics?

Contributor Author

This PR is dedicated to the runtime metrics and uses the API directly to retrieve the values.
Unfortunately sometimes the API is internal, so we have to fall back to the EventCounter.
The advantage of this instrumentation is that it can use the direct API as soon as the runtime makes it public; it will perform better than retrieving the values from event counters, with less overhead.
The EventCounter instrumentation from #177 is a general-purpose implementation to retrieve any existing EventCounter.

Member

I think ideally we want this "Runtime Instrumentation" to use the OpenTelemetry Metrics API rather than relying on the EventCounter. The main reason is efficiency and accuracy - especially considering the Prometheus scenario, where using an asynchronous instrument will make sure we only invoke the callback when Prometheus is scraping metrics, rather than having to poll the EventCounter on some predetermined schedule.

If there are cases where we cannot use OpenTelemetry Metrics API (e.g. there is no proper API - such as ObservableUpDownCounter) or we cannot access the internal functionality of .NET runtime (e.g. if a certain EventCounter is implemented by calling some non-public API), we should discuss and see if we can find better approaches.

Member

> Unfortunately sometimes the API is internal, so we have to fall back to the EventCounter.

I suggest that we call out the list of APIs, and we should work with the .NET Runtime team to see if it makes sense to make these public.

@reyang (Member) commented Feb 24, 2022

@hananiel the goal for #204 is to have a pure OpenTelemetry Metrics API based instrumentation library for the .NET runtime - one that does not rely on EventCounter at all (and, as a consequence, provides better efficiency, sampling accuracy and probably a better model - e.g. different generations of GC could be modeled as metric dimensions/attributes rather than each being modeled as a separate metric).

@reyang (Member) commented Feb 28, 2022

> @reyang These are the APIs that are internal and should be made public:
>
>   • GC.GetLastGCPercentTimeInGC()
>   • GC.GetGenerationSize() // for generation 0-4
>   • System.Reflection.Assembly.GetAssemblyCount()
>   • Exception.GetExceptionCount()
>   • RuntimeEventSourceHelper.GetCpuUsage()
>
> Should I make PRs or will you discuss these requests internally?

@cijothomas and I had a discussion with the .NET runtime team (@maryamariyan, @noahfalk and @tarekgh), here goes a quick summary:

  1. Some of the methods (e.g. RuntimeEventSourceHelper.GetCpuUsage()) were introduced into the runtime to reduce dependencies; there are likely going to be public APIs in other .NET runtime assemblies that meet the requirements here.
  2. Some methods might have no good public alternatives. From the .NET perspective, @noahfalk can help to see what the direction is - e.g. we might make certain APIs public in a future version of the runtime, or design some public APIs that provide the same or even more features.
  3. Given this OpenTelemetry package will have to work with existing versions of the runtime, we need to use reflection (probably in combination with some conditional compilation or runtime detection); for the functions where we take the reflection approach, the .NET team could add comments in the source code to avoid breaking OpenTelemetry accidentally - before we can fully move to public APIs.
  4. All of the above doesn't need to block this PR; we can address these 5 missing instruments in separate PRs.

@noahfalk to share more insights.

Contributor

  • GC.GetGenerationSize() could be replaced with GC.GetGCMemoryInfo().GenerationInfo[i].SizeAfterBytes

  • GetCpuUsage() is a bit messy design-wise because the computation depends on how long it has been since the API was last called. It works OK if there is only one consumer of the counter, but imagine what happens if there are two exporters and each of them invokes the callback. If we imagine each of them is polling on the same interval (let's say once per minute), then one of them will be reporting the average CPU usage over the last minute and the other one will be reporting the average CPU usage over the last ~1ms since the previous exporter invoked the callback. That tiny sampling window could give very skewed results. We might need to redefine the metric or reconsider how it should be computed. The underlying calculation should be something like:
    At the beginning and end of an interval evaluate System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime, subtract the end from the beginning, and divide by (time_interval * Environment.ProcessorCount).

For the other three APIs I created a runtime issue to track adding them: dotnet/runtime#65989
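The interval-based calculation described above could be sketched like this (names are illustrative; the key point is sampling TotalProcessorTime at both ends of a known interval and normalizing by wall-clock time and processor count):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static double MeasureCpuUsage(TimeSpan interval)
{
    var process = Process.GetCurrentProcess();

    // Sample CPU time at the start of the interval.
    TimeSpan cpuAtStart = process.TotalProcessorTime;

    Thread.Sleep(interval);

    // Re-read cached process statistics, then sample again at the end.
    process.Refresh();
    TimeSpan cpuAtEnd = process.TotalProcessorTime;

    // Fraction of available CPU used over the interval, in [0, 1].
    return (cpuAtEnd - cpuAtStart).TotalSeconds
        / (interval.TotalSeconds * Environment.ProcessorCount);
}

Console.WriteLine(MeasureCpuUsage(TimeSpan.FromSeconds(1)));
```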

Contributor

@reyang @cijothomas - Talking with Maoni about the percent-time-in-gc counter, a similar issue arises to the one I see with the cpu % counter. It seems like what we'd want to do is represent these things as ObservableCounters which measure the number of seconds where something was true:
CPU = seconds/interval where CPU was in use
GC = seconds/interval where GC had all threads paused

@reyang (Member) commented Mar 1, 2022

Take an example to make sure I understand the proposal:

| Time since program started | Total time when GC had all threads paused |
|----------------------------|-------------------------------------------|
| 1.5 seconds                | 0.1 second                                |
| 6.5 seconds                | 1.1 seconds                               |
| 16.5 seconds               | 2.1 seconds                               |
  • T0: 0s
  • T1: 1.5s
  • T2: 6.5s
  • T3: 16.5s

Option 1:

  • ObservableCounter("process.runtime.dotnet.gc.time")
  • at T1, record 0.1 // (T0, T1]
  • at T2, record 1.1 // (T0, T2]
  • at T3, record 2.1 // (T0, T3]

The consumer can configure the exporter to use delta temporality, so proper rate can be calculated downstream.

Option 2:

  • at T1, record 0.067 // 0.1 / 1.5
  • at T2, record 0.2 // (1.1 - 0.1) / (6.5 - 1.5)
  • at T3, record 0.1 // (2.1 - 1.1) / (16.5 - 6.5)

It seems Option 2 won't work, as the data being recorded won't fit any existing instrument (unless we want to pick something like an async Gauge).

Contributor

Yep exactly, option 1 is what I was suggesting we should start doing and option 2 is the problematic path that the cpu-usage EventCounter is doing right now.

Option (2) works ok with an ObservableGauge only if there is guaranteed exactly one observer/exporter, allowing the callback to infer the interval by timestamping when it was last called. As soon as there is a 2nd observer then the callback can't distinguish if the call at T2 is a new observer trying to measure the interval [0, T2] or the same observer that called at T1 now trying to measure [T1, T2].

At risk of getting a little into the weeds, the current runtime implementation of GetLastGCPercentTimeInGC() doesn't capture a timestamp of when it was last called, but the way it avoids doing that produces a measurement people are likely to find confusing. In your example above, imagine that the various GCs occurred at:
GC 1: 1.0-1.2 (0.1 sec paused)
GC 2: 2.5-3.9 (0.9 sec paused)
GC 3: 4.0-4.1 (0.1 sec paused)
GC 4: 15.0-16.2 (1 sec paused)
You can see GCs can also do some work while the threads aren't paused. Adding up the pause times gives the same 0.1, 1.1 and 2.1 measurements as above when observed at T=1.5, 6.5, and 16.5.
If you measure at some time T, GetLastGCPercentTimeInGC() finds the most recent GC X that finished before time T and calculates GC_x_pause / (GC_x_end - GC_x-1_end). For example, if you query at T=6.5 it determines GC 3 was the most recent completed GC before that point, which ended at 4.1. Before that, GC 2 ended at 3.9. That gives 0.1 pause time / (4.1 - 3.9) = 50% time in GC. Users were probably hoping to see (1.1 - 0.1) / (6.5 - 1.5) = 20% time in GC and will be surprised not to see that. So the current GC stats don't suffer non-determinism when subjected to multiple observers, but only by using a surprising calculation that isn't necessarily a good approximation of what users really wanted.
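Option 1 could be sketched as follows, assuming .NET 7+ where GC.GetTotalPauseDuration() is public; the metric name follows the earlier suggestion in this thread, and this is an illustration rather than the PR's code:

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// Report cumulative pause time; an exporter configured for delta temporality
// can turn the monotonic sum into a rate downstream, avoiding the
// multiple-observer ambiguity of the percent-style calculation.
meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.time",
    () => GC.GetTotalPauseDuration().TotalSeconds,
    unit: "s",
    description: "Cumulative time the GC has paused all threads");
```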

@omamoo commented Feb 28, 2022

@twenzel Can you make the MetricPrefix configurable? process_runtime_dotnet is a prefix matching Prometheus conventions but differs from, for example, Datadog metrics.

this.eventCounterStore.Subscribe(LohSizeCounterName, EventCounterType.Mean);
this.eventCounterStore.Subscribe(PohSizeCounterName, EventCounterType.Mean);

this.gcHeapSizeCounter = meter.CreateObservableCounter($"{RuntimeMetrics.MetricPrefix}gc_heap_size", () => (double)(GC.GetTotalMemory(false) / 1_000_000), "MB", "GC Heap Size");

Can I ask a question?
Why ObservableCounter and not ObservableGauge?

From the docs:
ObservableCounter - ObservableCounter is an Instrument which reports monotonically increasing value(s)
Observable Gauge - ObservableGauge is an asynchronous Instrument which reports non-additive value(s)

The heap size is non-additive value and can decrease.

Contributor Author

Fixed.

this.gen0GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_0-gc_count", () => GC.CollectionCount(0), description: "Gen 0 GC Count");
this.gen1GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_1-gc_count", () => GC.CollectionCount(1), description: "Gen 1 GC Count");
this.gen2GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_2-gc_count", () => GC.CollectionCount(2), description: "Gen 2 GC Count");
this.allocRateCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}alloc_rate", () => GC.GetTotalAllocatedBytes(), "B", "Allocation Rate");

Questions again here: GC.GetTotalAllocatedBytes is an IncrementingPollingCounter in RuntimeEventSource.

From the docs - "Provides a variant of EventCounter for variables that are ever-increasing, such as the number of exceptions in the runtime."

Why not use ObservableCounter?

Contributor Author

Fixed.
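A sketch of that fix: since the total-allocated-bytes value is ever-increasing, it fits an ObservableCounter (the meter name and metric name here are illustrative, not necessarily the PR's final ones):

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// GC.GetTotalAllocatedBytes() is monotonically increasing, so an
// ObservableCounter (not an ObservableGauge) is the matching instrument.
meter.CreateObservableCounter(
    "process.runtime.dotnet.alloc_rate",
    () => GC.GetTotalAllocatedBytes(),
    unit: "B",
    description: "Allocation Rate");
```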

/// <summary>
/// Extensions to configure the runtime metrics options.
/// </summary>
public static class RuntimeMetricsOptionsExtensions
Member

I think we don't need to have this (at least in an initial release), as RuntimeMetricsOptions already has public properties to enable each category of metrics.

Member

the general idea - start with the absolute minimum public API. We can always add more public APIs based on feedback from the initial release.

Member

> the general idea - start with the absolute minimum public API. We can always add more public APIs based on feedback from the initial release.

+1 👍

@cijothomas (Member) left a comment

Overall approach seems good to me, except that we need to remove the EventCounter specifics from this PR. It'll limit the number of metrics available, but we can work with the runtime team to expose more without going through EventCounters.

Will do another review once the EventCounter stuff is removed (that should reduce the PR size by half)

```csharp
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddRuntimeMetrics(options => options
        .WithGc())
    .Build();
```
Member

modify the examples to match the updated version.

/// <summary>
/// Gets a value indicating whether all metrics are enabled.
/// </summary>
internal bool IsAllEnabled => this.GcEnabled == null
Member

have you considered making everything opt-in? i.e. by default, the user gets nothing by simply adding AddRuntimeMetrics(). One needs to explicitly enable each category individually to get them.

Contributor Author

So then the configure parameter in MeterProviderBuilderExtensions.AddRuntimeMetrics should be mandatory, right? Otherwise it makes no sense to add AddRuntimeMetrics().

Member

we could make some of them as enabled by default.

Contributor Author

@cijothomas ok, which one should be enabled by default? "Performance"?

Member

pick anything random for now or make all Enabled by default. We can merge and get a nuget out, and start collecting some feedback, and adjust accordingly.

if (options.IsGcEnabled)
{
    this.meter.CreateObservableGauge($"{options.MetricPrefix}gc_heap_size", () => (double)(GC.GetTotalMemory(false) / 1_000_000), "MB", "GC Heap Size");
    this.meter.CreateObservableGauge($"{options.MetricPrefix}gen_0-gc_count", () => GC.CollectionCount(0), description: "Gen 0 GC Count");
Contributor

I would suggest we use a single separator rather than a mixture of both "-" and "_".

@reyang (Member) commented Mar 1, 2022

I would propose int64 process.runtime.dotnet.gc.heap, and the unit is Bytes.

For GC count, use process.runtime.dotnet.gc.count, and put gen as dimension/attribute.

Doesn't have to block this PR though.
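That proposal could be sketched like this, with the generation recorded as an attribute on a single counter (meter and attribute names are illustrative, not from the PR):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// One instrument, with the GC generation as a dimension/attribute,
// instead of a separate gen_0/gen_1/gen_2 metric per generation.
meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.count",
    () => new[]
    {
        new Measurement<long>(GC.CollectionCount(0), new KeyValuePair<string, object?>("gen", "gen0")),
        new Measurement<long>(GC.CollectionCount(1), new KeyValuePair<string, object?>("gen", "gen1")),
        new Measurement<long>(GC.CollectionCount(2), new KeyValuePair<string, object?>("gen", "gen2")),
    },
    description: "Number of garbage collections, by generation");
```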

}
#endif

if (options.IsPerformanceEnabled)
Contributor

Perhaps this should be named "Memory", or "Process Memory"? "Performance" seems very broad.

Contributor Author

This "category" is used to provide the cpu-usage as well, which was removed for the 1st version.
What about "ProcessPerformance"?

@noahfalk (Contributor) commented Mar 1, 2022

I noticed later that we were planning to prefix all these metrics with "process.runtime.dotnet." and the OT semantic conventions suggest this value should be named something like "process.memory.xyz". Maybe we should leave this instrument out of this Meter and put it with a different collection of non-runtime process information? Then the category would be gone and the name wouldn't matter.

Contributor Author

@noahfalk Don't the semantics say "process.runtime.xyz.memory.usage"?

Contributor

Sorry I'm not sure I understood the question?
I believe the semantics are distinguishing memory allocated by a runtime's GC (process.runtime.dotnet.something) vs. all memory allocated by the process for any purpose (process.memory.something). The original EventCounters for the runtime were created prior to these semantic conventions existing and they weren't attempting to separate process-wide information from runtime-specific information.

Member

Most of the OpenTelemetry semantic conventions are still at very early stages.
Take process.memory.usage and process.memory.virtual for example: the names are a bit confusing, and memory measurements are much broader if we consider things like private working set, shared working set, commit size, reserved size, and managed vs. native/unmanaged memory.

If I understand @noahfalk correctly, I think the intention is to put .NET runtime specific things under "process.runtime.dotnet.*", and put operating system specific things under a different place that doesn't have "runtime.dotnet" in the name. In other words, we need to ask ourselves - would this metric also apply to C++, if yes, it should not have "dotnet" in the name.

@reyang (Member) left a comment

There are many nice discussions and I think all of them can be addressed in follow up PRs.

I would recommend that in this PR we focus on making the CI clean (100% pass), leave the other nice things for follow up PRs (and I'm happy to send a follow up PR if that helps).

@codecov bot commented Mar 3, 2022

Codecov Report

Merging #207 (69ec265) into main (5f4fce1) will increase coverage by 0.49%.
The diff coverage is 98.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #207      +/-   ##
==========================================
+ Coverage   78.36%   78.85%   +0.49%     
==========================================
  Files          89       92       +3     
  Lines        2334     2393      +59     
==========================================
+ Hits         1829     1887      +58     
- Misses        505      506       +1     
Impacted Files Coverage Δ
...entation.Runtime/MeterProviderBuilderExtensions.cs 85.71% <85.71%> (ø)
....Contrib.Instrumentation.Runtime/RuntimeMetrics.cs 100.00% <100.00%> (ø)
...b.Instrumentation.Runtime/RuntimeMetricsOptions.cs 100.00% <100.00%> (ø)

@cijothomas (Member)

I will proceed and merge the PR to make progress.

  1. Create the framing (i.e project, workflow, extension method to configure etc). -- this PR. (We can revisit the actual instrument name, logic in follow ups)
  2. Support one metric, and add usage example
  3. Add more metrics which are straightforward to implement.
  4. Release an alpha nuget with a couple of metrics.
  5. Add more metrics...
  6. Based on feedback, we can adjust subsequent releases.

@cijothomas (Member) left a comment

Looks good for an initial merge. Let's follow up with more fixes.

@cijothomas cijothomas mentioned this pull request Mar 3, 2022
@cijothomas cijothomas merged commit 693e261 into open-telemetry:main Mar 3, 2022
@reyang reyang mentioned this pull request Mar 4, 2022
swetharavichandrancisco pushed a commit to swetharavichandrancisco/opentelemetry-dotnet-contrib that referenced this pull request Apr 4, 2022
7 participants