Runtime instrumentation #207

Merged
merged 20 commits into from
Mar 3, 2022
Conversation

@twenzel (Contributor) commented Feb 16, 2022

This PR implements Runtime instrumentation (#204).

It creates instrumentation for all runtime metrics unless configured otherwise.

All information is retrieved directly (similar to RuntimeEventSource).
Metrics are only created for values that can be retrieved via the runtime API.

@twenzel twenzel requested a review from a team February 16, 2022 22:58
public AssembliesInstrumentation(Meter meter)
{
    this.meter = meter;
    this.assemblyCounter = meter.CreateObservableCounter("assembly-count", () => AppDomain.CurrentDomain.GetAssemblies().Length, description: "Number of Assemblies Loaded");
Member

I think the names for the metrics in this PR should be prefixed with the process.runtime.dotnet. namespace. In this case something like process.runtime.dotnet.assembly_count (using an _ rather than - since I think that's more commonly what I've seen in other semantic conventions).

The spec has some general guidance here for runtime metrics: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#runtime-environment-specific-metrics---processruntimeenvironment

Contributor Author

Thank you. Names are now prefixed.
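A minimal sketch of what the prefixed instrument could look like (the `RuntimeMetricNames` class is a hypothetical name for illustration; the prefix follows the `process.runtime.dotnet.` convention discussed above):

```csharp
using System;
using System.Diagnostics.Metrics;

internal static class RuntimeMetricNames
{
    // Namespace prefix per the OpenTelemetry runtime-metrics guidance.
    internal const string MetricPrefix = "process.runtime.dotnet.";
}

internal sealed class AssembliesInstrumentation
{
    private readonly ObservableCounter<int> assemblyCounter;

    public AssembliesInstrumentation(Meter meter)
    {
        // "assembly_count" uses '_' as the in-name separator, '.' for namespacing.
        this.assemblyCounter = meter.CreateObservableCounter(
            $"{RuntimeMetricNames.MetricPrefix}assembly_count",
            () => AppDomain.CurrentDomain.GetAssemblies().Length,
            description: "Number of Assemblies Loaded");
    }
}
```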

this.meter = meter;
this.eventCounterStore = eventCounterStore;

this.eventCounterStore.Subscribe(CpuTimeCounterName, EventCounterType.Mean);
Member

The instrumentation classes should probably implement IDisposable and clean up subscriptions in the event the MeterProvider is shutdown.

Contributor Author

Great hint. Subscription cleanup is now done in Dispose.
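A hedged sketch of that cleanup pattern; `IEventCounterStore`, `EventCounterType`, and the `Unsubscribe` method are illustrative stand-ins for the PR's actual types:

```csharp
using System;
using System.Diagnostics.Metrics;

internal enum EventCounterType { Mean, Sum }

internal interface IEventCounterStore
{
    void Subscribe(string name, EventCounterType type);
    void Unsubscribe(string name);
}

internal sealed class PerformanceInstrumentation : IDisposable
{
    private const string CpuTimeCounterName = "cpu-usage";
    private readonly IEventCounterStore eventCounterStore;

    public PerformanceInstrumentation(Meter meter, IEventCounterStore eventCounterStore)
    {
        this.eventCounterStore = eventCounterStore;
        this.eventCounterStore.Subscribe(CpuTimeCounterName, EventCounterType.Mean);
    }

    // Invoked when the MeterProvider shuts down, so no stale subscriptions remain.
    public void Dispose() => this.eventCounterStore.Unsubscribe(CpuTimeCounterName);
}
```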

try
{
    this.refreshIntervalDictionary = new Dictionary<string, string>();
    this.refreshIntervalDictionary.Add("EventCounterIntervalSec", "1");


@twenzel this is doing exactly the same as #177 as far as "configure the polling schedule" goes - except it's hardcoded to 1 second. This PR makes this specific to runtime metrics, as opposed to any available event counters, including what other libraries/customers might already have in place. @reyang is this what you had in mind for #207 - to restrict to runtime metrics?

Contributor Author

This PR is dedicated to the runtime metrics and uses the API directly to retrieve the values.
Unfortunately sometimes the API is internal, so we have to fall back to the EventCounter.
The advantage of this instrumentation is that it can use the direct API as soon as the runtime makes it public; it will perform better than retrieving the values from event counters, with less overhead.
The EventCounter instrumentation from #177 is a general-purpose implementation to retrieve any existing EventCounter.

Member

I think ideally we want this "Runtime Instrumentation" to use the OpenTelemetry Metrics API rather than relying on the EventCounter. The main reason is efficiency and accuracy - especially considering the Prometheus scenario, where using an asynchronous instrument will make sure we only invoke the callback when Prometheus is scraping metrics, rather than having to poll the EventCounter on some predetermined schedule.

If there are cases where we cannot use OpenTelemetry Metrics API (e.g. there is no proper API - such as ObservableUpDownCounter) or we cannot access the internal functionality of .NET runtime (e.g. if a certain EventCounter is implemented by calling some non-public API), we should discuss and see if we can find better approaches.

Member

> Unfortunately sometimes the API is internal, so we have to fall back to the EventCounter.

I suggest that we call out the list of APIs, and we should work with the .NET Runtime team to see if it makes sense to make these public.

@reyang (Member) commented Feb 24, 2022

@hananiel the goal for #204 is to have a pure OpenTelemetry Metrics API based instrumentation library for the .NET runtime - one that does not rely on EventCounter at all (and, as a consequence, provides better efficiency, sampling accuracy and probably a better model - e.g. different generations of GC could be modeled as metric dimensions/attributes rather than each being modeled as a separate metric).

@reyang (Member) commented Feb 28, 2022

> @reyang These are the APIs that are internal and should be made public:
>
>   • GC.GetLastGCPercentTimeInGC()
>   • GC.GetGenerationSize() // for generation 0-4
>   • System.Reflection.Assembly.GetAssemblyCount()
>   • Exception.GetExceptionCount()
>   • RuntimeEventSourceHelper.GetCpuUsage()
>
> Should I make PRs or will you discuss these requests internally?

@cijothomas and I had a discussion with the .NET runtime team (@maryamariyan, @noahfalk and @tarekgh), here goes a quick summary:

  1. Some of the methods (e.g. RuntimeEventSourceHelper.GetCpuUsage()) were introduced into the runtime to reduce dependencies; there are likely going to be public APIs in other .NET runtime assemblies that meet the requirements here.
  2. Some methods might have no good public alternatives. From the .NET perspective, @noahfalk can help to see what the direction is - e.g. we might make certain APIs public in a future version of the runtime, or design some public APIs that provide the same or even more features.
  3. Given this OpenTelemetry package will have to work with existing versions of the runtime, we need to use reflection (probably in combination with some conditional compilation or runtime detection); for the functions where we take the reflection approach, the .NET team could add comments in the source code to avoid breaking OpenTelemetry accidentally - before we can fully move to public APIs.
  4. All of the above doesn't need to block this PR; we can address these 5 missing instruments in separate PRs.

@noahfalk to share more insights.

Contributor

  • GC.GetGenerationSize() could be replaced with GC.GetGCMemoryInfo().GenerationInfo[i].SizeAfterBytes

  • GetCpuUsage() is a bit messy design-wise because the computation depends on how long it has been since the API was last called. It works OK if there is only one consumer of the counter, but imagine what happens if there are two exporters and each of them invokes the callback. If we imagine each of them is polling on the same interval (let's say once per minute), then one of them will be reporting the average CPU usage over the last minute and the other one will be reporting the average CPU usage over the last ~1ms since the previous exporter invoked the callback. That tiny sampling window could give very skewed results. We might need to redefine the metric or reconsider how it should be computed. The underlying calculation should be something like:
    At the beginning and end of an interval evaluate System.Diagnostics.Process.GetCurrentProcess().TotalProcessorTime, subtract the end from the beginning, and divide by (time_interval * Environment.ProcessorCount).

For the other three APIs I created a runtime issue to track adding them: dotnet/runtime#65989
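The interval-based calculation described above could be sketched like this (names are illustrative; the key point is sampling TotalProcessorTime at both ends of a known interval and normalizing by wall-clock time and processor count):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static double MeasureCpuUsage(TimeSpan interval)
{
    var process = Process.GetCurrentProcess();

    // Sample CPU time at the start of the interval.
    TimeSpan cpuAtStart = process.TotalProcessorTime;

    Thread.Sleep(interval);

    // Re-read cached process statistics, then sample again at the end.
    process.Refresh();
    TimeSpan cpuAtEnd = process.TotalProcessorTime;

    // Fraction of available CPU used over the interval, in [0, 1].
    return (cpuAtEnd - cpuAtStart).TotalSeconds
        / (interval.TotalSeconds * Environment.ProcessorCount);
}

Console.WriteLine(MeasureCpuUsage(TimeSpan.FromSeconds(1)));
```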

Contributor

@reyang @cijothomas - Talking with Maoni about the percent-time-in-gc counter, a similar issue arises to the one I see with the cpu % counter. It seems like what we'd want to do is represent these things as ObservableCounters which measure the number of seconds where something was true:
CPU = seconds/interval where CPU was in use
GC = seconds/interval where GC had all threads paused

@reyang (Member) commented Mar 1, 2022

Take an example to make sure I understand the proposal:

| Time since program started | Total time when GC had all threads paused |
|----------------------------|-------------------------------------------|
| 1.5 seconds                | 0.1 second                                |
| 6.5 seconds                | 1.1 seconds                               |
| 16.5 seconds               | 2.1 seconds                               |
  • T0: 0s
  • T1: 1.5s
  • T2: 6.5s
  • T3: 16.5s

Option 1:

  • ObservableCounter("process.runtime.dotnet.gc.time")
  • at T1, record 0.1 // (T0, T1]
  • at T2, record 1.1 // (T0, T2]
  • at T3, record 2.1 // (T0, T3]

The consumer can configure the exporter to use delta temporality, so proper rate can be calculated downstream.

Option 2:

  • at T1, record 0.067 // 0.1 / 1.5
  • at T2, record 0.2 // (1.1 - 0.1) / (6.5 - 1.5)
  • at T3, record 0.1 // (2.1 - 1.1) / (16.5 - 6.5)

It seems Option 2 won't work, as the data being recorded won't fit any existing instrument (unless we want to pick something like an async Gauge).

Contributor

Yep exactly, option 1 is what I was suggesting we should start doing and option 2 is the problematic path that the cpu-usage EventCounter is doing right now.

Option (2) works ok with an ObservableGauge only if there is guaranteed exactly one observer/exporter, allowing the callback to infer the interval by timestamping when it was last called. As soon as there is a 2nd observer then the callback can't distinguish if the call at T2 is a new observer trying to measure the interval [0, T2] or the same observer that called at T1 now trying to measure [T1, T2].

At risk of getting a little into the weeds, the current runtime implementation of GetLastGCPercentTimeInGC() doesn't capture a timestamp of when it was last called, but the way it avoids doing that produces a measurement people are likely to find confusing. In your example above, imagine that the various GCs occurred at:
GC 1: 1.0-1.2 (0.1 sec paused)
GC 2: 2.5-3.9 (0.9 sec paused)
GC 3: 4.0-4.1 (0.1 sec paused)
GC 4: 15.0-16.2 (1 sec paused)
You can see GCs can also do some work while the threads aren't paused. Adding up the pause times gives the same 0.1, 1.1 and 2.1 measurements as above when observed at T=1.5, 6.5, and 16.5.
If you measure at some time T, GetLastGCPercentTimeInGC() finds the most recent GC X that finished before time T and calculates GC_x_pause / (GC_x_end - GC_x-1_end). For example, if you query at T=6.5 it determines GC 3 was the most recent completed GC before that point, which ended at 4.1. Before that, GC 2 ended at 3.9. That gives 0.1 pause time / (4.1 - 3.9) = 50% time in GC. Users were probably hoping to see (1.1 - 0.1) / (6.5 - 1.5) = 20% time in GC and will be surprised not to see that. So the current GC stats don't suffer non-determinism when subjected to multiple observers, but only by using a surprising calculation that isn't necessarily a good approximation of what users really wanted.
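Option 1 could be sketched as follows, assuming .NET 7+ where GC.GetTotalPauseDuration() is public; the metric name follows the earlier suggestion in this thread, and this is an illustration rather than the PR's code:

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// Report cumulative pause time; an exporter configured for delta temporality
// can turn the monotonic sum into a rate downstream, avoiding the
// multiple-observer ambiguity of the percent-style calculation.
meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.time",
    () => GC.GetTotalPauseDuration().TotalSeconds,
    unit: "s",
    description: "Cumulative time the GC has paused all threads");
```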

@omamoo commented Feb 28, 2022

@twenzel Can you make the MetricPrefix configurable? process_runtime_dotnet is a prefix matching Prometheus conventions but differs from, for example, Datadog metrics.

this.eventCounterStore.Subscribe(LohSizeCounterName, EventCounterType.Mean);
this.eventCounterStore.Subscribe(PohSizeCounterName, EventCounterType.Mean);

this.gcHeapSizeCounter = meter.CreateObservableCounter($"{RuntimeMetrics.MetricPrefix}gc_heap_size", () => (double)(GC.GetTotalMemory(false) / 1_000_000), "MB", "GC Heap Size");

Can I ask a question?
Why ObservableCounter and not ObservableGauge?

From the docs:
ObservableCounter - ObservableCounter is an Instrument which reports monotonically increasing value(s)
Observable Gauge - ObservableGauge is an asynchronous Instrument which reports non-additive value(s)

The heap size is non-additive value and can decrease.

Contributor Author

Fixed.

this.gen0GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_0-gc_count", () => GC.CollectionCount(0), description: "Gen 0 GC Count");
this.gen1GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_1-gc_count", () => GC.CollectionCount(1), description: "Gen 1 GC Count");
this.gen2GCCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}gen_2-gc_count", () => GC.CollectionCount(2), description: "Gen 2 GC Count");
this.allocRateCounter = meter.CreateObservableGauge($"{RuntimeMetrics.MetricPrefix}alloc_rate", () => GC.GetTotalAllocatedBytes(), "B", "Allocation Rate");

Questions again here: GC.GetTotalAllocatedBytes is an IncrementingPollingCounter in RuntimeEventSource.

From the docs - "Provides a variant of EventCounter for variables that are ever-increasing, such as the number of exceptions in the runtime."

Why not use ObservableCounter?

Contributor Author

Fixed.
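A sketch of that fix: since the total-allocated-bytes value is ever-increasing, it fits an ObservableCounter (the meter name and metric name here are illustrative, not necessarily the PR's final ones):

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// GC.GetTotalAllocatedBytes() is monotonically increasing, so an
// ObservableCounter (not an ObservableGauge) is the matching instrument.
meter.CreateObservableCounter(
    "process.runtime.dotnet.alloc_rate",
    () => GC.GetTotalAllocatedBytes(),
    unit: "B",
    description: "Allocation Rate");
```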

/// <summary>
/// Extensions to configure the runtime metrics options.
/// </summary>
public static class RuntimeMetricsOptionsExtensions
Member

I think we don't need to have this (at least in an initial release), as RuntimeMetricsOptions already has public properties to enable each category of metrics.

Member

the general idea - start with the absolute minimum public API. We can always add more public APIs based on feedback from the initial release.

Member

> the general idea - start with the absolute minimum public API. We can always add more public APIs based on feedback from the initial release.

+1 👍

@cijothomas (Member) left a comment

Overall approach seems good to me, except that we need to remove the EventCounter specifics from this PR. It'll limit the number of metrics available, but we can work with the runtime team to expose more without going through EventCounters.

Will do another review once the EventCounter stuff is removed (that should reduce the PR size by half)

```csharp
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddRuntimeMetrics(options => options
        .WithGc())
    .Build();
```
Member

modify the examples to match the updated version.

/// <summary>
/// Gets a value indicating whether all metrics are enabled.
/// </summary>
internal bool IsAllEnabled => this.GcEnabled == null
Member

have you considered making everything opt-in? i.e. by default, the user gets nothing by simply adding AddRuntimeMetrics(). One needs to explicitly enable each category individually to get them.

Contributor Author

So then the configure parameter in MeterProviderBuilderExtensions.AddRuntimeMetrics should be mandatory, right? Otherwise it makes no sense to add AddRuntimeMetrics().

Member

we could make some of them as enabled by default.

Contributor Author

@cijothomas ok, which one should be enabled by default? "Performance"?

Member

pick anything random for now or make all Enabled by default. We can merge and get a nuget out, and start collecting some feedback, and adjust accordingly.

if (options.IsGcEnabled)
{
    this.meter.CreateObservableGauge($"{options.MetricPrefix}gc_heap_size", () => (double)(GC.GetTotalMemory(false) / 1_000_000), "MB", "GC Heap Size");
    this.meter.CreateObservableGauge($"{options.MetricPrefix}gen_0-gc_count", () => GC.CollectionCount(0), description: "Gen 0 GC Count");
Contributor

I would suggest we use a single separator rather than a mixture of both "-" and "_".

@reyang (Member) commented Mar 1, 2022

I would propose int64 process.runtime.dotnet.gc.heap, and the unit is Bytes.

For GC count, use process.runtime.dotnet.gc.count, and put gen as dimension/attribute.

Doesn't have to block this PR though.
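That proposal could be sketched like this, with the generation recorded as an attribute on a single counter (meter and attribute names are illustrative, not from the PR):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// One instrument, with the GC generation as a dimension/attribute,
// instead of a separate gen_0/gen_1/gen_2 metric per generation.
meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.count",
    () => new[]
    {
        new Measurement<long>(GC.CollectionCount(0), new KeyValuePair<string, object?>("gen", "gen0")),
        new Measurement<long>(GC.CollectionCount(1), new KeyValuePair<string, object?>("gen", "gen1")),
        new Measurement<long>(GC.CollectionCount(2), new KeyValuePair<string, object?>("gen", "gen2")),
    },
    description: "Number of garbage collections, by generation");
```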

}
#endif

if (options.IsPerformanceEnabled)
Contributor

Perhaps this should be named "Memory", or "Process Memory"? "Performance" seems very broad.

Contributor Author

This "category" is used to provide the cpu-usage as well, which was removed for the 1st version.
What about "ProcessPerformance"?

@noahfalk (Contributor) commented Mar 1, 2022

I noticed later that we were planning to prefix all these metrics with "process.runtime.dotnet." and the OT semantic conventions suggest this value should be named something like "process.memory.xyz". Maybe we should leave this instrument out of this Meter and put it with a different collection of non-runtime process information? Then the category would be gone and the name wouldn't matter.

Contributor Author

@noahfalk Don't the semantics say "process.runtime.xyz.memory.usage"?

Contributor

Sorry I'm not sure I understood the question?
I believe the semantics are distinguishing memory allocated by a runtime's GC (process.runtime.dotnet.something) vs. all memory allocated by the process for any purpose (process.memory.something). The original EventCounters for the runtime were created prior to these semantic conventions existing and they weren't attempting to separate process-wide information from runtime-specific information.

Member

Most of the OpenTelemetry semantic conventions are still at very early stages.
Take process.memory.usage and process.memory.virtual for example: the names are a bit confusing, and memory measurements are much broader if we consider things like private working set, shared working set, commit size, reserved size, and managed vs. native/unmanaged memory.

If I understand @noahfalk correctly, I think the intention is to put .NET runtime specific things under "process.runtime.dotnet.*", and put operating system specific things under a different place that doesn't have "runtime.dotnet" in the name. In other words, we need to ask ourselves - would this metric also apply to C++, if yes, it should not have "dotnet" in the name.

@reyang (Member) left a comment

There are many nice discussions and I think all of them can be addressed in follow up PRs.

I would recommend that in this PR we focus on making the CI clean (100% pass), leave the other nice things for follow up PRs (and I'm happy to send a follow up PR if that helps).

@codecov bot commented Mar 3, 2022

Codecov Report

Merging #207 (69ec265) into main (5f4fce1) will increase coverage by 0.49%.
The diff coverage is 98.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #207      +/-   ##
==========================================
+ Coverage   78.36%   78.85%   +0.49%     
==========================================
  Files          89       92       +3     
  Lines        2334     2393      +59     
==========================================
+ Hits         1829     1887      +58     
- Misses        505      506       +1     
Impacted Files Coverage Δ
...entation.Runtime/MeterProviderBuilderExtensions.cs 85.71% <85.71%> (ø)
....Contrib.Instrumentation.Runtime/RuntimeMetrics.cs 100.00% <100.00%> (ø)
...b.Instrumentation.Runtime/RuntimeMetricsOptions.cs 100.00% <100.00%> (ø)

@cijothomas (Member)

I will proceed and merge the PR to make progress.

  1. Create the framing (i.e project, workflow, extension method to configure etc). -- this PR. (We can revisit the actual instrument name, logic in follow ups)
  2. Support one metric, and add usage example
  3. Add more metrics which are straightforward to implement.
  4. Release an alpha nuget with a couple of metrics.
  5. Add more metrics...
  6. Based on feedback, we can adjust subsequent releases.

@cijothomas (Member) left a comment

Looks good for an initial merge. Let's follow up with more fixes.

@cijothomas cijothomas mentioned this pull request Mar 3, 2022
@cijothomas cijothomas merged commit 693e261 into open-telemetry:main Mar 3, 2022
@reyang reyang mentioned this pull request Mar 4, 2022
swetharavichandrancisco pushed a commit to swetharavichandrancisco/opentelemetry-dotnet-contrib that referenced this pull request Apr 4, 2022
7 participants