
[Do Not Merge] Add description doc for DotNet Runtime metrics #404

Closed
wants to merge 15 commits

Conversation


@xiang17 xiang17 commented Jun 7, 2022

For issue #335.

Creating a doc with a description of each runtime metric. This doc will serve both as a design doc and as a formal descriptive doc for users in the future.

I'd like to ask for clarification on a few metrics where I'm not sure whether they are monotonically increasing.

Can I assume that all metrics implemented with IncrementingPollingCounter in the EventCounter implementation are monotonically increasing? Specifically this one:

  • GC.GetTotalAllocatedBytes()

I think the following metrics are monotonically increasing, but the EventCounter implementation uses PollingCounter rather than IncrementingPollingCounter:

  • il.bytes.jitted: System.Runtime.JitInfo.GetCompiledILBytes()
  • methods.jitted.count: System.Runtime.JitInfo.GetCompiledMethodCount()
  • assembly.count: AppDomain.GetAssemblies().Length

I think these are monotonically increasing but would like to confirm (a quick probe is sketched below):

  • Process.UserProcessorTime and Process.PrivilegedProcessorTime (for the current process, as retrieved by Process.GetCurrentProcess())
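
For context, a minimal console probe (my own sketch, not part of the PR; assumes .NET 6+ for `System.Runtime.JitInfo`) that polls each value in question so one can eyeball whether it ever decreases:

```csharp
using System;
using System.Diagnostics;
using System.Runtime;
using System.Threading;

while (true)
{
    // Bytes ever allocated by the GC (expected to be monotonic).
    Console.WriteLine($"allocated bytes: {GC.GetTotalAllocatedBytes(precise: false)}");

    // JIT statistics (.NET 6+).
    Console.WriteLine($"IL bytes jitted: {JitInfo.GetCompiledILBytes()}");
    Console.WriteLine($"methods jitted:  {JitInfo.GetCompiledMethodCount()}");

    // Assembly count for the current appdomain.
    Console.WriteLine($"assemblies:      {AppDomain.CurrentDomain.GetAssemblies().Length}");

    // CPU times for the current process.
    using Process p = Process.GetCurrentProcess();
    Console.WriteLine($"user: {p.UserProcessorTime}, kernel: {p.PrivilegedProcessorTime}");

    Thread.Sleep(1000);
}
```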

@xiang17 xiang17 requested a review from a team June 7, 2022 19:50

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-----------------------------------------------|--------------------------|-----------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**gc.totalmemorysize** | Total allocated bytes | `By` | ObservableGauge | `Int64` | | |
Member

I think many of the ObservableGauge here should be ObservableUpDownCounter once .NET has the support for up-down counters.
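
For illustration, the switch might look like this once the API exists (a sketch only; `CreateObservableUpDownCounter` shipped later, in .NET 7, and the metric shown is just an example):

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// Today: an ObservableGauge, because the value can both rise and fall.
meter.CreateObservableGauge(
    "process.runtime.dotnet.gc.heapsize",
    () => GC.GetTotalMemory(forceFullCollection: false),
    unit: "By");

// Once up-down counters are available: same callback, but the additive
// (summable) semantics of the value are preserved.
meter.CreateObservableUpDownCounter(
    "process.runtime.dotnet.gc.heapsize",
    () => GC.GetTotalMemory(forceFullCollection: false),
    unit: "By");
```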

Contributor

We'll probably need to get more precise on this one. Some different possible interpretations (one possible API mapping is sketched after this list):

  • Total amount of memory ever allocated from the GC since process start
  • Total amount of memory commited by the GC currently
  • Total amount of memory for live .NET objects + memory for free objects (objects that existed in the past and were collected but the memory hasn't yet been reused)
  • Total amount of memory for live .NET objects only
  • Total amount of memory for rooted .NET objects (objects that would still be live after a full GC)
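
One possible mapping of these interpretations to public runtime APIs (my sketch; availability varies by version, e.g. `GCMemoryInfo.TotalCommittedBytes` needs .NET 5+):

```csharp
using System;

// 1. Total memory ever allocated from the GC since process start (monotonic):
long everAllocated = GC.GetTotalAllocatedBytes(precise: false);

// 2. Total memory currently committed by the GC:
long committed = GC.GetGCMemoryInfo().TotalCommittedBytes;

// 3. Live objects + free (not-yet-reused) space, without forcing a collection:
long heapApprox = GC.GetTotalMemory(forceFullCollection: false);

// 4./5. Live-only or rooted-only values require forcing a full blocking GC
//       first, which is too intrusive for a metrics callback:
long liveAfterFullGc = GC.GetTotalMemory(forceFullCollection: true);

Console.WriteLine($"{everAllocated} / {committed} / {heapApprox} / {liveAfterFullGc}");
```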

Member

Some raw notes from our discussion:

process.runtime.dotnet.gc.heapsize: live objects (not necessarily held by user objects/threads; could be held by the GC itself) + fragmentation

all the generations added up should be <= committed (since the GC commits at page granularity and normally asks for multiple pages rather than just one, for caching purposes)

neither of the above includes the GC's own memory usage

github.com/Maoni0/mem-doc/blob/master/doc/.NETMemoryPerformanceAnalysis.md

GC.GetGenerationSize(n) does not provide transactional behavior, since we have to call it multiple times for different generations; that is okay as long as we mention it in the description/doc

remove gc.totalmemorysize since it can be covered by adding up different gens
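
A sketch of the "add up the gens" approach using the public API (assuming .NET 6+, where `GCMemoryInfo.GenerationInfo` exists; `GC.GetGenerationSize` itself is not public). Because `GetGCMemoryInfo` returns a single snapshot, it also sidesteps the transactionality concern noted above:

```csharp
using System;

GCMemoryInfo info = GC.GetGCMemoryInfo();

long sumOfGens = 0;
foreach (GCGenerationInfo gen in info.GenerationInfo)
{
    sumOfGens += gen.SizeAfterBytes; // size as of the end of the last GC
}

// Per the notes above, the sum should be <= the committed total, because the
// GC commits whole pages and usually asks for extra ones for caching.
Console.WriteLine($"sum of gens: {sumOfGens} By, committed: {info.TotalCommittedBytes} By");
```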

Gets the number of timers that are currently active. An active timer is registered
to tick at some point in the future, and has not yet been canceled.
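
The excerpt above describes `System.Threading.Timer.ActiveCount` (.NET 6+); reading it is a one-liner:

```csharp
using System;
using System.Threading;

using var timer = new Timer(_ => { }, state: null, dueTime: 1000, period: 1000);
Console.WriteLine($"active timers: {Timer.ActiveCount}"); // at least 1 here
```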

## Process related metrics
Member

I have a general question: do we want these to be in the OpenTelemetry.Instrumentation.Runtime package, or do we want them in a different package (and if so, what should that package be named)?

Contributor

I'd vote separate package and OpenTelemetry.Instrumentation.Process as a potential name?

@xiang17 xiang17 commented Jun 21, 2022

Created a PR to remove it: #446, and an issue to add them in a new package: #447.

Comment

I have the CPU metrics percentage collector and will also push that to the new namespace (#437). Thanks @reyang for the ping with info about the new namespace.

@utpilla utpilla added the comp:instrumentation.runtime Things related to OpenTelemetry.Instrumentation.Runtime label Jun 8, 2022

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-------------------------------------------|-----------------------------|----------------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**assembly.count** | Number of Assemblies Loaded | `{assemblies}` | ObservableCounter | `Int64` | | |
Contributor Author

.NET runtime has the ability to unload assemblies, so AppDomain.GetAssemblies().Length could decrease on some occasions. The instrument type should be ObservableUpDownCounter (until ObservableUpDownCounter is ready, I'll use ObservableGauge in the implementation; see the sketch below). I'll update the doc later.
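
The interim gauge shape described above might look like this (a sketch, not the PR's actual code; the meter name is illustrative):

```csharp
using System;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

// Gauge for now; switch to ObservableUpDownCounter once it is available.
meter.CreateObservableGauge(
    "process.runtime.dotnet.assembly.count",
    () => (long)AppDomain.CurrentDomain.GetAssemblies().Length,
    unit: "{assemblies}");
```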

Comment

Your comment is for .NET Framework and not for .NET Core, as .NET Core doesn't allow more than one appdomain in the process.

> AppDomain.GetAssemblies().Length could decrease on some occasions.

I am not sure this is accurate either. You may unload a whole appdomain but not an individual assembly, so technically AppDomain.GetAssemblies().Length doesn't decrease, but the total number of assemblies in the process can decrease.

In .NET Core, AssemblyLoadContext is used instead; all assemblies in a context are unloaded by unloading the context.

CC @noahfalk

Contributor

I haven't tested it, but I would assume that assemblies in any AssemblyLoadContext, including an unloadable one, appear in the AppDomain.GetAssemblies() list. In that case AppDomain.GetAssemblies().Length could decrease because a load context was unloaded.

Comment

Will such a counter really have a useful meaning in .NET Core apps? And how can users handle such a counter in .NET Framework apps? Or do we not care much about the .NET Framework cases?

Member

My answer is YES. I can think of a valid use case: a certain online service / background job runs out of memory after hours, and the engineers can tell from the "assembly loaded" metrics that the process kept loading plugins without unloading them.

This happened on my previous team: a backend processing job detected new XML schemas, compiled them to assemblies (for the sake of perf), and loaded them dynamically; the job ended up loading tens of thousands of assemblies.

Comment

Sounds good then!

Contributor Author

Thanks, Tarek and Noah! I'm getting the assembly count for the "current domain" via AppDomain.CurrentDomain.GetAssemblies().Length, and the metric has to work for both .NET Framework and .NET Core. If the value could decrease in any case, I couldn't use ObservableCounter, because a metric data point could be dropped.

I found one remark about a special case when the assembly is in the default application domain:

> If an assembly is loaded into the default application domain, it cannot be unloaded from memory while the process is running.

If Noah's right that assemblies in an AssemblyLoadContext appear in AppDomain.CurrentDomain.GetAssemblies(), then I should use ObservableGauge (and switch to ObservableUpDownCounter when it's available). Otherwise it might lead to loss of data.

@tarekgh tarekgh commented Jun 10, 2022

> I'm getting the assembly count for the "current domain" via AppDomain.CurrentDomain.GetAssemblies().Length

If we are reporting the number from the current appdomain only, I'll question @reyang's scenario (an online service / background job runs out of memory after hours, and engineers rely on the "assembly loaded" metrics to see that plugins kept loading without unloading) when running on .NET Framework. It looks like what we report wouldn't be enough for that scenario.

To clarify, .NET Framework can have multiple appdomains. I guess it would be more useful to report the total count of assemblies across all appdomains, not necessarily just the current appdomain.

In .NET Core, there is only one appdomain, so reporting the number from the current appdomain would work, I guess. We need to confirm that loading and unloading an AssemblyLoadContext is reflected in the number we report. When I get some time, I can try to do that.

@tarekgh tarekgh commented Jun 13, 2022

I have confirmed that in .NET Core apps, when unloading an AssemblyLoadContext, the unloaded assemblies inside that context are reflected in AppDomain.CurrentDomain.GetAssemblies(). In other words, the number of assemblies in the appdomain can decrease.
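
A repro sketch of that confirmation (my own; `plugin.dll` is a placeholder path, and `NoInlining` keeps JIT-extended lifetimes from pinning the context):

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;
using System.Runtime.Loader;

[MethodImpl(MethodImplOptions.NoInlining)]
static void LoadAndUnload()
{
    // Collectible contexts are the unloadable kind.
    var alc = new AssemblyLoadContext("plugin", isCollectible: true);
    alc.LoadFromAssemblyPath(Path.GetFullPath("plugin.dll"));
    Console.WriteLine($"after load:   {AppDomain.CurrentDomain.GetAssemblies().Length}");
    alc.Unload();
}

LoadAndUnload();

// Unloading completes asynchronously; nudge the GC until it finishes.
for (int i = 0; i < 10; i++)
{
    GC.Collect();
    GC.WaitForPendingFinalizers();
}

Console.WriteLine($"after unload: {AppDomain.CurrentDomain.GetAssemblies().Length}");
```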

Contributor Author

Updated the instrument type to ObservableGauge.

For the implementation of getting all AppDomains, I found three articles (1, 2, and 3), but they all seem like very old-school code depending on MSCorEE.dll.

I still need to figure out whether I could use those APIs, or find better alternatives.

| Name | Description | Units | Instrument Type | Value Type | Attribute Key(s) | Attribute Values |
|-----------------------------------------------|--------------------------|-----------|-------------------|------------|------------------|------------------|
| process.runtime.dotnet.**gc.totalmemorysize** | Total allocated bytes | `By` | ObservableGauge | `Int64` | | |
| process.runtime.dotnet.**gc.count** | Garbage Collection count | `{times}` | ObservableCounter | `Int64` | gen | gen0, gen1, gen2 |
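
For reference, a sketch of the `gc.count` instrument with its `gen` attribute (my own; note `GC.CollectionCount(n)` is cumulative across higher generations, since a gen2 GC also collects gen0 and gen1, so exclusive per-generation counts fall out by subtraction):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

var meter = new Meter("OpenTelemetry.Instrumentation.Runtime");

meter.CreateObservableCounter(
    "process.runtime.dotnet.gc.count",
    () => new List<Measurement<long>>
    {
        // gen0-only and gen1-only counts, derived by subtraction:
        new(GC.CollectionCount(0) - GC.CollectionCount(1), new KeyValuePair<string, object?>("gen", "gen0")),
        new(GC.CollectionCount(1) - GC.CollectionCount(2), new KeyValuePair<string, object?>("gen", "gen1")),
        new(GC.CollectionCount(2), new KeyValuePair<string, object?>("gen", "gen2")),
    },
    unit: "{times}");
```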
Member

This is suggested by @noahfalk - for folks who have been using the EventCounters, we might consider something like "legacy time in GC" and make it an explicit opt-in (with a clear API name indicating that it has an unclear definition, and we don't recommend it).

Contributor

FWIW, we should be able to give a precise definition for the legacy counter too; it just isn't always useful or intuitive. I'm pretty sure the existing counter is computed like this:

```text
GC_n   = most recent GC that has ended prior to computing the value
GC_n-1 = GC that ended prior to GC_n

T_end_n   = time when GC_n ended
T_end_n-1 = time when GC_n-1 ended
T_start_n = time when GC_n started

inter_gc_time = T_end_n - T_end_n-1
gc_duration   = T_end_n - T_start_n
counter value = 100 * gc_duration / inter_gc_time
```

As an example, if GC_n-1 ran from 2.1 sec to 2.2 sec and GC_n ran from 2.5 sec to 3.0 sec, then:

```text
inter_gc_time = 3.0 - 2.2 = 0.8
gc_duration   = 3.0 - 2.5 = 0.5
counter_value = 100 * 0.5 / 0.8 = 62.5%
```

People usually assume that the value measures the fraction of GC time during a fixed sampling interval (say every 5 minutes) but really it is measuring the fraction of GC time during a variable interval between the most recent GC and the one before that. If the most recent two GCs happen to be close together the counter may appear unexpectedly high and this often causes confusion.
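
To make the arithmetic concrete, here is a tiny hypothetical helper that reproduces the computation above (times in seconds):

```csharp
using System;

// Legacy "% time in GC": duration of the most recent GC divided by the
// interval between the two most recent GC end times.
static double LegacyTimeInGcPercent(double gcStartN, double gcEndN, double gcEndNMinus1)
{
    double interGcTime = gcEndN - gcEndNMinus1;
    double gcDuration = gcEndN - gcStartN;
    return 100.0 * gcDuration / interGcTime;
}

// The worked example above: GC_n-1 ends at 2.2 s, GC_n runs from 2.5 s to 3.0 s.
Console.WriteLine(LegacyTimeInGcPercent(gcStartN: 2.5, gcEndN: 3.0, gcEndNMinus1: 2.2)); // 62.5
```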

@@ -0,0 +1,175 @@
# Runtime metrics description
Member

Maybe "Runtime metrics overview" or "Runtime metric details" here instead of "description"? Also we should link to this from README.

Comment on lines +3 to +7
Metrics name are prefixed with the `process.runtime.dotnet.` namespace, following
the general guidance for runtime metrics in the
[specs](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/runtime-environment-metrics.md#runtime-environment-specific-metrics---processruntimeenvironment).
Instrument Units [should follow](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/README.md#instrument-units)
the Unified Code for Units of Measure.
Member

Re-worded this a bit:

Metric names are prefixed with the process.runtime.dotnet namespace following
the general guidance in the OpenTelemetry Specification. Instrument units follow the Unified Code for Units of Measure as outlined in the OpenTelemetry specification.


[JitInfo.GetCompiledMethodCount](https://docs.microsoft.com/dotnet/api/system.runtime.jitinfo.getcompiledmethodcount?view=net-6.0#system-runtime-jitinfo-getcompiledmethodcount(system-boolean)):
Gets the number of methods that have been compiled.
The scope of this value is global.
Member

Do we need to state "The scope of this value is global." on all of these? Could probably just drop that since it seems obvious (IMO), or move it to the heading so we only state it once? 🤷

The metrics in this section can be enabled by setting the
`RuntimeMetricsOptions.IsJitEnabled` switch.

These metrics are only available for NET6_0_OR_GREATER.
@CodeBlanch CodeBlanch commented Jun 17, 2022

Using the preprocessor directives in this doc like NET6_0_OR_GREATER is very technical. Maybe just say like...

These metrics are available when targeting .NET6 or later.
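
For context, enabling the opt-in switch from the excerpt would look roughly like this (a sketch; the `AddRuntimeMetrics` extension method and the console exporter are assumptions based on this repo's packages at the time):

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Assumes the OpenTelemetry.Instrumentation.Runtime and
// OpenTelemetry.Exporter.Console packages are referenced.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddRuntimeMetrics(options => options.IsJitEnabled = true)
    .AddConsoleExporter()
    .Build();
```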

Member

I think this is opened just to solicit feedback.

@reyang reyang changed the title Add description doc for DotNet Runtime metrics [Do Not Merge] Add description doc for DotNet Runtime metrics Jun 22, 2022
@xiang17 xiang17 closed this Jun 28, 2022

xiang17 commented Jun 29, 2022

Closed in favor of #475.
