Add runtime metrics in `System.Diagnostics.DiagnosticSource` #104680

stevejgordon · 2024-07-10T15:13:55Z

This adds new runtime metrics to align with the proposed semantic conventions. I am opening this as a draft until the naming is finalised in the OTel semantic conventions.

The metrics are essentially a port of the existing OTel contrib runtime metrics. Small changes have been made to optimise the performance and align the naming with those being proposed and discussed over in the semantic conventions repo.

cc @tarekgh and @noahfalk

...braries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/RuntimeMetrics.cs

lmolkova · 2024-07-10T21:43:35Z

@stevejgordon thanks a lot for doing this!

we synced offline and made some common suggestions on open-telemetry/semantic-conventions#1035 (comment). Hope it'd help you focus on this PR and get it merged 🤞

stevejgordon · 2024-07-11T07:05:38Z

@lmolkova, happy to contribute. Thanks for clarifying the suggestions in the semantic conventions. I've made those changes and aligned this PR with the updated naming.

tarekgh · 2024-07-11T16:14:31Z

@carlossanlop @adamsitnik @ericstj I am seeing the cyclic errors /__w/1/s/.packages/microsoft.dotnet.sharedframework.sdk/9.0.0-beta.24360.1/targets/sharedfx.targets(484,5): error : System.IO.Pipes > System.Net.Sockets > System.Net.NameResolution > System.Diagnostics.DiagnosticSource > System.Diagnostics.Process > System.IO.Pipes [/__w/1/s/src/installer/pkg/sfx/Microsoft.NETCore.App/Microsoft.NETCore.App.Runtime.sfxproj] ~~while this PR didn't add or change any dependencies. Any idea ideas why? the change started to call things like Process process = Process.GetCurrentProcess();.~~

tarekgh · 2024-07-11T16:59:05Z

I talked offline with @ericstj and looks we introduced the dependency on System.Diagnostics.Process which caused the cyclic dependency.

@noahfalk @samsp-msft @lmolkova @reyang do you see any issue if we move this feature to Microsoft.Extensions.Diagnostics library instead?

System.Diagnostics.DiagnosticSource is really low in the stack and it is good to reduce any dependencies there especially if we extend the runtime metrics in the future.

reyang · 2024-07-11T17:05:01Z

I talked offline with @ericstj and looks we introduced the dependency on System.Diagnostics.Process which caused the cyclic dependency.

@noahfalk @samsp-msft @lmolkova @reyang do you see any issue if we move this feature to Microsoft.Extensions.Diagnostics library instead?

No issue from what I can see, thanks!

stevejgordon · 2024-07-11T17:50:02Z

@noahfalk @samsp-msft @lmolkova @reyang do you see any issue if we move this feature to Microsoft.Extensions.Diagnostics library instead?

Unfortunately, there might be an issue with this. The implementation relies on the initialisation of the static meter from the ctor of MeterListener. We'd need an alternative way to ensure the Meter is created.

jkotas · 2024-07-11T18:08:05Z

What is the impact of the dependency closure of this new meter on the size of trimmed or native-aot compiled applications?

If this is going to be enabled by default, does it mean that this is introducing significant binary size regression for trimmer or native-aot compiled apps?

I think we either need to avoid taking large dependency closure in the implementation and then it is ok to enable it by default; or it needs to opt-in and then the large dependency closure is ok (not ideal though).

tarekgh · 2024-07-11T19:08:00Z

@jkotas we have the switch System.Diagnostics.Metrics.Meter.IsSupported to allow disabling the metrics. Is this can be acceptable mitigation for AOT if apps care about the size and not using metrics? I am just thinking loudly here.

Unfortunately, there might be an issue with this. The implementation relies on the initialisation of the static meter from the ctor of MeterListener. We'd need an alternative way to ensure the Meter is created.

Yes, I am trying to see if there are any other concerns that can prevent us moving it to the other library. For alternative way to create the Meter, I am thinking in the direction if we can keep creating the Meter in System.Diagnostics.DiagnosticSource but having instruments that are bringing the new dependencies be created from the extensions library. Of course we'll need to have a way to pick the meter instance from System.Diagnostics.DiagnosticSource. I am still thinking of other alternatives or viable solutions.

noahfalk · 2024-07-11T20:27:10Z

I talked offline with @ericstj and looks we introduced the dependency on System.Diagnostics.Process which caused the cyclic dependency.

I'm hopping between meetings and haven't had a chance to look more closely, but could we break the dependency by p/invoking to the relevant OS APIs directly? I believe on Windows its a call to GetProcessTimes() and on Linux a call to getrusage(). It doesn't seem like it would take much code to do that.

noahfalk · 2024-07-11T20:30:25Z

I think we either need to avoid taking large dependency closure in the implementation and then it is ok to enable it by default; or it needs to opt-in and then the large dependency closure is ok (not ideal though).

I would aim for the former - avoid taking large dependency closure. In theory runtime metrics should be nothing more than some calls into low-level runtime APIs (GC, JIT, assemblies, threadpool) + a tiny bit of data from the OS (some cpu and memory usage info)

tarekgh · 2024-07-11T20:49:35Z

@noahfalk there are more dependencies and not just Process one. Yes, process one is what is causing cycles, but I am not sure if we are ok for the other dependencies.

    <Reference Include="System.Diagnostics.Process" />
    <Reference Include="System.Threading.ThreadPool" />
    <Reference Include="System.ComponentModel.Primitives" />

tarekgh · 2024-07-11T21:07:27Z

I'm hopping between meetings and haven't had a chance to look more closely, but could we break the dependency by p/invoking to the relevant OS APIs directly? I believe on Windows its a call to GetProcessTimes() and on Linux a call to getrusage(). It doesn't seem like it would take much code to do that.

Not dismissing the idea but want to point out that we have different implementations on Linux, Windows, Bsd, Mac and maybe more. @ericstj had another idea if this functionality can be exposed from Environment class (of course this will need a new API).

ericstj · 2024-07-11T22:58:53Z

Pinvoking could be done via some runtime checks - you wouldn't want to introduce runtime specific builds to this component. We try not to do that in our nuget packages. Especially those that are used in older frameworks.

New API is blocker since that wouldn't be available on NetFx and older .NET versions.

tarekgh · 2024-07-11T23:13:36Z

New API is blocker since that wouldn't be available on NetFx and older .NET versions

I forgot about that 😢

samsp-msft · 2024-07-11T23:25:54Z

Is the meter something that can be registered with DI and then fetched later, or does that create a problem with then taking a DI dependency?

tarekgh · 2024-07-11T23:33:57Z

Is the meter something that can be registered with DI and then fetched later, or does that create a problem with then taking a DI dependency?

You cannot access the DI from DiagnosticSource library level. In addition to that we need the scenario work outside containers (like console apps for instance). But you are thinking in the same direction I was thinking #104680 (comment). The Meter and instruments not having new dependencies can be created in DiagnosticSource. And in Extensions library we can start a listener which will give us the created instruments and from the instrument we can get the Meter and create more instruments that need the dependency. I know this is complicated, but it might work. The only issue I am seeing with that is we need a startup code to execute when the extensions library gets loaded.

jkotas · 2024-07-12T04:19:14Z

In theory runtime metrics should be nothing more than some calls into low-level runtime APIs (GC, JIT, assemblies, threadpool) + a tiny bit of data from the OS (some cpu and memory usage info)

Yes, I agree. The immediate problem is caused by System.Diagnostics.Process that is a heavy weight component (last time I checked, any use of it brought in several MBs of code into native aot apps).

New API is blocker since that wouldn't be available on NetFx and older .NET versions.

It is fine to have a solution with good performance characteristics that uses new APIs for current .NET, and a solution with worse performance characteristics that depends on pre-existing APIs for downlevel platforms and hack around the static dependency cycles using reflection.

noahfalk · 2024-07-12T06:41:21Z

It is fine to have a solution with good performance characteristics that uses new APIs for current .NET, and a solution with worse performance characteristics that depends on pre-existing APIs for downlevel platforms and hack around the static dependency cycles using reflection.

I also think it would be fine if this feature doesn't exist on downlevel platforms at all, or it doesn't show up on downlevel platforms in this release. If it weren't for the dependency on the Meter API I would have happily implemented this feature in System.Private.CoreLib where there would never be any downlevel support.

there are more dependencies and not just Process one. Yes, process one is what is causing cycles, but I am not sure if we are ok for the other dependencies.

If we get rid of the System.Diagnostics.Process dependency that should also eliminate the System.ComponentModel.Primitives dependency. For threadpool I don't expect that is problematic as it is just a facade on S.P.C APIs. Downlevel we don't have to support the functionality. So I think Process is the only dependency that is problematic.

Not dismissing the idea but want to point out that we have different implementations on Linux, Windows, Bsd, Mac and maybe more.

The implementation inside of System.Diagnostics.Process is more complex because it has to handle the case where the process being inspected is not the current one. When we limit to current process only the complexity is much lower. The existing cpu-usage counter calls to RuntimeEventSourceHelper.GetCpuUsage which only has two platform specific implementations:
Unix (handles Linux, BSD, Mac, and others - ultimately calls into the PAL and uses getrusage)
Windows (calls to GetProcessTimes)

As best I can tell GetProcessTimes() and getrusage() covers all the cases where we'd need to make the data available, and then we've got cases like browser or WASI where the metric isn't going to be available. I'm certainly not opposed to adding a new public API, but I am guessing the quickest and easiest implementation would be a conditional check at runtime to determine the platform and then p/invoke to one of those two OS APIs.

@stevejgordon - do you want to give the p/invoke approach a try for cpu.time? I believe you'd want something like:

if(OperatingSystem.IsWindows())
{
   // pinvoke to GetProcessTimes()
}
else if(!OperatingSystem.IsWasi() && !OperatingSystem.IsBrowser())
{
   // pinvoke to getrusage()
}
else
{
   // no support for cpu.time on this platform
}

stevejgordon · 2024-07-12T06:50:29Z

@noahfalk, sure! I'll have a look at this today.

Ensure that each test only observes the expected Meter

stevejgordon · 2024-07-17T15:54:28Z

@noahfalk, I have tweaked the tests to accommodate some of the platforms not running a forced GC. I also tweaked the existing tests that were failing to ensure they only observe their expected Meters. I'm hoping this current run should go through on all the tests, but I will keep an eye on it.

tarekgh · 2024-07-17T16:17:15Z

@stevejgordon I have changed the PR to be not draft. I think we can merge once we get the green CI and no more comments need to be addressed. I can enable the cpu time metrics with my changes later as I need more time to finish exposing the new Environment API.

tarekgh · 2024-07-17T20:19:07Z

@stevejgordon there are some tests failing on WASM

Maybe we need to exclude running the metric tests on WASM. something like [ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.IsNotBrowser))]. We already do that in other metric tests.

stevejgordon · 2024-07-17T21:45:55Z

@tarekgh, sure. Will try that first thing tomorrow. I was expecting something like that might be needed somewhere.

noahfalk · 2024-07-18T01:11:27Z

Just a heads up, I lost internet connectivity at my house. I'm doing what I can from my cell phone but it's a subpar UI experience to say the least :)

stevejgordon · 2024-07-18T06:33:07Z

@noahfalk / @tarekgh It looks like the tests are happy now! The only failures seem to be known issues and unrelated.

tarekgh · 2024-07-18T15:37:59Z

Thanks @stevejgordon for your contribution.

stevejgordon · 2024-07-18T15:51:52Z

Happy to help! Thanks for the support as we got this over the line!

noahfalk · 2024-07-18T22:11:05Z

Thanks Steve!

dotnet-issue-labeler bot added the area-System.Diagnostics.Metric label Jul 10, 2024

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jul 10, 2024

This was referenced Jul 10, 2024

Add runtime metrics stevejgordon/runtime#1

Draft

Add initial experimental .NET CLR runtime metrics open-telemetry/semantic-conventions#1035

Merged

cijothomas reviewed Jul 10, 2024

View reviewed changes

...braries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/RuntimeMetrics.cs Show resolved Hide resolved

This was referenced Jul 11, 2024

System.IO.Net5Compat.Tests and System.IO.Tests suddenly exiting with error 137 #100558

Open

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

stevejgordon added 7 commits July 17, 2024 11:56

Add some test logging

024d2d0

Fix MetricsTests

8457396

Ensure that each test only observes the expected Meter

More test logging and extended wait for metrics

7708d0a

More logging and a test fix

fef38e1

Remove System.Diagnostics.Process

a5c348d

Update HeapTags test to assert correctly if no GC has run

e3662c9

Only include System.Threading.ThreadPool reference for net9.0

24af281

tarekgh marked this pull request as ready for review July 17, 2024 16:13

Merge branch 'main' into runtime-metrics

087756c

Skip tests on browser

5105ee8

build-analysis bot mentioned this pull request Jul 18, 2024

CI error: System.Net.Quic.QuicException: The connection timed out from inactivity #91757

Closed

tarekgh mentioned this pull request Jul 18, 2024

'System.Net.NameResolution.Functional.Tests' Failure #105092

Closed

tarekgh approved these changes Jul 18, 2024

View reviewed changes

tarekgh merged commit fd94c84 into dotnet:main Jul 18, 2024
80 of 84 checks passed

This was referenced Jul 21, 2024

[apple] Mobile failures in System.Diagnostics.DiagnosticSource runtime metrics #105202

Closed

[Android] Android x86 emulator failures in System.Diagnostics.DiagnosticSource runtime metrics #105203

Closed

tarekgh mentioned this pull request Jul 22, 2024

Fix Deadlock Inside Metrics Code #105259

Merged

noahfalk mentioned this pull request Jul 24, 2024

Docs/test updates/migration guidance for new runtime metrics in dotnet-counters dotnet/diagnostics#4806

Closed

noahfalk mentioned this pull request Aug 1, 2024

Initializing RuntimeMetrics #105845

Closed

noahfalk mentioned this pull request Aug 15, 2024

Retire IResourceMonitorPublisher in favor of observable instruments dotnet/extensions#4404

Closed

github-actions bot locked and limited conversation to collaborators Aug 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add runtime metrics in `System.Diagnostics.DiagnosticSource` #104680

Add runtime metrics in `System.Diagnostics.DiagnosticSource` #104680

stevejgordon commented Jul 10, 2024

lmolkova commented Jul 10, 2024

stevejgordon commented Jul 11, 2024

tarekgh commented Jul 11, 2024 •

edited

Loading

tarekgh commented Jul 11, 2024 •

edited

Loading

reyang commented Jul 11, 2024

stevejgordon commented Jul 11, 2024

jkotas commented Jul 11, 2024

tarekgh commented Jul 11, 2024

noahfalk commented Jul 11, 2024

noahfalk commented Jul 11, 2024

tarekgh commented Jul 11, 2024

tarekgh commented Jul 11, 2024

ericstj commented Jul 11, 2024

tarekgh commented Jul 11, 2024

samsp-msft commented Jul 11, 2024

tarekgh commented Jul 11, 2024

jkotas commented Jul 12, 2024

noahfalk commented Jul 12, 2024 •

edited

Loading

stevejgordon commented Jul 12, 2024

stevejgordon commented Jul 17, 2024

tarekgh commented Jul 17, 2024

tarekgh commented Jul 17, 2024

stevejgordon commented Jul 17, 2024

noahfalk commented Jul 18, 2024

stevejgordon commented Jul 18, 2024

tarekgh commented Jul 18, 2024

stevejgordon commented Jul 18, 2024

noahfalk commented Jul 18, 2024

Add runtime metrics in System.Diagnostics.DiagnosticSource #104680

Add runtime metrics in System.Diagnostics.DiagnosticSource #104680

Conversation

stevejgordon commented Jul 10, 2024

lmolkova commented Jul 10, 2024

stevejgordon commented Jul 11, 2024

tarekgh commented Jul 11, 2024 • edited Loading

tarekgh commented Jul 11, 2024 • edited Loading

reyang commented Jul 11, 2024

stevejgordon commented Jul 11, 2024

jkotas commented Jul 11, 2024

tarekgh commented Jul 11, 2024

noahfalk commented Jul 11, 2024

noahfalk commented Jul 11, 2024

tarekgh commented Jul 11, 2024

tarekgh commented Jul 11, 2024

ericstj commented Jul 11, 2024

tarekgh commented Jul 11, 2024

samsp-msft commented Jul 11, 2024

tarekgh commented Jul 11, 2024

jkotas commented Jul 12, 2024

noahfalk commented Jul 12, 2024 • edited Loading

stevejgordon commented Jul 12, 2024

stevejgordon commented Jul 17, 2024

tarekgh commented Jul 17, 2024

tarekgh commented Jul 17, 2024

stevejgordon commented Jul 17, 2024

noahfalk commented Jul 18, 2024

stevejgordon commented Jul 18, 2024

tarekgh commented Jul 18, 2024

stevejgordon commented Jul 18, 2024

noahfalk commented Jul 18, 2024

Add runtime metrics in `System.Diagnostics.DiagnosticSource` #104680

Add runtime metrics in `System.Diagnostics.DiagnosticSource` #104680

tarekgh commented Jul 11, 2024 •

edited

Loading

tarekgh commented Jul 11, 2024 •

edited

Loading

noahfalk commented Jul 12, 2024 •

edited

Loading