GC Microbenchmarks Perf Regressions: Move from Segments to Regions #73592

Closed
mrsharm opened this issue Aug 8, 2022 · 2 comments

@mrsharm

mrsharm commented Aug 8, 2022

Summary

A number of microbenchmarks regressed after the GC heap representation moved from Segments to Regions; the regressions were attributed to an increase in commits and decommits. Symptoms include a higher number of virtual commit and decommit calls, since Regions is more proactive about decommitting memory.

Perf Results Comparing .NET 6.0 vs. .NET 8.0

Regressed benchmarks:

| Benchmark Name | Baseline | Comparand | Baseline Mean Duration (MSec) | Comparand Mean Duration (MSec) | Δ Mean Duration (MSec) | Δ% Mean Duration |
| --- | --- | --- | --- | --- | --- | --- |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net8.0 | 1186.49 | 1299.91 | 113.42 | 9.56 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net8.0 | 658.78 | 719.09 | 60.31 | 9.16 |
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits) | net6.0 | net8.0 | 2064.8 | 2213.4 | 148.6 | 7.2 |
| System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536,65536 bits) | net6.0 | net8.0 | 2068.39 | 2207.75 | 139.37 | 6.74 |
| System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True) | net6.0 | net8.0 | 200.84 | 213.92 | 13.08 | 6.51 |
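
The Δ columns in these tables follow from the two mean durations. A minimal sketch, assuming Δ% is taken relative to the baseline mean (the helper name `delta_columns` is illustrative and not part of the benchmarking tooling):

```python
def delta_columns(baseline_ms, comparand_ms):
    """Return (delta in ms, delta in percent of the baseline), rounded to 2 places.

    Assumption: the percentage is computed against the baseline mean,
    which matches the rows checked in the table above.
    """
    delta = comparand_ms - baseline_ms
    delta_pct = delta / baseline_ms * 100.0
    return round(delta, 2), round(delta_pct, 2)


if __name__ == "__main__":
    # First row of the net6.0 vs. net8.0 table: baseline 1186.49, comparand 1299.91.
    print(delta_columns(1186.49, 1299.91))
    # BigInteger.Subtract row: baseline 2064.8, comparand 2213.4.
    print(delta_columns(2064.8, 2213.4))
```

Small last-digit differences against the published tables are possible, because the tables were presumably produced from unrounded means.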

This is an improvement over .NET 7.0, which exhibited the following regressions:

| Benchmark Name | Baseline | Comparand | Baseline Mean Duration (MSec) | Comparand Mean Duration (MSec) | Δ Mean Duration (MSec) | Δ% Mean Duration |
| --- | --- | --- | --- | --- | --- | --- |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net7.0 | 1186.49 | 1465.42 | 278.93 | 23.51 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net7.0 | 658.78 | 808.3 | 149.53 | 22.7 |
| System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True) | net6.0 | net7.0 | 200.84 | 226.23 | 25.39 | 12.64 |
| System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True) | net6.0 | net7.0 | 2152.45 | 2404.26 | 251.81 | 11.7 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 1000, pinned: True) | net6.0 | net7.0 | 225.5 | 248.63 | 23.13 | 10.26 |
| System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True) | net6.0 | net7.0 | 314.39 | 344.84 | 30.45 | 9.69 |
| System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True) | net6.0 | net7.0 | 1120.3 | 1226.81 | 106.51 | 9.51 |
| System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits) | net6.0 | net7.0 | 2064.8 | 2235.52 | 170.72 | 8.27 |

Perf Results from .NET 7.0

Here are some of the regressions found when running locally on the following configuration:

BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.795/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-rc.1.22375.2
[Host] : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT
Job-YRVDKS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

1. BenchmarksGame.BinaryTrees_5.RunBenchmark

| Criteria | Segments | Regions | % Diff |
| --- | --- | --- | --- |
| Mean (ms) | 90.39 | 107.1 | 18.487% |
| Total Memory Cleared | 21139281168 | 21318368648 | 0.847% |
| Total Memory Committed | 652763136 | 1282842624 | 96.525% |
| Total Number of Virtual Commit Calls | 9967 | 21460 | 115.311% |
| Total Number of Decommit Calls | 1210 | 1681 | 38.926% |

2. ByteMark.BenchLUDecomp

| Criteria | Segments | Regions | % Diff |
| --- | --- | --- | --- |
| Mean (ms) | 879.7 | 910.5 | 3.383% |
| Total Memory Cleared | 5208446912 | 5208677608 | 0.004% |
| Total Memory Committed | 1023303680 | 3407753216 | 69.971% |
| Total Number of Virtual Commit Calls | 15621 | 54233 | 71.197% |
| Total Number of Decommit Calls | 301 | 1119 | 73.101% |
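
When reading the % Diff columns, note that the two tables above do not appear to use the same denominator. Working backwards from the raw numbers, the BinaryTrees_5 figures are consistent with (Regions - Segments) / Segments, while the BenchLUDecomp figures are consistent with (Regions - Segments) / Regions. This is an inference from the arithmetic, not something stated in the issue. The sketch below (with a hypothetical helper `pct_diff`) lets you compute either form:

```python
def pct_diff(segments, regions, relative_to="segments"):
    """Percentage change from Segments to Regions against the chosen base."""
    if relative_to not in ("segments", "regions"):
        raise ValueError("relative_to must be 'segments' or 'regions'")
    base = segments if relative_to == "segments" else regions
    return (regions - segments) / base * 100.0


# (Segments, Regions) pairs copied from the two tables above.
BINARY_TREES_5 = {
    "Mean": (90.39, 107.1),
    "Total Memory Committed": (652763136, 1282842624),
    "Total Number of Virtual Commit Calls": (9967, 21460),
}
BENCH_LU_DECOMP = {
    "Mean": (879.7, 910.5),
    "Total Memory Committed": (1023303680, 3407753216),
    "Total Number of Virtual Commit Calls": (15621, 54233),
}

if __name__ == "__main__":
    for label, rows in (("BinaryTrees_5", BINARY_TREES_5),
                        ("BenchLUDecomp", BENCH_LU_DECOMP)):
        for name, (seg, reg) in rows.items():
            vs_seg = pct_diff(seg, reg, "segments")
            vs_reg = pct_diff(seg, reg, "regions")
            print(f"{label:14} {name:38} "
                  f"vs Segments: {vs_seg:8.3f}%  vs Regions: {vs_reg:8.3f}%")
```

Measured against Segments, the BenchLUDecomp commit-related growth is considerably larger than the table's percentages suggest, so it is worth checking which base a figure uses before comparing the two benchmarks.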

All the associated regressed microbenchmark issues will be added below.

How To Repro

How To Repro the Issues With Instrumentation:

  1. Set the following environment variables:
    a. set complus_GCLogEnabled=1
    b. set complus_GCLogFileSize=100
    c. set complus_GCLogFile=c:\logs\<name>
    d. set complus_StressLog=0
  2. Update gc.cpp and gcpriv.h with instrumentation like this PR and build the runtimes with and without regions enabled.
  3. Run the individual microbenchmarks:
    a. cd <path to the performance repo>
    b. Run the benchmark using py .\scripts\benchmarks_ci.py --filter <Name of Benchmark e.g. ByteMark.BenchLUDecomp> -f net7.0 --corerun <Path to Corerun>
    c. Record the mean time from BenchmarkDotNet and the results from the instrumentation.
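
The steps above can be sketched as a small Python driver. This is only a sketch under stated assumptions: the function names are hypothetical, `sys.executable` stands in for the `py` launcher, and every path is a placeholder you must supply. The environment variable names and the `--filter`, `-f`, and `--corerun` options are copied from the steps; nothing else about `benchmarks_ci.py` is assumed.

```python
import os
import subprocess
import sys


def gc_log_env(log_file, base_env=None):
    """Return a copy of the environment with the GC logging variables from step 1."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "complus_GCLogEnabled": "1",
        "complus_GCLogFileSize": "100",
        "complus_GCLogFile": log_file,   # e.g. c:\logs\<name>
        "complus_StressLog": "0",
    })
    return env


def benchmark_command(benchmark_filter, corerun_path, framework="net7.0"):
    """Build the step 3b command line; run it from the performance repo root."""
    return [
        sys.executable,  # assumption: used here instead of the `py` launcher
        os.path.join("scripts", "benchmarks_ci.py"),
        "--filter", benchmark_filter,
        "-f", framework,
        "--corerun", corerun_path,
    ]


def run_benchmark(perf_repo, benchmark_filter, corerun_path, log_file):
    """Run one benchmark with GC logging enabled (placeholders supplied by caller)."""
    return subprocess.run(
        benchmark_command(benchmark_filter, corerun_path),
        cwd=perf_repo,
        env=gc_log_env(log_file),
        check=True,
    )
```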
@mrsharm mrsharm added tenet-performance Performance related issue area-GC-coreclr labels Aug 8, 2022
@mrsharm mrsharm added this to the 7.0.0 milestone Aug 8, 2022
@ghost

ghost commented Aug 8, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


@mrsharm

mrsharm commented Sep 12, 2023

Conclusion 1: Of the benchmarks we tested, three still need further investigation or improvement:

  1. BenchmarksGame.BinaryTrees_6.RunBench
  2. System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)
  3. System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)

For 1., the results improved (from 23% to 5%) between two comparisons: (a) the .NET 6 SDK vs. the .NET 6 SDK with a clrgc.dll from .NET 8 Preview 7, and (b) .NET 8 with Segments vs. .NET 8 with a less precise write barrier and Write XOR Execute turned off. This improvement, however, was still not enough to get us back to the same level as Segments.

We still need to analyze 2. and 3.; we observed that the default invocation count does not capture enough samples when looking at the CPU traces.

Conclusion 2: Disabling Mark Phase Prefetch reversed regressions for several benchmarks. As far as I recall, we did have an exception for enablement of this feature.

The data are as follows:

List of Regressions With Mark Phase Prefetch (7)

| Benchmark Name | Baseline | Comparand | Baseline Mean Duration (MSec) | Comparand Mean Duration (MSec) | Δ Mean Duration (MSec) | Δ% Mean Duration |
| --- | --- | --- | --- | --- | --- | --- |
| System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True) | net6.0 | net8.0 | 2611.66 | 3011.9 | 400.24 | 15.33 |
| System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False) | net6.0 | net8.0 | 128.5 | 146.09 | 17.59 | 13.69 |
| System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True) | net6.0 | net8.0 | 1977.51 | 2152.29 | 174.78 | 8.84 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net8.0 | 2568.05 | 2762.54 | 194.49 | 7.57 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False) | net6.0 | net8.0 | 88.02 | 94.06 | 6.04 | 6.86 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True) | net6.0 | net8.0 | 1947.47 | 2066.27 | 118.8 | 6.1 |
| System.Tests.Perf_Enum.GetNames_Generic | net6.0 | net8.0 | 20.22 | 21.28 | 1.06 | 5.23 |

List of Regressions Without Mark Phase Prefetch (2)

| Benchmark Name | Baseline | Comparand | Baseline Mean Duration (MSec) | Comparand Mean Duration (MSec) | Δ Mean Duration (MSec) | Δ% Mean Duration |
| --- | --- | --- | --- | --- | --- | --- |
| System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False) | net6.0 | net8.0 | 128.5 | 144.72 | 16.22 | 12.62 |
| System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False) | net6.0 | net8.0 | 88.02 | 99.04 | 11.03 | 12.53 |
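
To see which benchmarks drop out when Mark Phase Prefetch is disabled, the two lists above can be compared by benchmark name. A minimal sketch, with the names and Δ% values copied from the lists and a hypothetical helper `compare_regressions`:

```python
# (benchmark name, Δ% mean duration) copied from the two lists above.
WITH_PREFETCH = [
    ("System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)", 15.33),
    ("System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)", 13.69),
    ("System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)", 8.84),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)", 7.57),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)", 6.86),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)", 6.1),
    ("System.Tests.Perf_Enum.GetNames_Generic", 5.23),
]
WITHOUT_PREFETCH = [
    ("System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)", 12.62),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)", 12.53),
]


def compare_regressions(with_rows, without_rows):
    """Split benchmark names into those no longer listed and those still listed."""
    with_names = {name for name, _ in with_rows}
    without_names = {name for name, _ in without_rows}
    no_longer_listed = sorted(with_names - without_names)
    still_listed = sorted(with_names & without_names)
    return no_longer_listed, still_listed


if __name__ == "__main__":
    gone, remaining = compare_regressions(WITH_PREFETCH, WITHOUT_PREFETCH)
    print("No longer listed without prefetch:")
    for name in gone:
        print("  ", name)
    print("Still listed without prefetch:")
    for name in remaining:
        print("  ", name)
```

In this data, every `pinned: True` entry and `GetNames_Generic` drop off the list, while the two `pinned: False` entries remain. Note that the Δ% for `AllocateUninitializedArray(length: 10000, pinned: False)` is higher in the second list (12.53 vs. 6.86), which fits Conclusion 1's point that these two benchmarks still need analysis.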

Conclusion 3: Many of the regressions for which we were assigned bugs no longer show up as regressions when the comparison is conducted in a GC-centric way.

As a clear-cut example:

  1. We regressed Microsoft.Extensions.Primitives.StringSegmentBenchmark by 20%: [Perf] Windows/x64: 188 Regressions from GC changes #74014 (comment)
  2. Running the comparison in a more GC-centric way, with .NET 6 (net6.0) vs. .NET 6 plus the clrgc from .NET 8 (net8.0), we get an improvement of 0.57%:

    | Benchmark Name | Baseline | Comparand | Baseline Mean Duration (MSec) | Comparand Mean Duration (MSec) | Δ Mean Duration (MSec) | Δ% Mean Duration |
    | --- | --- | --- | --- | --- | --- | --- |
    | Microsoft.Extensions.Primitives.StringSegmentBenchmark.SubString | net6.0 | net8.0 | 10.81 | 10.74 | -0.06 | -0.57 |

All Tests Run

  1. BenchmarksGame.BinaryTrees_6.RunBench
  2. System.Tests.Perf_Enum.GetNames_Generic
  3. System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)
  4. System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)
  5. System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)
  6. System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)
  7. System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)
  8. System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)
  9. System.Tests.Perf_Enum.GetName_Generic_NonFlags
  10. ByteMark.BenchLUDecomp
  11. System.Tests.Perf_Enum.GetName_NonGeneric_Flags
  12. System.Tests.Perf_Type.GetTypeFromHandle
  13. BenchmarksGame.BinaryTrees_2.RunBench
  14. BenchmarksGame.BinaryTrees_5.RunBench
  15. System.Xml.Linq.Perf_XName.NonEmptyNameSpaceToString
  16. System.Tests.Perf_String.Insert(s1: "Test", i: 2, s2: " Test")
  17. System.Tests.Perf_String.Insert(s1: "dzsdzsDDZSDZSDZSddsz", i: 7, s2: "Test")
  18. Microsoft.Extensions.Primitives.StringSegmentBenchmark.SubString

@mrsharm mrsharm closed this as completed Sep 12, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2023