GC Microbenchmarks Perf Regressions: Move from Segments to Regions #73592

mrsharm · 2022-08-08T21:21:48Z

Summary

A number of microbenchmarks regressed after the GC heap representation moved from Segments to Regions. The regression was attributed to an increase in commits and decommits: Regions is more proactive about decommitting memory, which shows up as a higher number of virtual commit and decommit calls.

Perf Results Comparing .NET 6.0 vs. .NET 8.0

Regressed benchmarks:

Benchmark Name	Baseline	Comparand	Baseline Mean Duration (MSec)	Comparand Mean Duration (MSec)	Δ Mean Duration (MSec)	Δ% Mean Duration
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net8.0	1186.49	1299.91	113.42	9.56
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net8.0	658.78	719.09	60.31	9.16
System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits)	net6.0	net8.0	2064.8	2213.4	148.6	7.2
System.Numerics.Tests.Perf_BigInteger.Add(arguments: 65536,65536 bits)	net6.0	net8.0	2068.39	2207.75	139.37	6.74
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True)	net6.0	net8.0	200.84	213.92	13.08	6.51

This is an improvement over .NET 7.0, which exhibited the following regressions:

Benchmark Name	Baseline	Comparand	Baseline Mean Duration (MSec)	Comparand Mean Duration (MSec)	Δ Mean Duration (MSec)	Δ% Mean Duration
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net7.0	1186.49	1465.42	278.93	23.51
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net7.0	658.78	808.3	149.53	22.7
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True)	net6.0	net7.0	200.84	226.23	25.39	12.64
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)	net6.0	net7.0	2152.45	2404.26	251.81	11.7
System.Tests.Perf_GC.AllocateUninitializedArray(length: 1000, pinned: True)	net6.0	net7.0	225.5	248.63	23.13	10.26
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True)	net6.0	net7.0	314.39	344.84	30.45	9.69
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)	net6.0	net7.0	1120.3	1226.81	106.51	9.51
System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits)	net6.0	net7.0	2064.8	2235.52	170.72	8.27
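
To make the improvement over .NET 7.0 concrete, the Δ% figures from the two tables above can be lined up per benchmark. The following is a minimal sketch in Python, assuming rows are matched on benchmark name plus baseline mean (some names appear twice in the tables, so the name alone is not a unique key). The names `rows_net7`, `rows_net8`, `pct_change`, and `compare` are illustrative and not part of the performance repo.

```python
# Rows copied from the tables above: (benchmark, baseline mean ms, comparand mean ms).
ALLOC_UNINIT = "System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)"
ALLOC_ARRAY = "System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: True)"
BIGINT_SUB = "System.Numerics.Tests.Perf_BigInteger.Subtract(arguments: 65536,65536 bits)"

rows_net7 = [
    (ALLOC_UNINIT, 1186.49, 1465.42),
    (ALLOC_UNINIT, 658.78, 808.3),
    (ALLOC_ARRAY, 200.84, 226.23),
    (BIGINT_SUB, 2064.8, 2235.52),
]
rows_net8 = [
    (ALLOC_UNINIT, 1186.49, 1299.91),
    (ALLOC_UNINIT, 658.78, 719.09),
    (BIGINT_SUB, 2064.8, 2213.4),
    (ALLOC_ARRAY, 200.84, 213.92),
]


def pct_change(baseline, comparand):
    """Relative change of the comparand mean against the net6.0 baseline, in percent."""
    return (comparand - baseline) / baseline * 100.0


def compare(rows_old, rows_new):
    """Pair rows that share (name, baseline mean) and return both regressions."""
    old = {(name, base): pct_change(base, comp) for name, base, comp in rows_old}
    paired = []
    for name, base, comp in rows_new:
        key = (name, base)
        if key in old:
            paired.append((name, base, old[key], pct_change(base, comp)))
    return paired


for name, base, net7_pct, net8_pct in compare(rows_net7, rows_net8):
    print(f"{name} (baseline {base} ms): net7.0 {net7_pct:.2f}% -> net8.0 {net8_pct:.2f}%")
```

Because the inputs are the rounded means from the tables, the recomputed percentages can differ from the listed Δ% values in the last digit.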

Perf Results from .NET 7.0

Here are some of the regressions found when running locally on the following configuration:

BenchmarkDotNet=v0.13.1.1786-nightly, OS=Windows 11 (10.0.22000.795/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-rc.1.22375.2
[Host] : .NET 7.0.0 (7.0.22.36704), X64 RyuJIT
Job-YRVDKS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

1. BenchmarksGame.BinaryTrees_5.RunBenchmark

Criteria	Segments	Regions	% Diff
Mean (ms)	90.39	107.1	18.487%
Total Memory Cleared	21139281168	21318368648	0.847%
Total Memory Committed	652763136	1282842624	96.525%
Total Number of Virtual Commit Calls	9967	21460	115.311%
Total Number of Decommit Calls	1210	1681	38.926%

2. ByteMark.BenchLUDecomp

Criteria	Segments	Regions	% Diff
Mean (ms)	879.7	910.5	3.383%
Total Memory Cleared	5208446912	5208677608	0.004%
Total Memory Committed	1023303680	3407753216	69.971%
Total Number of Virtual Commit Calls	15621	54233	71.197%
Total Number of Decommit Calls	301	1119	73.101%
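
The % Diff column can be recomputed from the raw Segments and Regions values. The sketch below assumes the column is a plain relative difference; the function names are hypothetical. Recomputing suggests that the BinaryTrees_5 figures divide by the Segments value, while the ByteMark figures match a division by the Regions value. Measured against Segments, the ByteMark increase in committed memory and commit calls would be considerably larger than the listed percentages.

```python
def diff_vs_segments(segments, regions):
    """Relative difference in percent, using the Segments value as the denominator."""
    return (regions - segments) / segments * 100.0


def diff_vs_regions(segments, regions):
    """Relative difference in percent, using the Regions value as the denominator."""
    return (regions - segments) / regions * 100.0


# (Segments, Regions) pairs copied from the two tables above.
binary_trees_5 = {
    "Total Memory Committed": (652763136, 1282842624),
    "Total Number of Virtual Commit Calls": (9967, 21460),
    "Total Number of Decommit Calls": (1210, 1681),
}
bench_lu_decomp = {
    "Total Memory Committed": (1023303680, 3407753216),
    "Total Number of Virtual Commit Calls": (15621, 54233),
    "Total Number of Decommit Calls": (301, 1119),
}

for title, table in (("BinaryTrees_5", binary_trees_5), ("BenchLUDecomp", bench_lu_decomp)):
    for criteria, (segments, regions) in table.items():
        print(
            f"{title} | {criteria}: "
            f"vs Segments {diff_vs_segments(segments, regions):.3f}%, "
            f"vs Regions {diff_vs_regions(segments, regions):.3f}%"
        )
```

Printing both conventions side by side makes it easy to see which one a given row in the tables follows.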

All the associated regressed microbenchmark issues will be added below.

How To Repro

How To Repro the Issues With Instrumentation:

1. Set the following environment variables:
   a. set complus_GCLogEnabled=1
   b. set complus_GCLogFileSize=100
   c. set complus_GCLogFile=c:\logs\<name>
   d. set complus_StressLog=0
2. Update gc.cpp and gcpriv.h with instrumentation like this PR and build the runtimes with and without regions enabled.
3. Run the individual microbenchmarks:
   a. cd <path to the performance repo>
   b. Run the benchmark using py .\scripts\benchmarks_ci.py --filter <Name of Benchmark e.g. ByteMark.BenchLUDecomp> -f net7.0 --corerun <Path to Corerun>
   c. Record the mean time from BenchmarkDotNet and the results from the instrumentation.
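
The repro steps above can be sketched as a small Python helper that prepares the environment and command line for one run. This is only an illustration under stated assumptions: `build_repro` and `run_repro` are hypothetical names, the paths passed in are placeholders, and only the environment variables and `benchmarks_ci.py` arguments listed in the steps are used.

```python
import os
import subprocess


def build_repro(benchmark, corerun, log_name):
    """Return (env, cmd) for one instrumented benchmark run.

    The environment variables and command-line arguments mirror the
    repro steps; nothing beyond what is listed there is assumed.
    """
    env = dict(os.environ)
    env.update({
        "complus_GCLogEnabled": "1",
        "complus_GCLogFileSize": "100",
        "complus_GCLogFile": "c:\\logs\\" + log_name,
        "complus_StressLog": "0",
    })
    cmd = [
        "py", ".\\scripts\\benchmarks_ci.py",
        "--filter", benchmark,
        "-f", "net7.0",
        "--corerun", corerun,
    ]
    return env, cmd


def run_repro(perf_repo, benchmark, corerun, log_name):
    """Run the benchmark from the performance repo directory (hypothetical wrapper)."""
    env, cmd = build_repro(benchmark, corerun, log_name)
    subprocess.run(cmd, env=env, cwd=perf_repo, check=True)
```

Calling `run_repro` once with the Corerun built with regions enabled and once with the Corerun built without them, using a different `log_name` each time, keeps the two instrumentation logs apart for comparison.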


ghost · 2022-08-08T21:21:57Z

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

The excessive commit/decommit behavior of Regions (in comparison to Segments) is regressing a number of microbenchmarks. Symptoms include a higher number of virtual commits and decommits, as Regions is more proactive in decommitting memory.


Author:	mrsharm
Assignees:	-
Labels:	`tenet-performance`, `area-GC-coreclr`
Milestone:	7.0.0

mrsharm · 2023-09-12T02:46:27Z

Conclusion 1: Of the benchmarks we tested, 3 still need further investigation / improvements:

1. BenchmarksGame.BinaryTrees_6.RunBench
2. System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)
3. System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)

For 1., we found an improvement (23% to 5%) between two comparisons: .NET 6’s SDK vs. .NET 6’s SDK with a clrgc.dll from .NET 8 Preview 7, and .NET 8 with Segments vs. .NET 8 with a less precise Write Barrier and Write XOR Execute turned off. This improvement, however, was still not enough to get us back to the same level as Segments.

We still need to analyze 2. and 3.; we observed that the default invocation count doesn’t capture enough samples when looking at the CPU traces.

Conclusion 2: Disabling Mark Phase Prefetch reversed regressions for several benchmarks. As far as I recall, we did have an exception for enablement of this feature.

The data are as follows:

List of Regressions With Mark Phase Prefetch (7)

Benchmark Name	Baseline	Comparand	Baseline Mean Duration (MSec)	Comparand Mean Duration (MSec)	Δ Mean Duration (MSec)	Δ% Mean Duration
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)	net6.0	net8.0	2611.66	3011.9	400.24	15.33
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)	net6.0	net8.0	128.5	146.09	17.59	13.69
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)	net6.0	net8.0	1977.51	2152.29	174.78	8.84
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net8.0	2568.05	2762.54	194.49	7.57
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)	net6.0	net8.0	88.02	94.06	6.04	6.86
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)	net6.0	net8.0	1947.47	2066.27	118.8	6.1
System.Tests.Perf_Enum.GetNames_Generic	net6.0	net8.0	20.22	21.28	1.06	5.23

List of Regressions Without Mark Phase Prefetch (2)

Benchmark Name	Baseline	Comparand	Baseline Mean Duration (MSec)	Comparand Mean Duration (MSec)	Δ Mean Duration (MSec)	Δ% Mean Duration
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)	net6.0	net8.0	128.5	144.72	16.22	12.62
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)	net6.0	net8.0	88.02	99.04	11.03	12.53
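
The effect of disabling Mark Phase Prefetch can be read off by comparing the two lists above. The following Python sketch is illustrative only: it keys on the benchmark name as printed (which is why names that appear twice in the first list are counted rather than deduplicated), and the helper name `no_longer_regressed` is hypothetical.

```python
from collections import Counter

# (benchmark, Δ% mean duration) copied from the two lists above.
with_prefetch = [
    ("System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)", 15.33),
    ("System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)", 13.69),
    ("System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)", 8.84),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)", 7.57),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)", 6.86),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)", 6.1),
    ("System.Tests.Perf_Enum.GetNames_Generic", 5.23),
]
without_prefetch = [
    ("System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)", 12.62),
    ("System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)", 12.53),
]


def no_longer_regressed(before, after):
    """Count the rows in `before` whose benchmark name is absent from `after`."""
    remaining = {name for name, _ in after}
    return Counter(name for name, _ in before if name not in remaining)


for name, rows in no_longer_regressed(with_prefetch, without_prefetch).items():
    print(f"{rows} row(s) no longer regressed: {name}")
```

On this data, the rows that drop out are the `pinned: True` allocations and `GetNames_Generic`, while the two `pinned: False` benchmarks remain in both lists.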

Conclusion 3: A lot of the regressions for which we were assigned bugs no longer show up as regressions when compared in a GC-centric way.

As a clear-cut example:

We regressed Microsoft.Extensions.Primitives.StringSegmentBenchmark by 20%: [Perf] Windows/x64: 188 Regressions from GC changes #74014 (comment)
Running the comparison in a more GC-centric way with .NET 6 (net6.0) vs. .NET 6 + the clrgc from 8 (net8.0), we get an improvement of 0.57%:

Benchmark Name	Baseline	Comparand	Baseline Mean Duration (MSec)	Comparand Mean Duration (MSec)	Δ Mean Duration (MSec)	Δ% Mean Duration
Microsoft.Extensions.Primitives.StringSegmentBenchmark.SubString	net6.0	net8.0	10.81	10.74	-0.06	-0.57

All Tests Run

BenchmarksGame.BinaryTrees_6.RunBench
System.Tests.Perf_Enum.GetNames_Generic
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)
System.Tests.Perf_GC.AllocateArray(length: 1000, pinned: False)
System.Tests.Perf_GC.AllocateArray(length: 10000, pinned: True)
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: False)
System.Tests.Perf_GC.AllocateUninitializedArray(length: 10000, pinned: True)
System.Tests.Perf_Enum.GetName_Generic_NonFlags
ByteMark.BenchLUDecomp
System.Tests.Perf_Enum.GetName_NonGeneric_Flags
System.Tests.Perf_Type.GetTypeFromHandle
BenchmarksGame.BinaryTrees_2.RunBench
BenchmarksGame.BinaryTrees_5.RunBench
System.Xml.Linq.Perf_XName.NonEmptyNameSpaceToString
System.Tests.Perf_String.Insert(s1: "Test", i: 2, s2: " Test")
System.Tests.Perf_String.Insert(s1: "dzsdzsDDZSDZSDZSddsz", i: 7, s2: "Test")
Microsoft.Extensions.Primitives.StringSegmentBenchmark.SubString

mrsharm added the tenet-performance and area-GC-coreclr labels Aug 8, 2022

mrsharm added this to the 7.0.0 milestone Aug 8, 2022

This was referenced Aug 8, 2022

Regressions in System.Tests.Perf_GC<Byte> #72477

Closed

Regressions in System.Tests.Perf_GC<Char> and ByteMark #66665

Closed

BenchmarksGame.BinaryTrees benchmarks have regressed #67958

Closed

Perf measurement changed due to BDN update #66664

Closed

mrsharm self-assigned this Aug 13, 2022

mrsharm modified the milestones: 7.0.0, 8.0.0 Aug 13, 2022

mrsharm mentioned this issue Sep 12, 2022

String.Replace(char, char) now slower in some cases #74771

Closed

mrsharm mentioned this issue Sep 23, 2022

Regressions in System.Xml.Linq.Perf_XName (FullPGO) #64626

Closed

dakersnar mentioned this issue Oct 24, 2022

Regressions in Perf_String, Perf_Stringbuilder, other on 8/9/2022 #77064

Closed

mrsharm changed the title ~~GC Perf Regression: Decommit Behavior for Regions Is Causing Regressions in Microbenchmarks~~ GC Microbenchmarks Perf Regressions: Move from Segments to Regions Aug 4, 2023

mrsharm modified the milestones: 8.0.0, 9.0.0 Aug 5, 2023

This was referenced Aug 5, 2023

[Perf] Windows/x64: 188 Regressions from GC changes #74014

Closed

GC.AllocateUninitializedArray has regressed on macOS #65198

Open

mrsharm mentioned this issue Sep 12, 2023

Microbenchmark Regressions from .NET 8 #91914

Open

mrsharm closed this as completed Sep 12, 2023

ghost locked as resolved and limited conversation to collaborators Oct 12, 2023
