More precise writebarrier for regions #98485

MichalStrehovsky · 2024-02-15T11:04:43Z

Port of #67389 to Native AOT.

Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:

Before:

Total CPU Time: 33,478 msec
Total GC CPU Time: 585 msec
Total Allocs : 776.721 MB
Number of Heaps: 16
GC CPU MSec/MB Alloc : 0.753 MSec/MB
Total GC Pause: 207.7 msec
% Time paused for Garbage Collection: 2.4%
% CPU Time spent Garbage Collecting: 1.7%

After:

Total CPU Time: 33,348 msec
Total GC CPU Time: 179 msec
Total Allocs : 771.313 MB
Number of Heaps: 16
GC CPU MSec/MB Alloc : 0.232 MSec/MB
Total GC Pause: 195.8 msec
% Time paused for Garbage Collection: 2.3%
% CPU Time spent Garbage Collecting: 0.5%

Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.

Cc @dotnet/ilc-contrib

Port of dotnet#67389 to Native AOT. Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections. Was hoping to see similar gains as dotnet#67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better: Before: * Total CPU Time: 33,478 msec * Total GC CPU Time: 585 msec * Total Allocs : 776.721 MB * Number of Heaps: 16 * GC CPU MSec/MB Alloc : 0.753 MSec/MB * Total GC Pause: 207.7 msec * % Time paused for Garbage Collection: 2.4% * % CPU Time spent Garbage Collecting: 1.7% After: * Total CPU Time: 33,348 msec * Total GC CPU Time: 179 msec * Total Allocs : 771.313 MB * Number of Heaps: 16 * GC CPU MSec/MB Alloc : 0.232 MSec/MB * Total GC Pause: 195.8 msec * % Time paused for Garbage Collection: 2.3% * % CPU Time spent Garbage Collecting: 0.5% Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.

ghost · 2024-02-15T11:04:48Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Port of #67389 to Native AOT.

Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:

Before:

Total CPU Time: 33,478 msec
Total GC CPU Time: 585 msec
Total Allocs : 776.721 MB
Number of Heaps: 16
GC CPU MSec/MB Alloc : 0.753 MSec/MB
Total GC Pause: 207.7 msec
% Time paused for Garbage Collection: 2.4%
% CPU Time spent Garbage Collecting: 1.7%

After:

Total CPU Time: 33,348 msec
Total GC CPU Time: 179 msec
Total Allocs : 771.313 MB
Number of Heaps: 16
GC CPU MSec/MB Alloc : 0.232 MSec/MB
Total GC Pause: 195.8 msec
% Time paused for Garbage Collection: 2.3%
% CPU Time spent Garbage Collecting: 0.5%

Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.

Cc @dotnet/ilc-contrib

Author:	MichalStrehovsky
Assignees:	-
Labels:	`area-NativeAOT-coreclr`
Milestone:	-

MichalStrehovsky · 2024-02-15T11:08:56Z

/azp run runtime-nativeaot-outerloop

azure-pipelines · 2024-02-15T11:09:05Z

Azure Pipelines successfully started running 1 pipeline(s).

MichalStrehovsky · 2024-02-15T21:33:56Z

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

cshung · 2024-02-16T23:37:08Z

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

@kunalspathak, who is working on investigating ARM64 Write Barrier performance.

Maoni0 · 2024-02-16T23:41:40Z

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

it was intentional as in we did not allocate time to do this work for arm64, not that this work wouldn't benefit arm64.

kunalspathak · 2024-02-16T23:44:38Z

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

it was intentional as in we did not allocate time to do this work for arm64, not that this work wouldn't benefit arm64.

@Maoni0 , I will work with you offline on enabling it for arm64

MichalStrehovsky · 2024-02-23T08:09:12Z

I got some measurements for the Stage2 app with e568f75 reverted to work around #98021.

Before

Run 1

application
Max CPU Usage (%)	88
Max Cores usage (%)	2,461
Max Working Set (MB)	140
Max Private Memory (MB)	159
Build Time (ms)	26,734
Start Time (ms)	130
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	88
Max Working Set (MB)	146
Max GC Heap Size (MB)	82
Size of committed memory by the GC (MB)	109
Max Number of Gen 0 GCs / sec	26.00
Max Number of Gen 1 GCs / sec	11.00
Max Number of Gen 2 GCs / sec	2.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	14.00
Max Gen 0 Size (B)	4,561,912
Max Gen 1 Size (B)	10,358,272
Max Gen 2 Size (B)	22,999,160
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,506,408,440
Max GC Heap Fragmentation (%)	2,647%
# of Assemblies Loaded	124
Max Exceptions (#/s)	444
Max Lock Contention (#/s)	1,921
Max ThreadPool Threads Count	154
Max ThreadPool Queue Length	166
Max ThreadPool Items (#/s)	782,408
Max Active Timers	137
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	140
Load CPU Usage - P90 (%)	86

load
Max CPU Usage (%)	26
Max Cores usage (%)	736
Max Working Set (MB)	46
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	79
Requests/sec	233,925
Requests	3,532,294
Mean latency (ms)	1.11
Max latency (ms)	23.03
Bad responses	0
Socket errors	0
Read throughput (MB/s)	111.99
Latency 50th (ms)	1.07
Latency 75th (ms)	1.21
Latency 90th (ms)	1.37
Latency 99th (ms)	2.82

Run 2

application
Max CPU Usage (%)	87
Max Cores usage (%)	2,438
Max Working Set (MB)	146
Max Private Memory (MB)	170
Build Time (ms)	26,651
Start Time (ms)	133
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	87
Max Working Set (MB)	153
Max GC Heap Size (MB)	80
Size of committed memory by the GC (MB)	117
Max Number of Gen 0 GCs / sec	26.00
Max Number of Gen 1 GCs / sec	10.00
Max Number of Gen 2 GCs / sec	2.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	14.00
Max Gen 0 Size (B)	2,037,384
Max Gen 1 Size (B)	5,638,440
Max Gen 2 Size (B)	27,481,280
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,518,408,496
Max GC Heap Fragmentation (%)	4,254%
# of Assemblies Loaded	124
Max Exceptions (#/s)	458
Max Lock Contention (#/s)	1,904
Max ThreadPool Threads Count	135
Max ThreadPool Queue Length	129
Max ThreadPool Items (#/s)	789,902
Max Active Timers	155
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	146
Load CPU Usage - P90 (%)	85

load
Max CPU Usage (%)	26
Max Cores usage (%)	737
Max Working Set (MB)	46
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	81
Requests/sec	234,916
Requests	3,547,244
Mean latency (ms)	1.10
Max latency (ms)	24.28
Bad responses	0
Socket errors	0
Read throughput (MB/s)	112.46
Latency 50th (ms)	1.06
Latency 75th (ms)	1.21
Latency 90th (ms)	1.36
Latency 99th (ms)	2.84

Run 3

application
Max CPU Usage (%)	87
Max Cores usage (%)	2,430
Max Working Set (MB)	153
Max Private Memory (MB)	172
Build Time (ms)	26,652
Start Time (ms)	144
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	87
Max Working Set (MB)	160
Max GC Heap Size (MB)	95
Size of committed memory by the GC (MB)	122
Max Number of Gen 0 GCs / sec	25.00
Max Number of Gen 1 GCs / sec	8.00
Max Number of Gen 2 GCs / sec	2.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	15.00
Max Gen 0 Size (B)	22,722,240
Max Gen 1 Size (B)	13,050,752
Max Gen 2 Size (B)	25,841,752
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,503,312,264
Max GC Heap Fragmentation (%)	6,893%
# of Assemblies Loaded	124
Max Exceptions (#/s)	458
Max Lock Contention (#/s)	2,069
Max ThreadPool Threads Count	156
Max ThreadPool Queue Length	154
Max ThreadPool Items (#/s)	781,918
Max Active Timers	138
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	153
Load CPU Usage - P90 (%)	86

load
Max CPU Usage (%)	26
Max Cores usage (%)	735
Max Working Set (MB)	46
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	87
Requests/sec	234,420
Requests	3,539,676
Mean latency (ms)	1.11
Max latency (ms)	21.48
Bad responses	0
Socket errors	0
Read throughput (MB/s)	112.23
Latency 50th (ms)	1.07
Latency 75th (ms)	1.21
Latency 90th (ms)	1.37
Latency 99th (ms)	2.83

After

Run 1

application
Max CPU Usage (%)	87
Max Cores usage (%)	2,423
Max Working Set (MB)	151
Max Private Memory (MB)	169
Build Time (ms)	26,826
Start Time (ms)	137
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	88
Max Working Set (MB)	158
Max GC Heap Size (MB)	95
Size of committed memory by the GC (MB)	120
Max Number of Gen 0 GCs / sec	26.00
Max Number of Gen 1 GCs / sec	10.00
Max Number of Gen 2 GCs / sec	3.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	15.00
Max Gen 0 Size (B)	2,843,664
Max Gen 1 Size (B)	12,106,776
Max Gen 2 Size (B)	25,266,304
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,527,291,080
Max GC Heap Fragmentation (%)	2,750%
# of Assemblies Loaded	124
Max Exceptions (#/s)	456
Max Lock Contention (#/s)	1,793
Max ThreadPool Threads Count	141
Max ThreadPool Queue Length	139
Max ThreadPool Items (#/s)	794,675
Max Active Timers	139
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	151
Load CPU Usage - P90 (%)	86

load
Max CPU Usage (%)	26
Max Cores usage (%)	735
Max Working Set (MB)	45
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	81
Requests/sec	235,938
Requests	3,562,489
Mean latency (ms)	1.09
Max latency (ms)	23.80
Bad responses	0
Socket errors	0
Read throughput (MB/s)	112.95
Latency 50th (ms)	1.06
Latency 75th (ms)	1.21
Latency 90th (ms)	1.35
Latency 99th (ms)	2.60

Run 2

application
Max CPU Usage (%)	87
Max Cores usage (%)	2,441
Max Working Set (MB)	139
Max Private Memory (MB)	157
Build Time (ms)	27,528
Start Time (ms)	140
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	87
Max Working Set (MB)	145
Max GC Heap Size (MB)	83
Size of committed memory by the GC (MB)	108
Max Number of Gen 0 GCs / sec	26.00
Max Number of Gen 1 GCs / sec	16.00
Max Number of Gen 2 GCs / sec	2.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	13.00
Max Gen 0 Size (B)	2,704,488
Max Gen 1 Size (B)	5,858,632
Max Gen 2 Size (B)	20,285,960
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,528,119,392
Max GC Heap Fragmentation (%)	2,114%
# of Assemblies Loaded	124
Max Exceptions (#/s)	466
Max Lock Contention (#/s)	1,823
Max ThreadPool Threads Count	148
Max ThreadPool Queue Length	128
Max ThreadPool Items (#/s)	795,102
Max Active Timers	159
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	139
Load CPU Usage - P90 (%)	86

load
Max CPU Usage (%)	26
Max Cores usage (%)	733
Max Working Set (MB)	45
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	80
Requests/sec	236,883
Requests	3,576,893
Mean latency (ms)	1.09
Max latency (ms)	26.11
Bad responses	0
Socket errors	0
Read throughput (MB/s)	113.41
Latency 50th (ms)	1.06
Latency 75th (ms)	1.21
Latency 90th (ms)	1.35
Latency 99th (ms)	2.59

Run 3

application
Max CPU Usage (%)	86
Max Cores usage (%)	2,416
Max Working Set (MB)	139
Max Private Memory (MB)	157
Build Time (ms)	31,503
Start Time (ms)	138
Published Size (KB)	102,129
Symbols Size (KB)	83,729
.NET Core SDK Version	9.0.100-preview.2.24122.2
ASP.NET Core Version	9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version	9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%)	87
Max Working Set (MB)	145
Max GC Heap Size (MB)	88
Size of committed memory by the GC (MB)	107
Max Number of Gen 0 GCs / sec	26.00
Max Number of Gen 1 GCs / sec	13.00
Max Number of Gen 2 GCs / sec	2.00
Max Gen 0 GC Budget (MB)	70
Max Time in GC (%)	14.00
Max Gen 0 Size (B)	29,662,216
Max Gen 1 Size (B)	6,286,400
Max Gen 2 Size (B)	23,141,320
Max LOH Size (B)	219,088
Max POH Size (B)	1,253,192
Max Allocation Rate (B/sec)	1,515,032,792
Max GC Heap Fragmentation (%)	7,295%
# of Assemblies Loaded	124
Max Exceptions (#/s)	458
Max Lock Contention (#/s)	1,733
Max ThreadPool Threads Count	148
Max ThreadPool Queue Length	177
Max ThreadPool Items (#/s)	787,381
Max Active Timers	148
IL Jitted (B)	0
Methods Jitted	0
Load Working Set - P90 (MB)	139
Load CPU Usage - P90 (%)	86

load
Max CPU Usage (%)	26
Max Cores usage (%)	735
Max Working Set (MB)	46
Max Private Memory (MB)	358
Start Time (ms)	0
First Request (ms)	85
Requests/sec	234,890
Requests	3,546,791
Mean latency (ms)	1.10
Max latency (ms)	26.37
Bad responses	0
Socket errors	0
Read throughput (MB/s)	112.45
Latency 50th (ms)	1.07
Latency 75th (ms)	1.21
Latency 90th (ms)	1.35
Latency 99th (ms)	2.59

I don't see this meaninfully improving anything.

It's possible this is only viable if we can get the write barrier to be cheaper like CoreCLR does it (by basically regenerating write barriers as needed).

I almost feel like we shouldn't proceed with this. It doesn't seem worth the risk I'm creating with all of this extra assembly. I'd hate to introduce a bug in a write barrier.

jkotas · 2024-02-23T17:20:24Z

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC

Are you able to replicate this result on current main? You can configure the write barrier to use using DOTNET_GCWriteBarrier. It is possible that both ILC and GC have been optimized in the meantime and the more precise write barrier does not make a difference for it anymore.

The more precise write barrier should help the most for workloads that have large heap, a lot of churn in Gen2 -> Gen0 references, and run on a machine with a lot of cores. For the two benchmarks you have tried:

The heap in Goldilocks microbenchmarks is probably too small
The heap in ILC may be big enough, but there is not enough churn in Gen2 -> Gen0 references for the precise write barrier to make a difference.

The workloads with a lot of churn in Gen2 -> Gen0 references are often workloads that were optimized to use pools extensively. Object pools violate the generational hypothesis and make the GC to do more work. The precise write barrier is compensating for it somewhat.

VSadov · 2024-02-24T18:16:08Z

I don't see this meaninfully improving anything.

Note that the change effectively moves a portion of "GC tax" from the collector, which runs rarely, to the barrier which runs all the time. According to the literature (i.e. The GC book) improving barrier precision is not always a win as overall math may work against you.

For a simple example:
Let's say you spend 6% time in the barrier and 3% time in the collector.
It is easy to regress the barrier by 10%, since it is such a small piece of code. But then you need to make collector 20% faster just to break even. Which could be hard.
You`d often need considerable wins in collector to justify extra barrier complexity. It is not impossible (thus barriers are typically not completely "dumb"), but hard.

Another reason for moving work into the barrier could be to shorten pauses. That is if it reduces work that must run during pauses, like compaction. However, if it just reduces cost of marking, which can run in the background, then impact on pauses is less interesting.

When I was making NativeAOT barriers up to date with CoreCLR, I was not sure if this extra precision is necessarily a win, so I was not too eager with porting it.
(the part that we did not do this on all the platforms in CoreCLR, was another hint that missing this piece is not a huge deal)

MichalStrehovsky · 2024-02-26T06:14:53Z

Are you able to replicate this result on current main?

I guess we'll see if this is still meaningful if/when it gets enabled for ARM64 and it gets re-measured on CoreCLR-JIT.

Note that the change effectively moves a portion of "GC tax" from the collector, which runs rarely, to the barrier which runs all the time.

Yep, that's what I meant with "Write barriers get slower with the expectation that we'll recoup the time during garbage collections." in the top post. The CoreCLR version of this PR had some really good numbers associated with it (#67389 (comment)) but it doesn't match what I'm seeing.

CoreCLR is able to get rid of some indirections in the write barriers due to run-time patching of the assembly code, but I'm not sure if that explains that instead of seeing a 6% improvement, we see nothing/maybe even a small regression.

I would probably be able to come up with a microbenchmark where this helps a lot (also one where it hurts a lot) but we don't have any real-word benchmark we're using with native AOT where this helps.

I'm going to close this. It doesn't look like it's worth spending time porting this to the Linux version and then living with the fear that a GC hole got introduced due to me messing up the assembly and nobody noticing it in review.

MichalStrehovsky added the area-NativeAOT-coreclr label Feb 15, 2024

ghost assigned MichalStrehovsky Feb 15, 2024

build-analysis bot mentioned this pull request Feb 15, 2024

Unhandled exception. System.Exception: Profilee returned exit code 139 instead of expected exit code 100 #98500

Closed

MichalStrehovsky closed this Feb 26, 2024

github-actions bot locked and limited conversation to collaborators Mar 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More precise writebarrier for regions #98485

More precise writebarrier for regions #98485

MichalStrehovsky commented Feb 15, 2024

ghost commented Feb 15, 2024

MichalStrehovsky commented Feb 15, 2024

azure-pipelines bot commented Feb 15, 2024

MichalStrehovsky commented Feb 15, 2024

cshung commented Feb 16, 2024

Maoni0 commented Feb 16, 2024

kunalspathak commented Feb 16, 2024

MichalStrehovsky commented Feb 23, 2024

jkotas commented Feb 23, 2024 •

edited

Loading

VSadov commented Feb 24, 2024 •

edited

Loading

MichalStrehovsky commented Feb 26, 2024

More precise writebarrier for regions #98485

More precise writebarrier for regions #98485

Conversation

MichalStrehovsky commented Feb 15, 2024

ghost commented Feb 15, 2024

MichalStrehovsky commented Feb 15, 2024

azure-pipelines bot commented Feb 15, 2024

MichalStrehovsky commented Feb 15, 2024

cshung commented Feb 16, 2024

Maoni0 commented Feb 16, 2024

kunalspathak commented Feb 16, 2024

MichalStrehovsky commented Feb 23, 2024

Before

Run 1

Run 2

Run 3

After

Run 1

Run 2

Run 3

jkotas commented Feb 23, 2024 • edited Loading

VSadov commented Feb 24, 2024 • edited Loading

MichalStrehovsky commented Feb 26, 2024

jkotas commented Feb 23, 2024 •

edited

Loading

VSadov commented Feb 24, 2024 •

edited

Loading