Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More precise writebarrier for regions #98485

Closed
wants to merge 1 commit into from

Conversation

MichalStrehovsky
Copy link
Member

Port of #67389 to Native AOT.

Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:

Before:

  • Total CPU Time: 33,478 msec
  • Total GC CPU Time: 585 msec
  • Total Allocs : 776.721 MB
  • Number of Heaps: 16
  • GC CPU MSec/MB Alloc : 0.753 MSec/MB
  • Total GC Pause: 207.7 msec
  • % Time paused for Garbage Collection: 2.4%
  • % CPU Time spent Garbage Collecting: 1.7%

After:

  • Total CPU Time: 33,348 msec
  • Total GC CPU Time: 179 msec
  • Total Allocs : 771.313 MB
  • Number of Heaps: 16
  • GC CPU MSec/MB Alloc : 0.232 MSec/MB
  • Total GC Pause: 195.8 msec
  • % Time paused for Garbage Collection: 2.3%
  • % CPU Time spent Garbage Collecting: 0.5%

Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.

Cc @dotnet/ilc-contrib

Port of dotnet#67389 to Native AOT.

Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.

Was hoping to see similar gains as dotnet#67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:

Before:

* Total CPU Time: 33,478 msec
* Total GC CPU Time: 585 msec
* Total Allocs : 776.721 MB
* Number of Heaps: 16
* GC CPU MSec/MB Alloc : 0.753 MSec/MB
* Total GC Pause: 207.7 msec
* % Time paused for Garbage Collection: 2.4%
* % CPU Time spent Garbage Collecting: 1.7%

After:

* Total CPU Time: 33,348 msec
* Total GC CPU Time: 179 msec
* Total Allocs : 771.313 MB
* Number of Heaps: 16
* GC CPU MSec/MB Alloc : 0.232 MSec/MB
* Total GC Pause: 195.8 msec
* % Time paused for Garbage Collection: 2.3%
* % CPU Time spent Garbage Collecting: 0.5%

Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.
@ghost
Copy link

ghost commented Feb 15, 2024

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Port of #67389 to Native AOT.

Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:

Before:

  • Total CPU Time: 33,478 msec
  • Total GC CPU Time: 585 msec
  • Total Allocs : 776.721 MB
  • Number of Heaps: 16
  • GC CPU MSec/MB Alloc : 0.753 MSec/MB
  • Total GC Pause: 207.7 msec
  • % Time paused for Garbage Collection: 2.4%
  • % CPU Time spent Garbage Collecting: 1.7%

After:

  • Total CPU Time: 33,348 msec
  • Total GC CPU Time: 179 msec
  • Total Allocs : 771.313 MB
  • Number of Heaps: 16
  • GC CPU MSec/MB Alloc : 0.232 MSec/MB
  • Total GC Pause: 195.8 msec
  • % Time paused for Garbage Collection: 2.3%
  • % CPU Time spent Garbage Collecting: 0.5%

Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.

Cc @dotnet/ilc-contrib

Author: MichalStrehovsky
Assignees: -
Labels:

area-NativeAOT-coreclr

Milestone: -

@MichalStrehovsky
Copy link
Member Author

/azp run runtime-nativeaot-outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@MichalStrehovsky
Copy link
Member Author

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

@cshung
Copy link
Member

cshung commented Feb 16, 2024

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

@kunalspathak, who is working on investigating ARM64 Write Barrier performance.

@Maoni0
Copy link
Member

Maoni0 commented Feb 16, 2024

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

it was intentional as in we did not allocate time to do this work for arm64, not that this work wouldn't benefit arm64.

@kunalspathak
Copy link
Member

Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional?

it was intentional as in we did not allocate time to do this work for arm64, not that this work wouldn't benefit arm64.

@Maoni0 , I will work with you offline on enabling it for arm64

@MichalStrehovsky
Copy link
Member Author

I got some measurements for the Stage2 app with e568f75 reverted to work around #98021.

Before

Run 1

application
Max CPU Usage (%) 88
Max Cores usage (%) 2,461
Max Working Set (MB) 140
Max Private Memory (MB) 159
Build Time (ms) 26,734
Start Time (ms) 130
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 88
Max Working Set (MB) 146
Max GC Heap Size (MB) 82
Size of committed memory by the GC (MB) 109
Max Number of Gen 0 GCs / sec 26.00
Max Number of Gen 1 GCs / sec 11.00
Max Number of Gen 2 GCs / sec 2.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 14.00
Max Gen 0 Size (B) 4,561,912
Max Gen 1 Size (B) 10,358,272
Max Gen 2 Size (B) 22,999,160
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,506,408,440
Max GC Heap Fragmentation (%) 2,647%
# of Assemblies Loaded 124
Max Exceptions (#/s) 444
Max Lock Contention (#/s) 1,921
Max ThreadPool Threads Count 154
Max ThreadPool Queue Length 166
Max ThreadPool Items (#/s) 782,408
Max Active Timers 137
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 140
Load CPU Usage - P90 (%) 86
load
Max CPU Usage (%) 26
Max Cores usage (%) 736
Max Working Set (MB) 46
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 79
Requests/sec 233,925
Requests 3,532,294
Mean latency (ms) 1.11
Max latency (ms) 23.03
Bad responses 0
Socket errors 0
Read throughput (MB/s) 111.99
Latency 50th (ms) 1.07
Latency 75th (ms) 1.21
Latency 90th (ms) 1.37
Latency 99th (ms) 2.82

Run 2

application
Max CPU Usage (%) 87
Max Cores usage (%) 2,438
Max Working Set (MB) 146
Max Private Memory (MB) 170
Build Time (ms) 26,651
Start Time (ms) 133
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 87
Max Working Set (MB) 153
Max GC Heap Size (MB) 80
Size of committed memory by the GC (MB) 117
Max Number of Gen 0 GCs / sec 26.00
Max Number of Gen 1 GCs / sec 10.00
Max Number of Gen 2 GCs / sec 2.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 14.00
Max Gen 0 Size (B) 2,037,384
Max Gen 1 Size (B) 5,638,440
Max Gen 2 Size (B) 27,481,280
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,518,408,496
Max GC Heap Fragmentation (%) 4,254%
# of Assemblies Loaded 124
Max Exceptions (#/s) 458
Max Lock Contention (#/s) 1,904
Max ThreadPool Threads Count 135
Max ThreadPool Queue Length 129
Max ThreadPool Items (#/s) 789,902
Max Active Timers 155
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 146
Load CPU Usage - P90 (%) 85
load
Max CPU Usage (%) 26
Max Cores usage (%) 737
Max Working Set (MB) 46
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 81
Requests/sec 234,916
Requests 3,547,244
Mean latency (ms) 1.10
Max latency (ms) 24.28
Bad responses 0
Socket errors 0
Read throughput (MB/s) 112.46
Latency 50th (ms) 1.06
Latency 75th (ms) 1.21
Latency 90th (ms) 1.36
Latency 99th (ms) 2.84

Run 3

application
Max CPU Usage (%) 87
Max Cores usage (%) 2,430
Max Working Set (MB) 153
Max Private Memory (MB) 172
Build Time (ms) 26,652
Start Time (ms) 144
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 87
Max Working Set (MB) 160
Max GC Heap Size (MB) 95
Size of committed memory by the GC (MB) 122
Max Number of Gen 0 GCs / sec 25.00
Max Number of Gen 1 GCs / sec 8.00
Max Number of Gen 2 GCs / sec 2.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 15.00
Max Gen 0 Size (B) 22,722,240
Max Gen 1 Size (B) 13,050,752
Max Gen 2 Size (B) 25,841,752
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,503,312,264
Max GC Heap Fragmentation (%) 6,893%
# of Assemblies Loaded 124
Max Exceptions (#/s) 458
Max Lock Contention (#/s) 2,069
Max ThreadPool Threads Count 156
Max ThreadPool Queue Length 154
Max ThreadPool Items (#/s) 781,918
Max Active Timers 138
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 153
Load CPU Usage - P90 (%) 86
load
Max CPU Usage (%) 26
Max Cores usage (%) 735
Max Working Set (MB) 46
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 87
Requests/sec 234,420
Requests 3,539,676
Mean latency (ms) 1.11
Max latency (ms) 21.48
Bad responses 0
Socket errors 0
Read throughput (MB/s) 112.23
Latency 50th (ms) 1.07
Latency 75th (ms) 1.21
Latency 90th (ms) 1.37
Latency 99th (ms) 2.83

After

Run 1

application
Max CPU Usage (%) 87
Max Cores usage (%) 2,423
Max Working Set (MB) 151
Max Private Memory (MB) 169
Build Time (ms) 26,826
Start Time (ms) 137
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 88
Max Working Set (MB) 158
Max GC Heap Size (MB) 95
Size of committed memory by the GC (MB) 120
Max Number of Gen 0 GCs / sec 26.00
Max Number of Gen 1 GCs / sec 10.00
Max Number of Gen 2 GCs / sec 3.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 15.00
Max Gen 0 Size (B) 2,843,664
Max Gen 1 Size (B) 12,106,776
Max Gen 2 Size (B) 25,266,304
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,527,291,080
Max GC Heap Fragmentation (%) 2,750%
# of Assemblies Loaded 124
Max Exceptions (#/s) 456
Max Lock Contention (#/s) 1,793
Max ThreadPool Threads Count 141
Max ThreadPool Queue Length 139
Max ThreadPool Items (#/s) 794,675
Max Active Timers 139
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 151
Load CPU Usage - P90 (%) 86
load
Max CPU Usage (%) 26
Max Cores usage (%) 735
Max Working Set (MB) 45
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 81
Requests/sec 235,938
Requests 3,562,489
Mean latency (ms) 1.09
Max latency (ms) 23.80
Bad responses 0
Socket errors 0
Read throughput (MB/s) 112.95
Latency 50th (ms) 1.06
Latency 75th (ms) 1.21
Latency 90th (ms) 1.35
Latency 99th (ms) 2.60

Run 2

application
Max CPU Usage (%) 87
Max Cores usage (%) 2,441
Max Working Set (MB) 139
Max Private Memory (MB) 157
Build Time (ms) 27,528
Start Time (ms) 140
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 87
Max Working Set (MB) 145
Max GC Heap Size (MB) 83
Size of committed memory by the GC (MB) 108
Max Number of Gen 0 GCs / sec 26.00
Max Number of Gen 1 GCs / sec 16.00
Max Number of Gen 2 GCs / sec 2.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 13.00
Max Gen 0 Size (B) 2,704,488
Max Gen 1 Size (B) 5,858,632
Max Gen 2 Size (B) 20,285,960
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,528,119,392
Max GC Heap Fragmentation (%) 2,114%
# of Assemblies Loaded 124
Max Exceptions (#/s) 466
Max Lock Contention (#/s) 1,823
Max ThreadPool Threads Count 148
Max ThreadPool Queue Length 128
Max ThreadPool Items (#/s) 795,102
Max Active Timers 159
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 139
Load CPU Usage - P90 (%) 86
load
Max CPU Usage (%) 26
Max Cores usage (%) 733
Max Working Set (MB) 45
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 80
Requests/sec 236,883
Requests 3,576,893
Mean latency (ms) 1.09
Max latency (ms) 26.11
Bad responses 0
Socket errors 0
Read throughput (MB/s) 113.41
Latency 50th (ms) 1.06
Latency 75th (ms) 1.21
Latency 90th (ms) 1.35
Latency 99th (ms) 2.59

Run 3

application
Max CPU Usage (%) 86
Max Cores usage (%) 2,416
Max Working Set (MB) 139
Max Private Memory (MB) 157
Build Time (ms) 31,503
Start Time (ms) 138
Published Size (KB) 102,129
Symbols Size (KB) 83,729
.NET Core SDK Version 9.0.100-preview.2.24122.2
ASP.NET Core Version 9.0.0-preview.2.24121.1+cc5bc6b56dc0
.NET Runtime Version 9.0.0-preview.2.24122.1+d908f0030558
Max CPU Usage (%) 87
Max Working Set (MB) 145
Max GC Heap Size (MB) 88
Size of committed memory by the GC (MB) 107
Max Number of Gen 0 GCs / sec 26.00
Max Number of Gen 1 GCs / sec 13.00
Max Number of Gen 2 GCs / sec 2.00
Max Gen 0 GC Budget (MB) 70
Max Time in GC (%) 14.00
Max Gen 0 Size (B) 29,662,216
Max Gen 1 Size (B) 6,286,400
Max Gen 2 Size (B) 23,141,320
Max LOH Size (B) 219,088
Max POH Size (B) 1,253,192
Max Allocation Rate (B/sec) 1,515,032,792
Max GC Heap Fragmentation (%) 7,295%
# of Assemblies Loaded 124
Max Exceptions (#/s) 458
Max Lock Contention (#/s) 1,733
Max ThreadPool Threads Count 148
Max ThreadPool Queue Length 177
Max ThreadPool Items (#/s) 787,381
Max Active Timers 148
IL Jitted (B) 0
Methods Jitted 0
Load Working Set - P90 (MB) 139
Load CPU Usage - P90 (%) 86
load
Max CPU Usage (%) 26
Max Cores usage (%) 735
Max Working Set (MB) 46
Max Private Memory (MB) 358
Start Time (ms) 0
First Request (ms) 85
Requests/sec 234,890
Requests 3,546,791
Mean latency (ms) 1.10
Max latency (ms) 26.37
Bad responses 0
Socket errors 0
Read throughput (MB/s) 112.45
Latency 50th (ms) 1.07
Latency 75th (ms) 1.21
Latency 90th (ms) 1.35
Latency 99th (ms) 2.59

I don't see this meaninfully improving anything.

It's possible this is only viable if we can get the write barrier to be cheaper like CoreCLR does it (by basically regenerating write barriers as needed).

I almost feel like we shouldn't proceed with this. It doesn't seem worth the risk I'm creating with all of this extra assembly. I'd hate to introduce a bug in a write barrier.

@jkotas
Copy link
Member

jkotas commented Feb 23, 2024

Was hoping to see similar gains as #67389 (comment) for our self-hosted ILC

Are you able to replicate this result on current main? You can configure the write barrier to use using DOTNET_GCWriteBarrier. It is possible that both ILC and GC have been optimized in the meantime and the more precise write barrier does not make a difference for it anymore.

The more precise write barrier should help the most for workloads that have large heap, a lot of churn in Gen2 -> Gen0 references, and run on a machine with a lot of cores. For the two benchmarks you have tried:

  • The heap in Goldilocks microbenchmarks is probably too small
  • The heap in ILC may be big enough, but there is not enough churn in Gen2 -> Gen0 references for the precise write barrier to make a difference.

The workloads with a lot of churn in Gen2 -> Gen0 references are often workloads that were optimized to use pools extensively. Object pools violate the generational hypothesis and make the GC to do more work. The precise write barrier is compensating for it somewhat.

@VSadov
Copy link
Member

VSadov commented Feb 24, 2024

I don't see this meaninfully improving anything.

Note that the change effectively moves a portion of "GC tax" from the collector, which runs rarely, to the barrier which runs all the time. According to the literature (i.e. The GC book) improving barrier precision is not always a win as overall math may work against you.

For a simple example:
Let's say you spend 6% time in the barrier and 3% time in the collector.
It is easy to regress the barrier by 10%, since it is such a small piece of code. But then you need to make collector 20% faster just to break even. Which could be hard.
You`d often need considerable wins in collector to justify extra barrier complexity. It is not impossible (thus barriers are typically not completely "dumb"), but hard.

Another reason for moving work into the barrier could be to shorten pauses. That is if it reduces work that must run during pauses, like compaction. However, if it just reduces cost of marking, which can run in the background, then impact on pauses is less interesting.

When I was making NativeAOT barriers up to date with CoreCLR, I was not sure if this extra precision is necessarily a win, so I was not too eager with porting it.
(the part that we did not do this on all the platforms in CoreCLR, was another hint that missing this piece is not a huge deal)

@MichalStrehovsky
Copy link
Member Author

Are you able to replicate this result on current main?

I guess we'll see if this is still meaningful if/when it gets enabled for ARM64 and it gets re-measured on CoreCLR-JIT.

Note that the change effectively moves a portion of "GC tax" from the collector, which runs rarely, to the barrier which runs all the time.

Yep, that's what I meant with "Write barriers get slower with the expectation that we'll recoup the time during garbage collections." in the top post. The CoreCLR version of this PR had some really good numbers associated with it (#67389 (comment)) but it doesn't match what I'm seeing.

CoreCLR is able to get rid of some indirections in the write barriers due to run-time patching of the assembly code, but I'm not sure if that explains that instead of seeing a 6% improvement, we see nothing/maybe even a small regression.

I would probably be able to come up with a microbenchmark where this helps a lot (also one where it hurts a lot) but we don't have any real-word benchmark we're using with native AOT where this helps.

I'm going to close this. It doesn't look like it's worth spending time porting this to the Linux version and then living with the fear that a GC hole got introduced due to me messing up the assembly and nobody noticing it in review.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants