More precise writebarrier for regions #98485
Port of dotnet#67389 to Native AOT.
Adds additional checks to write barriers so that the GC can do less work. Write barriers get slower with the expectation that we'll recoup the time during garbage collections.
Was hoping to see similar gains as dotnet#67389 (comment) for our self-hosted ILC, but wallclock time got maybe 1% worse instead. However GC stats in Perfview look much better:
Before:
* Total CPU Time: 33,478 msec
* Total GC CPU Time: 585 msec
* Total Allocs : 776.721 MB
* Number of Heaps: 16
* GC CPU MSec/MB Alloc : 0.753 MSec/MB
* Total GC Pause: 207.7 msec
* % Time paused for Garbage Collection: 2.4%
* % CPU Time spent Garbage Collecting: 1.7%
After:
* Total CPU Time: 33,348 msec
* Total GC CPU Time: 179 msec
* Total Allocs : 771.313 MB
* Number of Heaps: 16
* GC CPU MSec/MB Alloc : 0.232 MSec/MB
* Total GC Pause: 195.8 msec
* % Time paused for Garbage Collection: 2.3%
* % CPU Time spent Garbage Collecting: 0.5%
Opening as a draft because maybe we can do something to make these not as expensive (CoreCLR seems to have lots of tricks up its sleeve). We also need the Linux version.
Cc @dotnet/ilc-contrib
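To make "additional checks" concrete, here is a toy Python model of the general idea, not the actual assembly in this PR. It assumes the less precise barrier marks a card whenever the stored reference falls inside an ephemeral range that (with regions) spans the whole heap, and that the more precise barrier looks up the generation of both the slot and the target and marks a card only for older-to-younger stores. `ToyHeap`, the region and card sizes, and the sample stores are all hypothetical.

```python
REGION_SHIFT = 22   # hypothetical: 4 MiB regions
CARD_SHIFT = 8      # hypothetical: one card covers 256 bytes
MAX_GENERATION = 2


class ToyHeap:
    """Toy model of a region-based heap; not the runtime's data structures."""

    def __init__(self, region_generations):
        self.region_generations = list(region_generations)
        # Assumption: with regions, the ephemeral range spans the whole heap.
        self.ephemeral_low = 0
        self.ephemeral_high = len(self.region_generations) << REGION_SHIFT
        self.marked_cards = set()

    def generation_of(self, address):
        return self.region_generations[address >> REGION_SHIFT]

    def mark_card(self, slot_address):
        self.marked_cards.add(slot_address >> CARD_SHIFT)


def imprecise_barrier(heap, slot_address, target_address):
    # One range check: cheap, but marks cards for stores the GC does not care about.
    if heap.ephemeral_low <= target_address < heap.ephemeral_high:
        heap.mark_card(slot_address)


def precise_barrier(heap, slot_address, target_address):
    # Extra lookups on every store, fewer cards for the GC to scan later.
    target_generation = heap.generation_of(target_address)
    if target_generation == MAX_GENERATION:
        return  # references to the oldest generation never need a card
    if heap.generation_of(slot_address) <= target_generation:
        return  # slot is not older than the target
    heap.mark_card(slot_address)


def address(region_index, offset=0):
    return (region_index << REGION_SHIFT) + offset


def cards_marked(barrier, region_generations, stores):
    heap = ToyHeap(region_generations)
    for slot_address, target_address in stores:
        barrier(heap, slot_address, target_address)
    return len(heap.marked_cards)


REGIONS = [2, 0, 1, 2]  # generation of each region
STORES = [
    (address(0), address(1)),                 # gen2 slot -> gen0 object
    (address(1), address(1, 0x40)),           # gen0 slot -> gen0 object
    (address(3), address(0, 0x40)),           # gen2 slot -> gen2 object
    (address(2), address(1, 0x80)),           # gen1 slot -> gen0 object
    (address(1, 0x1000), address(2, 0x40)),   # gen0 slot -> gen1 object
]

print(cards_marked(imprecise_barrier, REGIONS, STORES))  # 5
print(cards_marked(precise_barrier, REGIONS, STORES))    # 2
```

In this model only the gen2 -> gen0 and gen1 -> gen0 stores mark a card under the precise barrier, at the price of two table lookups per store.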
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas |
/azp run runtime-nativeaot-outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
Cc @dotnet/gc - I noticed this is not implemented on ARM64 for JIT. Is that intentional? |
@kunalspathak, who is working on investigating ARM64 Write Barrier performance. |
It was intentional in the sense that we did not allocate time to do this work for arm64, not that this work wouldn't benefit arm64. |
@Maoni0, I will work with you offline on enabling it for arm64 |
I got some measurements for the Stage2 app with e568f75 reverted to work around #98021.
Before
Run 1
Run 2
Run 3
After
Run 1
Run 2
Run 3
I don't see this meaningfully improving anything. It's possible this is only viable if we can get the write barrier to be cheaper like CoreCLR does it (by basically regenerating write barriers as needed). I almost feel like we shouldn't proceed with this. It doesn't seem worth the risk I'm creating with all of this extra assembly. I'd hate to introduce a bug in a write barrier. |
Are you able to replicate this result on current main? You can configure the write barrier to use using The more precise write barrier should help the most for workloads that have a large heap, a lot of churn in Gen2 -> Gen0 references, and run on a machine with a lot of cores. For the two benchmarks you have tried:
The workloads with a lot of churn in Gen2 -> Gen0 references are often workloads that were optimized to use pools extensively. Object pools violate the generational hypothesis and make the GC do more work. The precise write barrier is compensating for it somewhat. |
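The pool effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `Obj`, its `generation` field, and `old_to_young_stores` are hypothetical names, and the pooled holder is assumed to have already been promoted to Gen2 because it lives for the whole run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Obj:
    generation: int
    payload: Optional["Obj"] = None


def old_to_young_stores(requests, use_pool):
    """Count stores where an older object is made to point at a younger one."""
    pooled = Obj(generation=2)  # assumption: long-lived pooled object sits in Gen2
    count = 0
    for _ in range(requests):
        # With a pool the same Gen2 holder is reused; without one, the holder is fresh.
        holder = pooled if use_pool else Obj(generation=0)
        holder.payload = Obj(generation=0)  # per-request data is freshly allocated
        if holder.generation > holder.payload.generation:
            count += 1  # this store needs a card so the GC can find the reference
    return count


print(old_to_young_stores(1000, use_pool=True))   # 1000
print(old_to_young_stores(1000, use_pool=False))  # 0
```

In the pooled case every request creates a Gen2 -> Gen0 reference, which is the kind of churn the comment says the more precise barrier helps with.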
Note that the change effectively moves a portion of "GC tax" from the collector, which runs rarely, to the barrier, which runs all the time. According to the literature (i.e. The GC book), improving barrier precision is not always a win, as the overall math may work against you. For a simple example: Another reason for moving work into the barrier could be to shorten pauses. That is, if it reduces work that must run during pauses, like compaction. However, if it just reduces the cost of marking, which can run in the background, then the impact on pauses is less interesting. When I was bringing the NativeAOT barriers up to date with CoreCLR, I was not sure if this extra precision is necessarily a win, so I was not too eager to port it. |
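The trade-off in the comment above can be put into a back-of-the-envelope cost model. This is a sketch with made-up numbers, not measurements from this PR and not the example from the original comment: every parameter (`barrier_ns`, `cards_scanned_per_gc`, `scan_ns_per_card`) is a hypothetical input.

```python
def total_cost_ns(stores, barrier_ns, gcs, cards_scanned_per_gc, scan_ns_per_card):
    """Barrier work is paid on every store; card scanning is paid once per GC."""
    return stores * barrier_ns + gcs * cards_scanned_per_gc * scan_ns_per_card


def precise_barrier_wins(stores, gcs, imprecise, precise, scan_ns_per_card):
    """imprecise and precise are (barrier_ns, cards_scanned_per_gc) pairs."""
    imprecise_total = total_cost_ns(stores, imprecise[0], gcs, imprecise[1], scan_ns_per_card)
    precise_total = total_cost_ns(stores, precise[0], gcs, precise[1], scan_ns_per_card)
    return precise_total < imprecise_total


# Hypothetical workload A: many GCs, lots of cards saved by the precise barrier.
print(precise_barrier_wins(
    stores=1_000_000_000, gcs=100,
    imprecise=(2, 1_000_000), precise=(3, 100_000),
    scan_ns_per_card=100,
))  # True: 4.0e9 ns vs 1.2e10 ns

# Hypothetical workload B: same store count, few GCs, few cards to begin with.
print(precise_barrier_wins(
    stores=1_000_000_000, gcs=10,
    imprecise=(2, 100_000), precise=(3, 10_000),
    scan_ns_per_card=100,
))  # False: 3.01e9 ns vs 2.1e9 ns
```

With the same store count, the extra nanosecond per store pays off only when collections are frequent enough and card scanning is expensive enough to outweigh it.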
I guess we'll see if this is still meaningful if/when it gets enabled for ARM64 and it gets re-measured on CoreCLR-JIT.
Yep, that's what I meant with "Write barriers get slower with the expectation that we'll recoup the time during garbage collections." in the top post. The CoreCLR version of this PR had some really good numbers associated with it (#67389 (comment)) but it doesn't match what I'm seeing. CoreCLR is able to get rid of some indirections in the write barriers due to run-time patching of the assembly code, but I'm not sure that explains why, instead of seeing a 6% improvement, we see nothing or maybe even a small regression. I would probably be able to come up with a microbenchmark where this helps a lot (also one where it hurts a lot), but we don't have any real-world benchmark we're using with native AOT where this helps. I'm going to close this. It doesn't look like it's worth spending time porting this to the Linux version and then living with the fear that a GC hole got introduced due to me messing up the assembly and nobody noticing it in review. |