Stabilize performance measurement #43227

kunalspathak · 2020-10-09T18:15:24Z

The code quality and performance of RyuJIT is tracked internally by running MicroBenchmarks in our performance lab. We regularly triage the performance issues opened by the .NET performance team. After going through these issues for past several months, we have identified some key points.

Stability

Many times, the set of commits that are flagged as introducing regression in a benchmark, do not touch the code that is tested in the benchmark. In fact, the assembly code generated for the .NET code that is being tested is often identical and yet the measurements show differences. Some of our investigation reveals that the fluctuation in the benchmark measurements happen because of the misalignment of generated JIT code in process memory. Below is an example of LoopReturn benchmark that shows such behavior.

It is very time consuming for .NET developers to do the analysis of benchmarks that regressed because of things that are out of control of .NET runtime. In the past, we have closed several issues like #13770, #39721 and #39722 because they were regressions because of code alignment. A great example that we found out while investigating those issues was the change introduced in #38586 eliminated a test instruction and should have showed improvement in the benchmarks, but introduced regression because the code (loop code inside method) now gets misaligned and the method runs slower.

Alignment issues was brought up few times in #9912 and #8108 and this issue tracks the progress towards the goal of stabilizing and possibly improving the performance of .NET apps that are heavily affected because of code alignment.

Performance lab infrastructure

Once we address the code alignment issue, the next big thing will be to identify and make required infrastructure changes in our performance lab to make sure that it can easily flag such issues without needing much interaction from .NET developers. For example, dotnet/BenchmarkDotNet#1513 proposes to make memory alignment in the benchmark run random to catch these issues early and once we address the underlying problem in .NET, we should never see bimodal behavior of those benchmarks. After that, if the performance lab does find a regression in the benchmark, we need to have robust tooling support to get possible metrics from performance runs so that a developer doing the investigation can easily identify the cause of regression. For example, identifying the time spent in various phases of .NET runtime like Jitting, Jit interface, Tier0/Tier1 JIT code, hot methods, instructions retired during benchmark execution and so forth.

Reliable benchmarks collection

Lastly, for developers working on JIT, we want to identify set of benchmarks that are stable enough and can be trusted to give us reliable measurement whenever there is a need to verify the performance for changes done to the JIT codebase. This will help us conduct performance testing ahead of time and identify potential regressions rather than waiting it to happen in performance lab.

Here are set of work items that we have identified to achieve all the above:

Code alignment work

Make method entries of JIT code having loops aligned at 32-byte boundary for xarch. (Done in Fix condition to align methods to 32 bytes boundary #42909)
Devise a mechanism to identify inner loops (the ones that don't have more loops inside) present in a method. (Done: Align inner loops #44370)
Experiment with heuristics that should decide if a particular inner loop needs alignment or not. (Done: Align inner loops #44370)
Calculate the padding that needs to be added to align the identified inner loop. (Done: Align inner loops #44370)
Basic: Add the padding near the loop header (either at the end of previous block or at the beginning of the loop header). (Done: Align inner loops #44370)
Before merging above work, measure the impact of loop alignment on Microbenchmarks. Tracked in Measure loop alignment's performance impact on Microbenchmarks #44051. Also in Stabilize performance measurement #43227 (comment)
Identify benchmarks that get stabilized or whose performance is improved by the above work. (Update: See the analysis starting from Stabilize performance measurement #43227 (comment))

Future work

Make method entries of JIT code having loops aligned at relevant boundary for arm32/arm64. See section 4.8 of https://developer.arm.com/documentation/swog309707/a.
Combine branch tightening and loop alignment adjustments in single phase.
Advanced: If padding proves to be costly, have a way to spread the padding throughout the method so the loop header gets aligned. Account the padding while doing branch tensioning.
- Add padding at the blind spot like after jmp or ret instruction that comes before align instruction.
- Add padding at the end of blocks that has lower weight than the block that precedes the loop header block.
- Explore option to have a method misaligned such that we can skip the padding needed for loops and they get auto-aligned or aligned with minimal padding.
Perform loop alignment similar to Align inner loops #44370 for R2R code.
For x86, we should improve the encoding we use for multiple size NOP instructions. Today, we just output repeated single byte 90 , but could do better like we do for x64.
Explore option to align enclosing loops if it can borrow some of the padding needed for inner loop. E.g. in a nested loop, if there is an inner loop that needs padding of 10 bytes and the outer loop can be aligned by adding padding of 4 bytes, then add padding of 4 bytes to outer loop and 6 bytes to inner loop. That way, both the loops are aligned.

Performance tooling work

Add random memory alignment. Tracked by Make memory alignment more random BenchmarkDotNet#1513 and Refactor initialization logic to allow for enabling Memory Randomization performance#1587
Add various tooling in PerfView, BenchmarkDotNet analyzers to add the required metric. (WIP)
Perform superpmi collection of microbenchmarks so codegen team can easily test the impact of their change on generated code of various benchmarks. (Superpmi on Microbenchmarks #47900)
Add a new event to track memory allocated for JIT code that can be surfaced in PerfView through JITStats. PR: New event to track memory allocation for JIT code #44030 and Display JIT Allocated Heap Size microsoft/perfview#1289

The text was updated successfully, but these errors were encountered:

kunalspathak · 2020-10-09T18:15:40Z

@dotnet/jit-contrib , @danmosemsft

danmoseley · 2020-10-09T19:00:52Z

cc @DrewScoggins @billwert @adamsitnik

danmoseley · 2020-10-09T19:29:31Z

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

kunalspathak · 2020-10-09T19:32:30Z

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

Yes, there will definitely be more reason for bimodality, but alignment will fix most of the obvious ones that we know of and then, it will be easier for us to focus one remaining ones. Currently, there are just too many bimodal benchmarks.

JulieLeeMSFT · 2020-11-02T21:46:57Z

Bolded WIP texts in the plan.

danmoseley · 2021-01-27T21:38:28Z

As an aside, is there more than BDN can do to help regularize data alignment, when the buffer is allocated by the test and reused?

Never mind, I forgot about dotnet/performance#1587 etc. we're attacking both sides. 😊

kunalspathak · 2021-02-04T00:04:31Z

Another round of analysis.

Stability

I went through the stability data for windows x64 and linux x64 that @DrewScoggins shared with me. In this report, "regression" means that loop alignment introduced more variance to the measurements and thus made the benchmark instable, while "improvements" meant that the loop alignment work actually stabilized the benchmark and reduced the variance. I analyzed around 150 benchmarks to see if the regressions were real by checking at the historical data of Windows x64 and Linux x64 for each of the 150 benchmark. If I see that the benchmark had lot of variance historically, I concluded that the loop alignment didn't added to the variance. There were few exceptions like Perf_Enumerable.Count and Single.Max where post-loop alignment we saw some variance, but other than that, most of the regression pointed out in report were not related to the loop alignment work.

Performance

Here are some of the take away points:

JIT maintains a loop table to track all the loops it has seen and use it to perform various optimizations and analysis. It turns out that there is a bug (or a missing feature, whatever you call) where we do not record the cloned loop inside loop table. With that, we miss out opportunities to do the required analysis on cloned loops. Tracking issue: Certain loops do not get recorded in optLoopTable #43713
We sometimes add "NOP compensation" for the over-estimated instruction. Most of the over-estimated instructions are jumps and that that happens because branch tightening happens before loop alignment adjustment phase. In branch tightening we adjust the size of jump instructions depending on the target offset, but later, the loop alignment adjustment can further shorten the IG offsets resulting in over-estimated jump instruction. To address this, we need to combine branch tightening and loop alignment adjustment in same phase rather than separating them. I have updated the work item list to capture this explicitly. After doing this, we might further think about adding the compensation behind jmp or in cold block.
Often the encoding of jump instruction before vs. after loop alignment might increase because of added padding instructions. I do not think we can do much about it.

FannkuchRedux_9

The regression happens because of extra padding we added to align some of the loops inside benchmark code. Asmdiffs: https://www.diffchecker.com/ryZqqO6I.

MulMatrix

The regression happens because of addition of 5 "NOP overestimate compensation" in hot blocks. The overestimation happened for jump instructions. The way to address is by combining branch tightening with loop alignment adjustments (already planned work). Asmdiffs: https://www.diffchecker.com/IcktyNPi

TryGetUInt16:

In this method, TryParseUInt16D() is the hot method as seen from VTune screenshot:

With loop alignment, we ended up adding padding at the end of hot block that VTune pointed above (Block 2).
Asmdiffs: https://www.diffchecker.com/jHt5pv99

GetTypeCode

Although this shows regression, I verified that there is no loop inside the benchmark, nor a loop in the libraries method that it test. Further, I also compared the asm diffs and there is no "align" instruction.

ContainsTrue/ContainsFalse

This one is interesting. The hot method for all these benchmarks is GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf. In this method, we clone a loop, but later decide to not align it because it had call. However, we do not record that fact in the cloned loop because of which we have to align the cloned loop containing call. Additionally, until we emit the padding in the cloned loop, we have to also compensate the over-estimated instructions. One of the place where such overestimated instruction is compensated happen to be the actual hot loop that the method executes. You can see my detail analysis in #43713 (comment).

JulieLeeMSFT · 2021-02-04T15:21:57Z

Good analysis that loop alignment work didn’t introduce regression. What is cloned loop in the loop table? Thanks, Julie

…

kunalspathak · 2021-02-04T16:18:38Z

What is cloned loop in the loop table? Thanks, Julie

Sometimes, if a loop has checks like array's bound check, we will clone a loop. One variant of the loop will have no bound checks (and hence will be fast loop) while other variant will have bound checks (and will be a slow loop). Depending on the input, we decide whether to execute fast loop or the cloned (slow) loop. You can see an example here. Here, fast loop is L003e ~ L0058 while cloned slow loop is L005c ~ L0080. During JIT, we track all the loops in a data structure called optLoopTable. Whenever we want to do any analysis on loops, we iterate over this table to decide whether the given loop needs any update/optimization. When we clone a loop, we do not inserted an entry of cloned loop inside optLoopTable because of which further analysis doesn't happen on cloned loop. Towards the end of my comment in #43713 (comment), I have listed what type of analysis we miss out on cloned loop because of lack of recording its entry inside optLoopTable.

JulieLeeMSFT · 2021-03-22T20:28:39Z

Moved future since .NET 6 scope items are complete now.

MithrilMan · 2021-08-13T00:15:47Z

I read your article here about loop alignment here https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/ and it mentions that a loop like this

for (int l = 0; l < M; l++) {
        // body
        OtherMethod();
    }

isn't a candidate for inlining because contains a call to a method
What about methods that are inlined?
Shouldn't inline method be considered like loop body and be eligible for alignment if their body is small enough?

kunalspathak · 2021-08-13T00:19:02Z

What about methods that are inlined?

Inlining decision happens in early phase of compilation and when we decide whether to align a loop or not, we already know that it was inlined or not. We won't align the loop if we know for sure that the method will not be inlined.

Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Oct 9, 2020

adamsitnik added the tenet-performance-benchmarks Issue from performance benchmark label Oct 9, 2020

JulieLeeMSFT assigned kunalspathak Oct 9, 2020

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Oct 9, 2020

JulieLeeMSFT added this to the 6.0.0 milestone Oct 9, 2020

JulieLeeMSFT added the Epic Groups multiple user stories. Can be grouped under a theme. label Oct 9, 2020

JulieLeeMSFT added Team Epic and removed Epic Groups multiple user stories. Can be grouped under a theme. labels Oct 19, 2020

JulieLeeMSFT closed this as completed Oct 19, 2020

JulieLeeMSFT reopened this Oct 19, 2020

adamsitnik mentioned this issue Oct 23, 2020

Enforce code alignment dotnet/BenchmarkDotNet#756

Closed

This was referenced Oct 26, 2020

[Arm64] Use stp and str (SIMD) for stack prolog zeroing #43789

Closed

New event to track memory allocation for JIT code #44030

Merged

Measure loop alignment's performance impact on Microbenchmarks #44051

Closed

kunalspathak mentioned this issue Nov 14, 2020

Align inner loops #44370

Merged

JulieLeeMSFT added User Story A single user-facing feature. Can be grouped under an epic. Bottom Up Work Not part of a theme, epic, or user story and removed Team Epic labels Nov 16, 2020

JulieLeeMSFT mentioned this issue Jan 28, 2021

What's new in .NET 6 Preview 1 dotnet/core#5853

Closed

This was referenced Feb 10, 2021

Do not align cloned loops #48090

Merged

Superpmi on Microbenchmarks #47900

Merged

JulieLeeMSFT mentioned this issue Feb 24, 2021

What's new in .NET 6 Preview 2 dotnet/core#5889

Closed

JulieLeeMSFT modified the milestones: 6.0.0, Future Mar 22, 2021

JulieLeeMSFT mentioned this issue Mar 26, 2021

What's new in .NET 6 Preview 3 dotnet/core#5890

Closed

kunalspathak mentioned this issue Apr 28, 2021

33% performance degradation in .NET 5 #44457

Closed

danmoseley mentioned this issue Jul 27, 2021

[Perf] Regressions in System.Collections.TryGetValueFalse<String, String> #51258

Closed

ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 13, 2021

krwq removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Aug 13, 2021

ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 4, 2021

buyaa-n removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Oct 7, 2021

kunalspathak mentioned this issue Oct 22, 2021

Hide 'align' instruction behind jmp #60787

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stabilize performance measurement #43227

Stabilize performance measurement #43227

kunalspathak commented Oct 9, 2020 •

edited

Loading

kunalspathak commented Oct 9, 2020

danmoseley commented Oct 9, 2020

danmoseley commented Oct 9, 2020

kunalspathak commented Oct 9, 2020

JulieLeeMSFT commented Nov 2, 2020

danmoseley commented Jan 27, 2021

kunalspathak commented Feb 4, 2021

JulieLeeMSFT commented Feb 4, 2021 via email

kunalspathak commented Feb 4, 2021

JulieLeeMSFT commented Mar 22, 2021

MithrilMan commented Aug 13, 2021

kunalspathak commented Aug 13, 2021

Stabilize performance measurement #43227

Stabilize performance measurement #43227

Comments

kunalspathak commented Oct 9, 2020 • edited Loading

Stability

Performance lab infrastructure

Reliable benchmarks collection

Code alignment work

Future work

Performance tooling work

kunalspathak commented Oct 9, 2020

danmoseley commented Oct 9, 2020

danmoseley commented Oct 9, 2020

kunalspathak commented Oct 9, 2020

JulieLeeMSFT commented Nov 2, 2020

danmoseley commented Jan 27, 2021

kunalspathak commented Feb 4, 2021

Stability

Performance

FannkuchRedux_9

MulMatrix

TryGetUInt16:

GetTypeCode

ContainsTrue/ContainsFalse

JulieLeeMSFT commented Feb 4, 2021 via email

kunalspathak commented Feb 4, 2021

JulieLeeMSFT commented Mar 22, 2021

MithrilMan commented Aug 13, 2021

kunalspathak commented Aug 13, 2021

kunalspathak commented Oct 9, 2020 •

edited

Loading