Enable QJFL and OSR by default for x64 #61934

Closed
wants to merge 4 commits

Conversation

AndyAyersMS (Member)

Change these default values when the jit targets x64:

  • COMPlus_TC_QuickJitForLoops=1
  • COMPlus_TC_OnStackReplacement=1

The upshot is that on x64 more methods will be jitted at Tier0, and
we will rely on OSR to get out of long-running Tier0 methods.

Other architectures continue to use the old behavior for now, as
OSR is not yet supported outside of x64.

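For anyone who wants to experiment with (or opt out of) this behavior ahead of the change, these are ordinary COMPlus_* config switches, so they can be set as environment variables before launching a process. A minimal sketch in Windows cmd syntax; the variable names come from the description above, and the app name is just a placeholder:

set COMPlus_TC_QuickJitForLoops=1
set COMPlus_TC_OnStackReplacement=1
dotnet MyApp.dll

Setting both back to 0 should give the old x64 behavior.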
@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 22, 2021
@ghost commented Nov 22, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@AndyAyersMS (Member Author)

cc @dotnet/jit-contrib

Will be running various stress legs to try and spot areas where we still need work.

@AndyAyersMS (Member Author)

/azp run runtime-coreclr libraries-pgo

@AndyAyersMS (Member Author)

/azp run runtime-coreclr pgo

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS (Member Author)

The pgo legs have expected failures, so will have to interpret results manually.

Had planned to run jitstress too but that seems to be in bad shape right now.

@AndyAyersMS (Member Author)

Runtime PGO failures are all "expected".

But libraries PGO has some failures that will need investigating. A number of parallel tests fail across all the pgo modes.

;; windows x64 no pgo

System.Threading.Tasks.Tests.ParallelForTests.RunSimpleParallelForIncrementTest(increms: 1024)
Assert.Equal() Failure
Expected: 1024
Actual:   1032

@AndyAyersMS (Member Author)

Simple repro

using System;
using System.Threading;
using System.Threading.Tasks;

class X
{
    public static void RunSimpleParallelForeachAddTest_Array(int count, out int o_counter, out int o_expectCounter)
    {
        var data = new int[count];
        int expectCounter = 0;
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i;
            expectCounter = unchecked(expectCounter + i);
        }
        
        int counter = 0;
        
        // run inside of a separate task mgr to isolate impacts to other tests.
        Task t = Task.Run(
            delegate
            {
                Parallel.ForEach(data, (x) => Interlocked.Add(ref counter, x));
            });
        t.Wait();

        o_counter = counter;
        o_expectCounter = expectCounter;
    }

    public static void Main()
    {
        int counter;
        int expectCounter;
        RunSimpleParallelForeachAddTest_Array(100, out counter, out expectCounter);
        Console.WriteLine($"got {counter} expected {expectCounter}");
    }
}

This is the same as the libraries test but with smaller data size. Run with a variant of OSR stress to force OSR to happen even at lower iteration counts:

COMPlus_OSR_HitLimit=0
COMPlus_TC_OnStackReplacement_InitialCounter=10
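For example, from a Windows command prompt (a sketch -- repro.dll and the Core_Root path are placeholders; on bits without this PR you would also set COMPlus_TC_QuickJitForLoops=1 and COMPlus_TC_OnStackReplacement=1):

set COMPlus_OSR_HitLimit=0
set COMPlus_TC_OnStackReplacement_InitialCounter=10
<path-to-Core_Root>\corerun.exe repro.dll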

The results are wrong and vary from run to run:

got 5001 expected 4950
got 5129 expected 4950
got 5001 expected 4950
got 5049 expected 4950
got 5129 expected 4950

Two methods get rejitted with OSR:

Compiling    8 X::RunSimpleParallelForeachAddTest_Array, IL size = 93, hash=0xa33b9cbc Tier1-OSR @0x18
Compiling   53 <>c__DisplayClass19_0`1[__Canon][System.__Canon]::<ForWorker>b__1, IL size = 575, hash=0x4b0449a5 Tier1-OSR @0xcf

I've looked at the first one in some detail and it seems ok. So likely the issue is in the second one.

@AndyAyersMS (Member Author)

The ForWorker method is fairly complex and has OSR step blocks. It looks like the OSR entry state var is not getting zero-initialized despite being marked lvMustInit -- I suspect this setting gets overridden later on. So, I will explicitly zero it instead.

With that fixed, I'm running into another issue with OSR in the wider set of parallel tests, where the scratch BB is ending up as BBJ_COND.

Found when trying to enable OSR by default.

* Explicitly initialize the OSR step variable.
* Prevent `fgOptimizeUncondBranchToSimpleCond` from changing the
scratch entry BB to have conditional flow.
@AndyAyersMS (Member Author)

/azp run runtime-coreclr libraries-pgo

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@BruceForstall (Member)

I'm running into another issue with OSR in the wider set of parallel tests, where the scratch BB is ending up as BBJ_COND.

We recently had a conversation (on GitHub, I believe) where you pointed out that the scratch entry BB could be BBJ_COND for OSR. Are you saying now that doing so creates a bad condition and so the scratch BB should NOT ever be BBJ_COND?

@AndyAyersMS (Member Author)

out that the scratch entry BB could be BBJ_COND for OSR

You mean this? I wasn't specific. The other option we support now (for OSR) is BBJ_ALWAYS.

@AndyAyersMS (Member Author)

There's at least one more issue to fix -- now that OSR + PGO does instrumentation + optimization (see #61453), we may end up putting an instrumentation probe into the detached BBJ_RETURN block after a tail call. This leads to an assert in morph as we don't expect to see real IR there. The fix is to "move this probe" up to the previous block.

@AndyAyersMS (Member Author)

Repro for the case noted above. T triggers OSR, and since the OSR method is instrumented, we put an edge probe in the return block and trigger the assert. The loop in F is there just to keep it from getting inlined without having to mark it noinline (so the call to F ends up being a tail call).

;; with this PR
using System;
using System.Runtime.CompilerServices;

class X
{
    static int s;
    static int N;

    public static void F(int[] a)
    {
        for (int j = 0; j < N; j++)
        {
            for (int i = 0; i < a.Length; i++)
            {
                s -= a[i];
            }
        }
    }

    public static void T(bool p, int[] a)
    {
        if (p)
        {
            for (int j = 0; j < N; j++)
            {
                for (int i = 0; i < a.Length; i++)
                {
                    s += a[i];
                }
            }
            
            F(a);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Main()
    {
        int[] a = new int[1000];
        N = 100;
        s = 100;
        a[3] = 33;
        a[997] = 67;
        T(true, a);
        return s;
    }
}

results in

Assert failure(PID 7716 [0x00001e24], Thread: 19196 [0x4afc]): Assertion failed 'retExpr->gtOper == GT_RETURN' in 'X:T(bool,System.Int32[])' during 'Morph - Global' (IL size 54)

    File: C:\repos\runtime0\src\coreclr\jit\morph.cpp Line: 18007

@AndyAyersMS (Member Author)

Not clear yet how to fix the case above -- trying to detect and handle this during instrumentation seems iffy. And the return block has multiple predecessors, so "moving" the probe is not quite the right fix.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Dec 2, 2021
When both OSR and PGO are enabled, the jit will add PGO probes to OSR methods.
And if the OSR method also has a tail call, the jit must take care to not add
block probes to any return block reachable from possible tail call blocks.

Instead, instrumentation should create copies of the return block probe in each
return block predecessor (possibly splitting critical edges to make this viable).

Because all this happens early on, there are no pred lists. The analysis leverages
cheap preds instead, which means it needs to handle cases where a given pred has
multiple pred list entries. And it must also be aware that the OSR method's actual
flowgraph is a subgraph of the full initial graph.

This came up while scouting what it would take to enable OSR by default.
See dotnet#61934.
AndyAyersMS added a commit that referenced this pull request Dec 2, 2021
@AndyAyersMS (Member Author)

Merged to pick up #62263 and fixes for jit stress. Will kick off another round of stress testing.

@AndyAyersMS (Member Author)

Two Pri0 tests failing with known issue #62285.

@AndyAyersMS (Member Author)

/azp run runtime-coreclr libraries-pgo

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS (Member Author)

Time to trigger jit stress. This also has some known failures so I expect it to fail and will have to dig through and see if there's anything novel.

@AndyAyersMS (Member Author)

/azp run runtime-coreclr jitstress

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS (Member Author)

Jitstress failures are the "Known" issues with create span.

Libraries PGO looks good.

@AndyAyersMS (Member Author)

/azp run runtime-coreclr pgo

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS (Member Author)

Going to run GC stress but it has a lot of failures from #62067. So also expect it to fail and will have to sort through the results.

@AndyAyersMS (Member Author)

/azp run runtime-coreclr gcstress0x3-gcstress0xc

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@AndyAyersMS (Member Author)

Merged to pick up the latest round of fixes. The net change is now just updating two config switches.

@AndyAyersMS (Member Author)

Runtime failures are #62285.

@AndyAyersMS (Member Author)

Things are looking good from a correctness standpoint.

Still need to look at perf and startup.

@AndyAyersMS (Member Author)

For perf, first looking at the Benchmarks Game benchmarks on Windows, as run in the perf repo (via BDN).

BinaryTrees_2 reports as being slower:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BinaryTrees_2 | Job-ZQIGQA | OSR | 115.85 ms | 2.382 ms | 2.743 ms | 115.82 ms | 111.67 ms | 120.66 ms | 1.21 | 0.04 | 38000.0000 | 3000.0000 | 1000.0000 | 227 MB |
| BinaryTrees_2 | Job-XHSMLW | Default | 95.67 ms | 1.867 ms | 2.075 ms | 95.17 ms | 93.04 ms | 99.88 ms | 1.00 | 0.00 | 38000.0000 | 500.0000 | - | 227 MB |

and FannkuchRedux_2 reports being faster:

| Method | Job | Toolchain | n | expectedSum | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FannkuchRedux_2 | Job-YJLNSF | OSR | 10 | 73196 | 141.9 ms | 8.88 ms | 10.23 ms | 137.0 ms | 130.0 ms | 155.1 ms | 0.90 | 0.06 | 464 B |
| FannkuchRedux_2 | Job-QSMCVR | Default | 10 | 73196 | 153.0 ms | 1.39 ms | 1.30 ms | 152.7 ms | 151.2 ms | 155.0 ms | 1.00 | 0.00 | 464 B |

and RegexRedux_5 slows down when interpreted and speeds up when it's compiled:

| Method | Job | Toolchain | options | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RegexRedux_5 | Job-UMNTZP | OSR | None | 33.148 ms | 1.5822 ms | 1.7586 ms | 33.026 ms | 29.890 ms | 36.777 ms | 1.11 | 0.08 | - | - | - | 3 MB |
| RegexRedux_5 | Job-JQLQSH | Default | None | 29.981 ms | 1.6348 ms | 1.8171 ms | 29.568 ms | 26.665 ms | 34.004 ms | 1.00 | 0.00 | - | - | - | 3 MB |
| RegexRedux_5 | Job-UMNTZP | OSR | Compiled | 6.114 ms | 0.1207 ms | 0.1008 ms | 6.139 ms | 5.937 ms | 6.253 ms | 0.89 | 0.09 | 138.8889 | 138.8889 | 138.8889 | 3 MB |
| RegexRedux_5 | Job-JQLQSH | Default | Compiled | 7.503 ms | 0.9650 ms | 1.1113 ms | 6.959 ms | 6.521 ms | 9.839 ms | 1.00 | 0.00 | 210.5263 | 210.5263 | 210.5263 | 3 MB |

Generally speaking, for benchmark perf we'd expect to see very similar numbers. Benchmarking (via BDN) should be measuring Tier1 code, so the fact that we may also generate OSR methods shouldn't matter.

So will drill into these.

@AndyAyersMS (Member Author)

AndyAyersMS commented Dec 8, 2021

For BinaryTrees_2 with OSR, we fairly quickly recompile Bench with OSR, and then never produce a Tier1 version. The current default will compile Bench fully optimized (thanks to QJFL=0). The difference here is evidently the delta between a fully optimized and an OSR-compiled Bench.

So, two questions:

  1. why doesn't the OSR case eventually get to Tier1?
  2. why is the OSR version of Bench slower than the fully optimized one?

For (1) -- running with a checked jit so that we're collecting block counts and forcing everything through tier0, we see Bench is only called 19 times in total. So it isn't called often enough to tier up.

For future reference, here's exactly how to do this:

dotnet run -c Release -f net6.0 -- --filter BenchmarksGame.BinaryTrees_2.RunBench --corerun c:\repos\runtime4\artifacts\tests\coreclr\windows.x64.checked\Tests\Core_Root\corerun.exe  --envVars COMPlus_TC_QuickJitForLoops:1 COMPlus_PGODataPath:d:\bugs\osrx64default\pgo.block.data COMPlus_WritePGOData:1 COMPlus_JitEdgeProfiling:0 COMPlus_JitCollect64BitCounts:1 COMPlus_TC_CallCounting:0

That saves the PGO data to a file in text format -- and in the file we find for `Bench`:

@@@ codehash 0xB67FD712 methodhash 0xC7ADD92F ilSize 0x000000CE records 0x0000000D
MethodName: BenchmarksGame.BinaryTrees_2.Bench
Signature: int32  (int32,bool)
Schema InstrumentationKind 66 ILOffset 0 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 33 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 55 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 68 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 88 Count 1 Other 0
1660144 0
Schema InstrumentationKind 66 ILOffset 113 Count 1 Other 0
1660277 0
Schema InstrumentationKind 66 ILOffset 119 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 126 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 156 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 162 Count 1 Other 0
152 0
Schema InstrumentationKind 66 ILOffset 167 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 182 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 204 Count 1 Other 0
19 0

So the method was called 19 times, and the innermost loop executed (on average) 1660144/19 = 87,376 iterations per call.

For (2) -- the OSR method body code is very similar to that for the optimized version.

Profiling shows almost no time spent in Bench, and no appreciable time in the patchpoint helper.

Most of the time is in TreeNode.itemCheck and TreeNode.bottomUpTree. In both versions these start out at Tier0 and get rejitted to Tier1. Presumably the codegen for these matches up.

But per profiling, we see quite a bit more time in these methods in the OSR run.

BDN decides to do two calls to Bench per iteration for the default run, and only one for the OSR run. Presumably this happens because the default run's initial tests show that two calls are needed to get close to the goal of 250ms -- recall Bench is optimized here -- while for OSR only one call is needed.

;; default run

OverheadJitting  1: 1 op, 299000.00 ns, 299.0000 us/op
WorkloadJitting  1: 1 op, 155075400.00 ns, 155.0754 ms/op    // not long enough -- explore trying iterations

WorkloadPilot    1: 4 op, 412094200.00 ns, 103.0236 ms/op     // 4 is too many
WorkloadPilot    2: 2 op, 201633400.00 ns, 100.8167 ms/op     // 2 seems like a good number

WorkloadWarmup   1: 2 op, 204065100.00 ns, 102.0326 ms/op

;; OSR run

OverheadJitting  1: 1 op, 301500.00 ns, 301.5000 us/op
WorkloadJitting  1: 1 op, 190016800.00 ns, 190.0168 ms/op   // long enough, no iterations needed

WorkloadWarmup   1: 1 op, 175646800.00 ns, 175.6468 ms/op

Note that in the default run the time per iteration drops quite a bit (155 -> 100 ms) when going from one to two calls.

My guess is that this is the main factor in the perf difference because doing two calls per iteration alters how the benchmark interacts with GC (it is allocation intensive) and somehow, on average, this improves performance.

My homegrown ETL analysis shows something similar. The main discrepancy is time within the runtime itself.

;; default

Benchmark: found 20 intervals; mean interval 198.533ms   (99.26ms)

02.75%   1.1E+06     ?        Unknown
48.45%   1.939E+07   native   coreclr.dll
25.04%   1.002E+07   Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
21.11%   8.45E+06    Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
01.22%   4.9E+05     native   ntoskrnl.exe
00.55%   2.2E+05     native   clrjit.dll
00.42%   1.7E+05     native   ntdll.dll
00.35%   1.4E+05     FullOpt  [MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

;; default / 2

Benchmark: found 20 intervals; mean interval 99.26ms

2.75%	5.50E+05	?	Unknown
48.45%	9.70E+06	native	coreclr.dll
25.04%	5.01E+06	Tier-1	[MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
21.11%	4.23E+06	Tier-1	[MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
1.22%	2.45E+05	native	ntoskrnl.exe
0.55%	1.10E+05	native	clrjit.dll
0.42%	8.50E+04	native	ntdll.dll
0.35%	7.00E+04	FullOpt	[MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

;; OSR

Benchmark: found 20 intervals; mean interval 118.196ms

02.37%   5.7E+05     ?        Unknown
53.35%   1.282E+07   native   coreclr.dll
21.64%   5.2E+06     Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
19.14%   4.6E+06     Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
01.83%   4.4E+05     native   ntoskrnl.exe
01.00%   2.4E+05     native   clrjit.dll
00.46%   1.1E+05     native   ntdll.dll
00.12%   3E+04       OSR      [MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

This decision on how many invocations per iteration can be overridden with --unrollFactor. Setting it to 2 (which, a bit oddly, leads BDN to make four calls per iteration) gives:

| Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BinaryTrees_2 | Job-ODKACC | OSR | 104.05 ms | 12.890 ms | 14.328 ms | 96.51 ms | 95.14 ms | 134.96 ms | 1.19 | 0.18 | 38000.0000 | 750.0000 | 250.0000 | 227 MB |
| BinaryTrees_2 | Job-RVAXEO | default | 91.19 ms | 0.505 ms | 0.394 ms | 91.33 ms | 90.53 ms | 91.67 ms | 1.00 | 0.00 | 38000.0000 | 750.0000 | 250.0000 | 227 MB |

Note that the median times for the two are now fairly close.
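For reference, the override can presumably be passed on the BDN command line in the same style as the earlier corerun invocation (a sketch; only --unrollFactor is new relative to that command):

dotnet run -c Release -f net6.0 -- --filter BenchmarksGame.BinaryTrees_2.RunBench --unrollFactor 2 --corerun <path-to-Core_Root>\corerun.exe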

The per iteration times for the OSR version show that the Tier1 code for Bench is not available until the 5th iteration:

WorkloadActual   1: 4 op, 512118000.00 ns, 128.0295 ms/op
WorkloadActual   2: 4 op, 520523500.00 ns, 130.1309 ms/op
WorkloadActual   3: 4 op, 505522600.00 ns, 126.3807 ms/op
WorkloadActual   4: 4 op, 516531600.00 ns, 129.1329 ms/op
WorkloadActual   5: 4 op, 487114100.00 ns, 121.7785 ms/op
WorkloadActual   6: 4 op, 374112200.00 ns, 93.5280 ms/op
WorkloadActual   7: 4 op, 389700000.00 ns, 97.4250 ms/op
WorkloadActual   8: 4 op, 385259300.00 ns, 96.3148 ms/op
WorkloadActual   9: 4 op, 395792600.00 ns, 98.9481 ms/op
WorkloadActual  10: 4 op, 392235500.00 ns, 98.0589 ms/op
WorkloadActual  11: 4 op, 385404600.00 ns, 96.3512 ms/op
WorkloadActual  12: 4 op, 398635600.00 ns, 99.6589 ms/op
WorkloadActual  13: 4 op, 388586400.00 ns, 97.1466 ms/op
WorkloadActual  14: 4 op, 395079700.00 ns, 98.7699 ms/op
WorkloadActual  15: 4 op, 388083400.00 ns, 97.0208 ms/op
WorkloadActual  16: 4 op, 386997800.00 ns, 96.7494 ms/op
WorkloadActual  17: 4 op, 386348600.00 ns, 96.5871 ms/op
WorkloadActual  18: 4 op, 389505000.00 ns, 97.3762 ms/op
WorkloadActual  19: 4 op, 396823800.00 ns, 99.2060 ms/op
WorkloadActual  20: 4 op, 383543500.00 ns, 95.8859 ms/op

So the mean value above for OSR shows a blend of Tier0, OSR and Tier1 times.

Despite all that, the final iterations for OSR are consistently a bit slower (~5%) than the default iterations. It's not clear why.

But it seems the predominant factor here is BDN's strategy changing because Bench is no longer fully optimized with QJFL=1.

@AndyAyersMS (Member Author)

For FannkuchRedux_2 we see the OSR method running faster than the Tier1 method. Here are the iteration times:

WorkloadResult   1: 2 op, 294451000.00 ns, 147.2255 ms/op
   ;; OSR version of fannkuch
WorkloadResult   2: 2 op, 268075800.00 ns, 134.0379 ms/op
WorkloadResult   3: 2 op, 270717600.00 ns, 135.3588 ms/op
WorkloadResult   4: 2 op, 275813700.00 ns, 137.9068 ms/op
WorkloadResult   5: 2 op, 267628700.00 ns, 133.8143 ms/op
WorkloadResult   6: 2 op, 268275400.00 ns, 134.1377 ms/op
WorkloadResult   7: 2 op, 268396300.00 ns, 134.1981 ms/op
WorkloadResult   8: 2 op, 267082500.00 ns, 133.5412 ms/op
WorkloadResult   9: 2 op, 273082600.00 ns, 136.5413 ms/op
WorkloadResult  10: 2 op, 267407200.00 ns, 133.7036 ms/op
WorkloadResult  11: 2 op, 267643800.00 ns, 133.8219 ms/op
    ;; Tier1 version of fannkuch 
WorkloadResult  12: 2 op, 302364000.00 ns, 151.1820 ms/op
WorkloadResult  13: 2 op, 300932300.00 ns, 150.4661 ms/op
WorkloadResult  14: 2 op, 303645700.00 ns, 151.8228 ms/op
WorkloadResult  15: 2 op, 299903400.00 ns, 149.9517 ms/op
WorkloadResult  16: 2 op, 299843600.00 ns, 149.9218 ms/op
WorkloadResult  17: 2 op, 298525300.00 ns, 149.2627 ms/op
WorkloadResult  18: 2 op, 321872300.00 ns, 160.9361 ms/op
WorkloadResult  19: 2 op, 316387400.00 ns, 158.1937 ms/op
WorkloadResult  20: 2 op, 313691800.00 ns, 156.8459 ms/op

The Tier1 method is loaded in the middle of iteration 11, so gets called for iteration 12. The "slow" iteration times after that point are similar to those seen by the default config (perhaps a tiny bit better).

So naturally, the question is why. All the time here is spent in fannkuch.

First, the Tier1 and full-opt versions are identical. The OSR version just omits the initial for loop:

https://github.com/dotnet/performance/blob/529f33c1955ae3360d794d1fc80dfb978bb2f222/src/benchmarks/micro/runtime/BenchmarksGame/fannkuch-redux-2.cs#L32

but contains the rest of the loops. We do less cloning in the OSR version. That seems to lead to somewhat better register allocation, though it's unclear whether that accounts for all of the perf improvement.

Sorting out why things are faster for OSR is going to take some time.

@AndyAyersMS (Member Author)

Running some other segments of the full suite at random:

  • Utf8Formatter similar results with/without OSR
  • Sort shows some +/- 20% swings.

At this point it seems clear that there are some tests -- almost certainly the ones that do a lot of looping internally, possibly restricted to the subset where the looping methods aren't called a lot -- where the BDN results will reflect the performance of the OSR versions. And the OSR version can be faster or slower in ways that will be hard to predict.

By and large all these should be cases where the benchmark strategy isn't sufficient to get us to Tier1 reliably. In the past for this subset it did not matter, as QJFL=0 ensured these methods were exempt from tiering.

I'm going to try and do larger sweeps and get a ballpark estimate as to how many tests are likely impacted in this way.

@EgorBo (Member)

EgorBo commented Dec 9, 2021

I hope the dotnet/performance infrastructure will be in a good state when this is merged, so it catches all improvements/regressions.

@AndyAyersMS (Member Author)

I hope dotnet/performance infrastructure will be in a good state when this is merged

I won't merge unless things are in a good state.

@ghost closed this Jan 8, 2022
@ghost commented Jan 8, 2022

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

@ghost locked as resolved and limited conversation to collaborators Feb 8, 2022
This pull request was closed.