Enable QJFL and OSR by default for x64 #61934

AndyAyersMS · 2021-11-22T19:36:31Z

Change these default values when the jit targets x64:

COMPlus_TC_QuickJitForLoops=1
COMPlus_TC_OnStackReplacement=1

The upshot is that on x64 more methods will be jitted at Tier0, and
we will rely on OSR to get out of long-running Tier0 methods.

Other architectures continue to use the old behavior for now, as
OSR is not yet supported outside of x64.

Change these default values when the jit targets x64: * COMPlus_TC_QuickJitForLoops=1 * COMPlus_TC_OnStackReplacement=1 The upshot is that on x64 more methods will be jitted at Tier0, and we will rely on OSR to get out of long-running Tier0 methods. Other architectures continue to use the old behavior for now, as OSR is not yet supported outside of x64.

ghost · 2021-11-22T19:36:37Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Change these default values when the jit targets x64:

COMPlus_TC_QuickJitForLoops=1
COMPlus_TC_OnStackReplacement=1

The upshot is that on x64 more methods will be jitted at Tier0, and
we will rely on OSR to get out of long-running Tier0 methods.

Other architectures continue to use the old behavior for now, as
OSR is not yet supported outside of x64.

Author:	AndyAyersMS
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

AndyAyersMS · 2021-11-22T19:37:08Z

cc @dotnet/jit-contrib

Will be running various stress legs to try and spot areas where we still need work.

AndyAyersMS · 2021-11-22T21:38:07Z

/azp run runtime-coreclr libraries-pgo

AndyAyersMS · 2021-11-22T21:38:34Z

/azp run runtime-coreclr pgo

azure-pipelines · 2021-11-22T21:38:36Z

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines · 2021-11-22T21:38:51Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-11-22T21:49:16Z

The pgo legs have expected failures, so will have to interpret results manually.

Had planned to run jitstress too but that seems to be in bad shape right now.

AndyAyersMS · 2021-11-23T16:02:57Z

Runtime PGO failures are all "expected".

But libraries PGO has some failures that will need investigating. A number of parallel tests fail across all the pgo modes.

;; windows x64 no pgo

System.Threading.Tasks.Tests.ParallelForTests.RunSimpleParallelForIncrementTest(increms: 1024)
Assert.Equal() Failure
Expected: 1024
Actual:   1032

AndyAyersMS · 2021-11-23T20:39:43Z

Simple repro

using System;
using System.Threading;
using System.Threading.Tasks;

class X
{
    public static void RunSimpleParallelForeachAddTest_Array(int count, out int o_counter, out int o_expectCounter)
    {
        var data = new int[count];
        int expectCounter = 0;
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = i;
            expectCounter = unchecked(expectCounter + i);
        }
        
        int counter = 0;
        
        // run inside of a separate task mgr to isolate impacts to other tests.
        Task t = Task.Run(
            delegate
            {
                Parallel.ForEach(data, (x) => Interlocked.Add(ref counter, x));
            });
        t.Wait();

        o_counter = counter;
        o_expectCounter = expectCounter;
    }

    public static void Main()
    {
        int counter;
        int expectCounter;
        RunSimpleParallelForeachAddTest_Array(100, out counter, out expectCounter);
        Console.WriteLine($"got {counter} expected {expectCounter}");
    }
}

This is the same as the libraries test but with smaller data size. Run with a variant of OSR stress to force OSR to happen even at lower iteration counts:

complus_OSR_HitLimit=0
complus_TC_OnStackReplacement_InitialCounter=10

The results are wrong and vary from run to run

got 5001 expected 4950
got 5129 expected 4950
got 5001 expected 4950
got 5049 expected 4950
got 5129 expected 4950

Two methods get rejitted with OSR:

Compiling    8 X::RunSimpleParallelForeachAddTest_Array, IL size = 93, hash=0xa33b9cbc Tier1-OSR @0x18
Compiling   53 <>c__DisplayClass19_0`1[__Canon][System.__Canon]::<ForWorker>b__1, IL size = 575, hash=0x4b0449a5 Tier1-OSR @0xcf

I've looked at the first one in some detail and it seems ok. So likely the issue is in the second one.

AndyAyersMS · 2021-11-23T21:10:23Z

The ForWorker method is fairly complex and has OSR step blocks. It looks like the OSR entry state var is not getting zero initialized despite being marked lvMustInit -- suspect this setting gets over-ridden later on. So, will explicitly zero this instead.

With that fixed, I'm running into another issue with OSR in the wider set of parallel tests) where the scratch BB is ending up as BBJ_COND.

Found when trying to enable OSR by default. * Explicitly initalize the OSR step variable. * Prevent `fgOptimizeUncondBranchToSimpleCond` from changing the scratch entry BB to have conditional flow.

AndyAyersMS · 2021-11-23T22:01:34Z

/azp run runtime-coreclr libraries-pgo

azure-pipelines · 2021-11-23T22:01:47Z

Azure Pipelines successfully started running 1 pipeline(s).

BruceForstall · 2021-11-24T00:32:06Z

I'm running into another issue with OSR in the wider set of parallel tests) where the scratch BB is ending up as BBJ_COND.

We recently had a conversation (on GitHub, I believe) where you pointed out that the scratch entry BB could be BBJ_COND for OSR. Are you saying now that doing so creates a bad condition and so the scratch BB should NOT ever be BBJ_COND?

AndyAyersMS · 2021-11-24T01:03:00Z

out that the scratch entry BB could be BBJ_COND for OSR

You mean this? I wasn't specific. The other option we support now (for OSR) is BBJ_ALWAYS

AndyAyersMS · 2021-11-24T01:06:45Z

There's at least one more issue to fix -- now that OSR + PGO does instrumentation + optimization (see #61453), we may end up putting an instrumentation probe into the detached BBJ_RETURN block after a tail call. This leads to an assert in morph as we don't expect to see real IR there. The fix is to "move this probe" up to the previous block.

AndyAyersMS · 2021-11-24T01:25:45Z

Repro for the case noted above. T triggers OSR and the OSR method is instrumented, we put an edge probe in the return block and trigger the assert. Loop in F is there just to keep it from getting inlined w/o marking it noinline (so it ends up being tail called).

;; with this PR
using System;
using System.Runtime.CompilerServices;

class X
{
    static int s;
    static int N;

    public static void F(int[] a)
    {
        for (int j = 0; j < N; j++)
        {
            for (int i = 0; i < a.Length; i++)
            {
                s -= a[i];
            }
        }
    }

    public static void T(bool p, int[] a)
    {
        if (p)
        {
            for (int j = 0; j < N; j++)
            {
                for (int i = 0; i < a.Length; i++)
                {
                    s += a[i];
                }
            }
            
            F(a);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Main()
    {
        int[] a = new int[1000];
        N = 100;
        s = 100;
        a[3] = 33;
        a[997] = 67;
        T(true, a);
        return s;
    }
}

results in

Assert failure(PID 7716 [0x00001e24], Thread: 19196 [0x4afc]): Assertion failed 'retExpr->gtOper == GT_RETURN' in 'X:T(bool,System.Int32[])' during 'Morph - Global' (IL size 54)

    File: C:\repos\runtime0\src\coreclr\jit\morph.cpp Line: 18007

AndyAyersMS · 2021-11-24T01:54:10Z

Not clear yet how to fix the case above -- trying to detect and handle this during instrumentation seems iffy. And the return block has multiple predecessors, so "moving" the probe is not quite the right fix.

When both OSR and PGO are enabled, the jit will add PGO probes to OSR methods. And if the OSR method also has a tail call, the jit must take care to not add block probes to any return block reachable from possible tail call blocks. Instead, instrumentation should create copies of the return block probe in each return block predecessor (possibly splitting critical edges to make this viable). Because all this happens early on, there are no pred lists. The analysis leverages cheap preds instead, which means it needs to handle cases where a given pred has multiple pred list entries. And it must also be aware that the OSR method's actual flowgraph is a subgraph of the full initial graph. This came up while scouting what it would take to enable OSR by default. See dotnet#61934.

When both OSR and PGO are enabled, the jit will add PGO probes to OSR methods. And if the OSR method also has a tail call, the jit must take care to not add block probes to any return block reachable from possible tail call blocks. Instead, instrumentation should create copies of the return block probe in each return block predecessor (possibly splitting critical edges to make this viable). Because all this happens early on, there are no pred lists. The analysis leverages cheap preds instead, which means it needs to handle cases where a given pred has multiple pred list entries. And it must also be aware that the OSR method's actual flowgraph is a subgraph of the full initial graph. This came up while scouting what it would take to enable OSR by default. See #61934.

AndyAyersMS · 2021-12-02T23:03:15Z

Merged to pick up #62263 and fixes for jit stress. Will kick off another round of stress testing.

AndyAyersMS · 2021-12-03T01:50:57Z

Two Pri0 tests failing with known issue #62285.

AndyAyersMS · 2021-12-03T01:51:16Z

/azp run runtime-coreclr libraries-pgo

azure-pipelines · 2021-12-03T01:51:29Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-03T01:55:14Z

Time to trigger jit stress. This also has some known failures so I expect it to fail and will have to dig through and see if there's anything novel.

AndyAyersMS · 2021-12-03T01:55:39Z

/azp run runtime-coreclr jitstress

azure-pipelines · 2021-12-03T01:55:54Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-03T17:45:04Z

Jitstress failures are the "Known" issues with create span.

Libraries PGO looks good.

AndyAyersMS · 2021-12-03T17:45:13Z

/azp run runtime-coreclr pgo

azure-pipelines · 2021-12-03T17:45:51Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-03T17:48:24Z

Going to run GC stress but it has a lot of failures from #62067. So also expect it to fail and will have to sort through the results.

AndyAyersMS · 2021-12-03T17:48:51Z

/azp run runtime-coreclr gcstress0x3-gcstress0xc

azure-pipelines · 2021-12-03T17:49:07Z

Azure Pipelines successfully started running 1 pipeline(s).

AndyAyersMS · 2021-12-06T20:12:43Z

Merge to pick up latest round of fixes. Net change is now just updating two config switches.

AndyAyersMS · 2021-12-07T01:17:53Z

Runtime failures are #62285.

AndyAyersMS · 2021-12-07T01:19:04Z

Things are looking good from correctness standpoint.

Still need to look at perf and startup.

AndyAyersMS · 2021-12-08T20:13:59Z

For perf, first looking at benchmark games on windows as run in the perf repo (via BDN).

BinaryTrees_2 reports as being slower:

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
BinaryTrees_2	Job-ZQIGQA	OSR	115.85 ms	2.382 ms	2.743 ms	115.82 ms	111.67 ms	120.66 ms	1.21	0.04	38000.0000	3000.0000	1000.0000	227 MB
BinaryTrees_2	Job-XHSMLW	Default	95.67 ms	1.867 ms	2.075 ms	95.17 ms	93.04 ms	99.88 ms	1.00	0.00	38000.0000	500.0000	-	227 MB

and FannkuchRedux_2 reports being faster:

Method	Job	Toolchain	n	expectedSum	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Allocated
FannkuchRedux_2	Job-YJLNSF	OSR	10	73196	141.9 ms	8.88 ms	10.23 ms	137.0 ms	130.0 ms	155.1 ms	0.90	0.06	464 B
FannkuchRedux_2	Job-QSMCVR	Default	10	73196	153.0 ms	1.39 ms	1.30 ms	152.7 ms	151.2 ms	155.0 ms	1.00	0.00	464 B

and RegexRedux_5 slows down when intepreted and speeds up when it's compiled:

Method	Job	Toolchain	options	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
RegexRedux_5	Job-UMNTZP	OSR	None	33.148 ms	1.5822 ms	1.7586 ms	33.026 ms	29.890 ms	36.777 ms	1.11	0.08	-	-	-	3 MB
RegexRedux_5	Job-JQLQSH	Default	None	29.981 ms	1.6348 ms	1.8171 ms	29.568 ms	26.665 ms	34.004 ms	1.00	0.00	-	-	-	3 MB
RegexRedux_5	Job-UMNTZP	OSR	Compiled	6.114 ms	0.1207 ms	0.1008 ms	6.139 ms	5.937 ms	6.253 ms	0.89	0.09	138.8889	138.8889	138.8889	3 MB
RegexRedux_5	Job-JQLQSH	Default	Compiled	7.503 ms	0.9650 ms	1.1113 ms	6.959 ms	6.521 ms	9.839 ms	1.00	0.00	210.5263	210.5263	210.5263	3 MB

Generally speaking, for benchmark perf we'd expect to see very similar numbers. Benchmarking (via BDN) should be measuring Tier1 code, so the fact that we may also generate OSR methods shouldn't matter.

So will drill into these.

AndyAyersMS · 2021-12-08T20:52:32Z

For BinaryTrees_2 with OSR, we fairly quickly recompile Bench with OSR, and then never produce a Tier1 version. Current default will compile Bench fully optimized (thanks to QJFL=0). The difference here is evidently the delta between a fully optimized and an OSR compiled Bench.

So, two questions:

why doesn't the OSR case eventually get to Tier1?
why is the OSR version of Bench slower than the fully optimized one?

For (1) -- running with a checked jit so that we're collecting block counts and forcing everything through tier0, we see Bench is only called 19 times in total. So it isn't called often enough to tier up.

For future reference here's exactly how to do this.

dotnet run -c Release -f net6.0 -- --filter BenchmarksGame.BinaryTrees_2.RunBench --corerun c:\repos\runtime4\artifacts\tests\coreclr\windows.x64.checked\Tests\Core_Root\corerun.exe  --envVars COMPlus_TC_QuickJitForLoops:1 COMPlus_PGODataPath:d:\bugs\osrx64default\pgo.block.data COMPlus_WritePGOData:1 COMPlus_JitEdgeProfiling:0 COMPlus_JitCollect64BitCounts:1 COMPlus_TC_CallCounting:0

That saves the PGO data to a file in text format -- and in the file we find for `Bench':

@@@ codehash 0xB67FD712 methodhash 0xC7ADD92F ilSize 0x000000CE records 0x0000000D
MethodName: BenchmarksGame.BinaryTrees_2.Bench
Signature: int32  (int32,bool)
Schema InstrumentationKind 66 ILOffset 0 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 33 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 55 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 68 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 88 Count 1 Other 0
1660144 0
Schema InstrumentationKind 66 ILOffset 113 Count 1 Other 0
1660277 0
Schema InstrumentationKind 66 ILOffset 119 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 126 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 156 Count 1 Other 0
133 0
Schema InstrumentationKind 66 ILOffset 162 Count 1 Other 0
152 0
Schema InstrumentationKind 66 ILOffset 167 Count 1 Other 0
19 0
Schema InstrumentationKind 66 ILOffset 182 Count 1 Other 0
0 0
Schema InstrumentationKind 66 ILOffset 204 Count 1 Other 0
19 0

So the method was called 19 times, and the innermost loop executed (on average) 1660144/19 = 87,376 iterations per call.

For (2) -- the OSR method body code is very similar to that for the optimized version.

Profiling shows almost no time spent in Bench, and no appreciable time in the patchpoint helper.

Most of the time is in TreeNode.itemCheck and TreeNode.bottomUpTree. In both versions these start out at Tier0 and get rejitted to Tier1. Presumably the codegen for these matches up.

But per profiling, we see quite a bit more time in these methods in the OSR run.

BDN decides to do two calls to Bench per iteration for the default run, and only one for the OSR run. Presumably this happens because the default run initial tests show that two calls are needed to get close to the goal of 250ms -- recall Bench is optimized here -- while for OSR only one call is needed.

;; default run

OverheadJitting  1: 1 op, 299000.00 ns, 299.0000 us/op
WorkloadJitting  1: 1 op, 155075400.00 ns, 155.0754 ms/op    // not long enough -- explore trying iterations

WorkloadPilot    1: 4 op, 412094200.00 ns, 103.0236 ms/op     // 4 is too many
WorkloadPilot    2: 2 op, 201633400.00 ns, 100.8167 ms/op     // 2 seems like a good number

WorkloadWarmup   1: 2 op, 204065100.00 ns, 102.0326 ms/op

;; OSR run

OverheadJitting  1: 1 op, 301500.00 ns, 301.5000 us/op
WorkloadJitting  1: 1 op, 190016800.00 ns, 190.0168 ms/op   // long enough, no iterations needed

WorkloadWarmup   1: 1 op, 175646800.00 ns, 175.6468 ms/op

Note in the default that the time per iteration drops quite a bit 155 -> 100 when going from one to two calls.

My guess is that this is the main factor in the perf difference because doing two calls per iteration alters how the benchmark interacts with GC (it is allocation intensive) and somehow, on average, this improves performance.

My homegrown ETL analysis shows something similar. The main discrepancy is time within the runtime itself.

;; default

Benchmark: found 20 intervals; mean interval 198.533ms   (99.26ms)

02.75%   1.1E+06     ?        Unknown
48.45%   1.939E+07   native   coreclr.dll
25.04%   1.002E+07   Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
21.11%   8.45E+06    Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
01.22%   4.9E+05     native   ntoskrnl.exe
00.55%   2.2E+05     native   clrjit.dll
00.42%   1.7E+05     native   ntdll.dll
00.35%   1.4E+05     FullOpt  [MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

;; default / 2

Benchmark: found 20 intervals; mean interval 99.26ms

2.75%	5.50E+05	?	Unknown
48.45%	9.70E+06	native	coreclr.dll
25.04%	5.01E+06	Tier-1	[MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
21.11%	4.23E+06	Tier-1	[MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
1.22%	2.45E+05	native	ntoskrnl.exe
0.55%	1.10E+05	native	clrjit.dll
0.42%	8.50E+04	native	ntdll.dll
0.35%	7.00E+04	FullOpt	[MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

;; OSR

Benchmark: found 20 intervals; mean interval 118.196ms

02.37%   5.7E+05     ?        Unknown
53.35%   1.282E+07   native   coreclr.dll
21.64%   5.2E+06     Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.itemCheck()
19.14%   4.6E+06     Tier-1   [MicroBenchmarks]BinaryTrees_2+TreeNode.bottomUpTree(int32)
01.83%   4.4E+05     native   ntoskrnl.exe
01.00%   2.4E+05     native   clrjit.dll
00.46%   1.1E+05     native   ntdll.dll
00.12%   3E+04       OSR      [MicroBenchmarks]BinaryTrees_2.Bench(int32,bool)

This decision on how many invocations per iteration can be fixed with --unrollFactor. Setting this to 2 (which a bit oddly leads BDN to make four calls per iteration):

Method	Job	Toolchain	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
BinaryTrees_2	Job-ODKACC	OSR	104.05 ms	12.890 ms	14.328 ms	96.51 ms	95.14 ms	134.96 ms	1.19	0.18	38000.0000	750.0000	250.0000	227 MB
BinaryTrees_2	Job-RVAXEO	default	91.19 ms	0.505 ms	0.394 ms	91.33 ms	90.53 ms	91.67 ms	1.00	0.00	38000.0000	750.0000	250.0000	227 MB

where note the median times for the two are now fairly close.

The per iteration times for the OSR version show that the Tier1 code for Bench is not available until the 5th iteration:

WorkloadActual   1: 4 op, 512118000.00 ns, 128.0295 ms/op
WorkloadActual   2: 4 op, 520523500.00 ns, 130.1309 ms/op
WorkloadActual   3: 4 op, 505522600.00 ns, 126.3807 ms/op
WorkloadActual   4: 4 op, 516531600.00 ns, 129.1329 ms/op
WorkloadActual   5: 4 op, 487114100.00 ns, 121.7785 ms/op
WorkloadActual   6: 4 op, 374112200.00 ns, 93.5280 ms/op
WorkloadActual   7: 4 op, 389700000.00 ns, 97.4250 ms/op
WorkloadActual   8: 4 op, 385259300.00 ns, 96.3148 ms/op
WorkloadActual   9: 4 op, 395792600.00 ns, 98.9481 ms/op
WorkloadActual  10: 4 op, 392235500.00 ns, 98.0589 ms/op
WorkloadActual  11: 4 op, 385404600.00 ns, 96.3512 ms/op
WorkloadActual  12: 4 op, 398635600.00 ns, 99.6589 ms/op
WorkloadActual  13: 4 op, 388586400.00 ns, 97.1466 ms/op
WorkloadActual  14: 4 op, 395079700.00 ns, 98.7699 ms/op
WorkloadActual  15: 4 op, 388083400.00 ns, 97.0208 ms/op
WorkloadActual  16: 4 op, 386997800.00 ns, 96.7494 ms/op
WorkloadActual  17: 4 op, 386348600.00 ns, 96.5871 ms/op
WorkloadActual  18: 4 op, 389505000.00 ns, 97.3762 ms/op
WorkloadActual  19: 4 op, 396823800.00 ns, 99.2060 ms/op
WorkloadActual  20: 4 op, 383543500.00 ns, 95.8859 ms/op

So the mean value above for OSR shows a blend of Tier0, OSR and Tier1 times.

Despite all that the final iterations for OSR are consistently a bit slower (~5%) than the default iterations. Not clear why.

But it seems the predominant factor here is BDN's strategy changing because Bench is no longer fully optimized with QJFL=1.

AndyAyersMS · 2021-12-09T03:05:05Z

For FannkuchRedux_2 we see the OSR method running faster than the Tier1 method. Here are the iteration times:

WorkloadResult   1: 2 op, 294451000.00 ns, 147.2255 ms/op
   ;; OSR version of fannkuch
WorkloadResult   2: 2 op, 268075800.00 ns, 134.0379 ms/op
WorkloadResult   3: 2 op, 270717600.00 ns, 135.3588 ms/op
WorkloadResult   4: 2 op, 275813700.00 ns, 137.9068 ms/op
WorkloadResult   5: 2 op, 267628700.00 ns, 133.8143 ms/op
WorkloadResult   6: 2 op, 268275400.00 ns, 134.1377 ms/op
WorkloadResult   7: 2 op, 268396300.00 ns, 134.1981 ms/op
WorkloadResult   8: 2 op, 267082500.00 ns, 133.5412 ms/op
WorkloadResult   9: 2 op, 273082600.00 ns, 136.5413 ms/op
WorkloadResult  10: 2 op, 267407200.00 ns, 133.7036 ms/op
WorkloadResult  11: 2 op, 267643800.00 ns, 133.8219 ms/op
    ;; Tier1 version of fannkuch 
WorkloadResult  12: 2 op, 302364000.00 ns, 151.1820 ms/op
WorkloadResult  13: 2 op, 300932300.00 ns, 150.4661 ms/op
WorkloadResult  14: 2 op, 303645700.00 ns, 151.8228 ms/op
WorkloadResult  15: 2 op, 299903400.00 ns, 149.9517 ms/op
WorkloadResult  16: 2 op, 299843600.00 ns, 149.9218 ms/op
WorkloadResult  17: 2 op, 298525300.00 ns, 149.2627 ms/op
WorkloadResult  18: 2 op, 321872300.00 ns, 160.9361 ms/op
WorkloadResult  19: 2 op, 316387400.00 ns, 158.1937 ms/op
WorkloadResult  20: 2 op, 313691800.00 ns, 156.8459 ms/op

The Tier1 method is loaded in the middle of iteration 11, so gets called for iteration 12. The "slow" iteration times after that point are similar to those seen by the default config (perhaps a tiny bit better).

So naturally, the question is why. All the time here is spent in fannkuch.

First, the Tier1 and full opt version are identical. The OSR version just omits the initial for loop:

https://github.com/dotnet/performance/blob/529f33c1955ae3360d794d1fc80dfb978bb2f222/src/benchmarks/micro/runtime/BenchmarksGame/fannkuch-redux-2.cs#L32

but contains the rest of the loops. We do less cloning in the OSR version. That seems to lead to somewhat better register allocation, though unclear if that accounts for all of the perf improvement.

Sorting out why things are faster for OSR is going to take some time.

AndyAyersMS · 2021-12-09T16:50:47Z

Running some other segments of the full suite at random:

Utf8Formatter similar results with/without OSR
Sort shows some +/- 20% swings.

At this point it seems clear that there are some tests -- almost certainly the ones that do a lot of looping internally, possibly restricted to the subset where the looping methods aren't called a lot -- where the BDN results will reflect the performance of the OSR versions. And the OSR version can be faster or slower in ways that will be hard to predict.

By and large all these should be cases where the benchmark strategy isn't sufficient to get us to Tier1 reliably. In the past for this subset it did not matter, as QJFL=0 ensured these methods were exempt from tiering.

I'm going to try and do larger sweeps and get a ballpark estimate as to how many tests are likely impacted in this way.

EgorBo · 2021-12-09T17:09:47Z

I hope dotnet/performance infrastructure will be in a good state when this is merged to catch all improvements/regressions

AndyAyersMS · 2021-12-09T17:29:07Z

I hope dotnet/performance infrastructure will be in a good state when this is merged

I won't merge unless things are in a good state.

ghost · 2022-01-08T23:00:52Z

Draft Pull Request was automatically closed for inactivity. Please let us know if you'd like to reopen it.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 22, 2021

AndyAyersMS mentioned this pull request Nov 22, 2021

On Stack Replacement Next Steps #33658

Open

72 tasks

JIT: two fixes for OSR

b5b7fcd

Found when trying to enable OSR by default. * Explicitly initalize the OSR step variable. * Prevent `fgOptimizeUncondBranchToSimpleCond` from changing the scratch entry BB to have conditional flow.

AndyAyersMS mentioned this pull request Dec 2, 2021

JIT: handle interaction of OSR, PGO, and tail calls #62263

Merged

Merge branch 'main' into OSROnByDefaultForx64

772f7c3

runfoapp bot mentioned this pull request Dec 3, 2021

jit/intrinsics/createspan_il/createspan_il.sh #62285

Closed

AndyAyersMS mentioned this pull request Dec 3, 2021

JIT: three fixes for OSR #62367

Merged

Merge remote-tracking branch 'upstream/main' into OSROnByDefaultForx64

77a2172

ghost closed this Jan 8, 2022

AndyAyersMS mentioned this pull request Feb 2, 2022

Dynamic PGO work planned for .NET 7 #64659

Closed

10 tasks

ghost locked as resolved and limited conversation to collaborators Feb 8, 2022

This pull request was closed.

Enable QJFL and OSR by default for x64 #61934

Enable QJFL and OSR by default for x64 #61934

Conversation

AndyAyersMS commented Nov 22, 2021

ghost commented Nov 22, 2021

AndyAyersMS commented Nov 22, 2021

AndyAyersMS commented Nov 22, 2021

AndyAyersMS commented Nov 22, 2021

azure-pipelines bot commented Nov 22, 2021

azure-pipelines bot commented Nov 22, 2021

AndyAyersMS commented Nov 22, 2021

AndyAyersMS commented Nov 23, 2021

AndyAyersMS commented Nov 23, 2021

AndyAyersMS commented Nov 23, 2021

AndyAyersMS commented Nov 23, 2021

azure-pipelines bot commented Nov 23, 2021

BruceForstall commented Nov 24, 2021

AndyAyersMS commented Nov 24, 2021

AndyAyersMS commented Nov 24, 2021

AndyAyersMS commented Nov 24, 2021

AndyAyersMS commented Nov 24, 2021

AndyAyersMS commented Dec 2, 2021

AndyAyersMS commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

azure-pipelines bot commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

azure-pipelines bot commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

azure-pipelines bot commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

AndyAyersMS commented Dec 3, 2021

azure-pipelines bot commented Dec 3, 2021

AndyAyersMS commented Dec 6, 2021

AndyAyersMS commented Dec 7, 2021

AndyAyersMS commented Dec 7, 2021

AndyAyersMS commented Dec 8, 2021

AndyAyersMS commented Dec 8, 2021 • edited Loading

AndyAyersMS commented Dec 9, 2021

AndyAyersMS commented Dec 9, 2021

EgorBo commented Dec 9, 2021

AndyAyersMS commented Dec 9, 2021

ghost commented Jan 8, 2022

AndyAyersMS commented Dec 8, 2021 •

edited

Loading