Loop unrolling support in RyuJIT #4248

AndreyAkinshin · 2015-05-12T23:04:39Z

LegacyJIT-x64 can unroll some loops and transform something like

for (int i = 0; i < 1024; i++)
    Foo(i);

to something like

for (int i = 0; i < 1024; i += 4)
{
    Foo(i);
    Foo(i + 1);
    Foo(i + 2);
    Foo(i + 3);
}

Also LegacyJIT-x64 can transform small loops like

for (int i = 0; i < 4; i++)
    Foo(i);

to

Foo(0);
Foo(1);
Foo(2);
Foo(3);

I like this feature because it can increase performance in some cases.
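
A minimal hand-written sketch of the first transformation, for a trip count that need not be a multiple of four. The names `UnrollSketch`, `SumRolled`, `SumUnrolledBy4` and the body of `Foo` are hypothetical and only for illustration; this is not what either JIT actually emits.

```csharp
using System;

public static class UnrollSketch
{
    // Hypothetical stand-in for the Foo(i) call in the examples above.
    private static int Foo(int i) => i * 2 + 1;

    // Original (rolled) form: one call and one branch per iteration.
    public static int SumRolled(int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += Foo(i);
        return sum;
    }

    // Hand-unrolled by 4: the main loop handles full groups of four,
    // and a remainder loop handles the last n % 4 iterations.
    public static int SumUnrolledBy4(int n)
    {
        int sum = 0;
        int i = 0;
        for (; i + 3 < n; i += 4)
        {
            sum += Foo(i);
            sum += Foo(i + 1);
            sum += Foo(i + 2);
            sum += Foo(i + 3);
        }
        for (; i < n; i++)
            sum += Foo(i);
        return sum;
    }
}
```

Both methods return the same value for any `n`; the unrolled version simply executes fewer loop-control branches per call to `Foo`.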

Is it possible to implement loop unrolling in RyuJIT?

category:cq
theme:loop-opt
skill-level:expert
cost:large

mikedn · 2015-05-13T07:37:26Z

The x86 JIT on which RyuJIT is based did some unrolling. As far as I can tell the code is still there but it doesn't run; see optUnrollLoops in optimizer.cpp. It doesn't run because optCanCloneLoops always returns true; probably loop cloning (new to RyuJIT) somehow interferes with the old loop unrolling code. That said, the unrolling done by the x86 JIT isn't great:

for (int i = 0; i < 3; i++)
    sum += i;

generates

inc         eax  
inc         eax  
inc         eax

Good loop unrolling isn't trivial and I doubt that the existing unrolling code can be significantly improved.

mattwarren · 2015-05-13T16:54:09Z

@mikedn

That said, the unrolling done by the x86 JIT isn't great:
for (int i = 0; i < 3; i++)
    sum += i;
generates
inc         eax  
inc         eax  
inc         eax

Forgive the dumb question (I'm trying to learn about the JIT), but what would you expect it to generate? Something like this (or whatever the correct assembly is for adding 3):

add eax, 3

Or is that too much to expect?

AndreyAkinshin · 2015-05-13T17:19:49Z

@mikedn, @mattwarren

Here are the asm listings of the method

[MethodImpl(MethodImplOptions.NoInlining)]
public int Run()
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += i;
    return sum;
}

for different JIT versions:

LegacyJIT-x86:

00F33562  in          al,dx  
00F33563  xor         eax,eax  
00F33565  inc         eax  
00F33566  inc         eax  
00F33567  inc         eax  
00F33568  pop         ebp  
00F33569  ret

LegacyJIT-x64:

00007FF914114470  mov         eax,3  
00007FF914114475  ret

RyuJIT-x64 RC:

00007FF9140F4230  xor         eax,eax  
00007FF9140F4232  xor         edx,edx  
00007FF9140F4234  add         eax,edx  
00007FF9140F4236  inc         edx  
00007FF9140F4238  cmp         edx,3  
00007FF9140F423B  jl          00007FF9140F4234  
00007FF9140F423D  ret

mikedn · 2015-05-13T17:30:56Z

@mattwarren Yes, add eax, 3 is expected for the code that I posted. The real version includes a sum = 0, though, so it's really mov eax, 3 as in the LegacyJIT-x64 version posted above by @AndreyAkinshin.

In case you wonder how three increment instructions were produced: the loop got unrolled as sum += 0; sum += 1; sum += 2;. The first addition was eliminated because it has no effect, the second became a single inc eax, and the last was emitted as inc eax, inc eax because this sequence is one byte shorter than add eax, 2.
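
The steps above can be sketched at the source level as follows. The class and method names are hypothetical, and the real transformation happens on the JIT's internal representation, not on C# source; this only illustrates the reasoning.

```csharp
public static class FoldSketch
{
    // The loop as written in the example.
    public static int Rolled()
    {
        int sum = 0;
        for (int i = 0; i < 3; i++)
            sum += i;
        return sum;
    }

    // After full unrolling: i is replaced by its constant value in each copy.
    public static int Unrolled()
    {
        int sum = 0;
        sum += 0; // has no effect, so the x86 JIT drops it
        sum += 1; // emitted as inc eax
        sum += 2; // emitted as inc eax, inc eax
        return sum;
    }

    // What folding the constants across the unrolled statements gives,
    // matching the LegacyJIT-x64 output (mov eax, 3).
    public static int Folded() => 3;
}
```

All three methods return the same value; the difference between the JITs is how far along this chain each one gets.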

@AndreyAkinshin Your LegacyJIT-x86 listing starts with the wrong instruction, in al,dx. It's really a push ebp, but you see in al,dx because the VS disassembly window has a bug. Not that it matters here.

AndreyAkinshin · 2015-05-13T17:39:57Z

It's really a push ebp but you see in al,dx because the VS disassembly window has a bug.

@mikedn, Thanks, it explains a lot!

BruceForstall · 2015-05-13T20:47:35Z

/cc @briansull @schellap

hez2010 · 2020-05-29T14:06:58Z

An interesting sample:

public int Run()
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum = i;
    return sum;
}

LegacyJIT:

    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x7
    L0008: pop ebp
    L0009: ret

RyuJIT:

    L0000: push ebp
    L0001: mov ebp, esp
    L0003: xor eax, eax
    L0005: lea edx, [eax+1]
    L0008: cmp edx, 8
    L000b: jl short L000f
    L000d: pop ebp
    L000e: ret
    L000f: mov eax, edx
    L0011: jmp short L0005

Anyway, I hope that loop unrolling and auto-vectorization can be implemented in RyuJIT ASAP :)

BruceForstall · 2020-10-30T00:15:52Z

Only a few tests in the tree cause loop unrolling to kick in, since the current heuristic requires a loop whose constant bound is a SIMD vector length:

JIT\HardwareIntrinsics\X86\Regression\GitHub_22815\GitHub_22815_ro\GitHub_22815_ro.cmd
JIT\Performance\CodeQuality\SIMD\SeekUnroll\SeekUnroll\SeekUnroll.cmd
JIT\Regression\JitBlue\GitHub_8231\GitHub_8231\GitHub_8231.cmd
JIT\SIMD\CreateGeneric_ro\CreateGeneric_ro.cmd
JIT\SIMD\CtorFromArray_ro\CtorFromArray_ro.cmd
JIT\SIMD\VectorAbs_ro\VectorAbs_ro.cmd
JIT\SIMD\VectorAdd_ro\VectorAdd_ro.cmd
JIT\SIMD\VectorArray_ro\VectorArray_ro.cmd
JIT\SIMD\VectorCeilFloor_ro\VectorCeilFloor_ro.cmd
JIT\SIMD\VectorDiv_ro\VectorDiv_ro.cmd
JIT\SIMD\VectorGet_ro\VectorGet_ro.cmd
JIT\SIMD\VectorHWAccel_ro\VectorHWAccel_ro.cmd
JIT\SIMD\VectorHWAccel2_ro\VectorHWAccel2_ro.cmd
JIT\SIMD\VectorMax_ro\VectorMax_ro.cmd
JIT\SIMD\VectorMin_ro\VectorMin_ro.cmd
JIT\SIMD\VectorMul_ro\VectorMul_ro.cmd
JIT\SIMD\VectorReturn_ro\VectorReturn_ro.cmd
JIT\SIMD\VectorSub_ro\VectorSub_ro.cmd

With

COMPlus_JitStressModeNames=STRESS_UNROLL_LOOPS

(and COMPlus_TieredCompilation=0), which allows unrolling for any counted loop (not just those with SIMD element count bounds), 331 tests unroll a loop, but many of them unroll the same function, such as System.SpanHelpers:LastIndexOf().

msftgits transferred this issue from dotnet/coreclr on Jan 30, 2020

msftgits added this to the Future milestone on Jan 30, 2020

abelbraaksma mentioned this issue on Jun 18, 2020: Improve perf or String.iter and String.iteri by 25% dotnet/fsharp#9497 (Closed)

AndyAyersMS mentioned this issue on Aug 20, 2020: Loop Unrolling is not Enabled in Release Build #41063 (Closed)

BruceForstall mentioned this issue on Oct 17, 2020: Improve JIT loop optimizations (.NET 6) #43549 (Closed, 25 tasks)

BruceForstall added and then removed the JitUntriaged label (CLR JIT issues needing additional triage) on Oct 28, 2020

BruceForstall mentioned this issue on Jul 6, 2021: Improve JIT loop optimizations (.NET 7) #55235 (Closed, 5 tasks)

BruceForstall mentioned this issue on Feb 15, 2022: Improve JIT loop optimizations #65342 (Open, 24 tasks)

ddrinka mentioned this issue on Jul 18, 2022: Optimized reader reviewed ddrinka/ApacheOrcDotNet#9 (Merged)
