Loop unrolling support in RyuJIT #4248

Open
AndreyAkinshin opened this issue May 12, 2015 · 8 comments
Labels
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), enhancement (Product code improvement that does NOT require public API changes/additions), optimization, tenet-performance (Performance related issue)
@AndreyAkinshin
Member

LegacyJIT-x64 can unroll some loops and transform something like

for (int i = 0; i < 1024; i++)
    Foo(i);

to something like

for (int i = 0; i < 1024; i += 4)
{
    Foo(i);
    Foo(i + 1);
    Foo(i + 2);
    Foo(i + 3);
}

Also LegacyJIT-x64 can transform small loops like

for (int i = 0; i < 4; i++)
    Foo(i);

to

Foo(0);
Foo(1);
Foo(2);
Foo(3);

I like this feature because it can increase performance in some cases.

Is it possible to implement loop unrolling in RyuJIT?
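The partial unrolling described above can be sketched by hand. This is only an illustration of the transformation, not what any JIT actually emits: the class and method names are hypothetical, `Foo` is modeled as an `Action<int>` callback, and a remainder loop is added (an assumption beyond the example above) so that trip counts not divisible by 4 still work.

```csharp
using System;

public static class UnrollSketch
{
    // Original form: one call and one loop test per iteration.
    public static void Rolled(int n, Action<int> foo)
    {
        for (int i = 0; i < n; i++)
            foo(i);
    }

    // Hand-unrolled by a factor of 4: one loop test per four calls.
    public static void UnrolledBy4(int n, Action<int> foo)
    {
        int i = 0;

        // Main body: runs while at least four iterations remain.
        for (; i + 3 < n; i += 4)
        {
            foo(i);
            foo(i + 1);
            foo(i + 2);
            foo(i + 3);
        }

        // Remainder: handles the last 0 to 3 iterations.
        for (; i < n; i++)
            foo(i);
    }
}
```

For a constant bound such as 1024 the remainder loop never runs and could be dropped, which gives the shape shown in the issue text.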

See also:

category:cq
theme:loop-opt
skill-level:expert
cost:large

@mikedn
Contributor

mikedn commented May 13, 2015

The x86 JIT on which RyuJIT is based did some unrolling. As far as I can tell the code is still there, but it doesn't run; see optUnrollLoops in optimizer.cpp. It doesn't run because optCanCloneLoops always returns true. Loop cloning (new to RyuJIT) probably interferes somehow with the old loop unrolling code. That said, the unrolling done by the x86 JIT isn't great:

for (int i = 0; i < 3; i++)
    sum += i;

generates

inc         eax  
inc         eax  
inc         eax  

Good loop unrolling isn't trivial and I doubt that the existing unrolling code can be significantly improved.

@mattwarren
Contributor

@mikedn

> That said, the unrolling done by the x86 JIT isn't great:
>
> for (int i = 0; i < 3; i++)
>     sum += i;
>
> generates
>
> inc         eax
> inc         eax
> inc         eax

Forgive the dumb question (I'm trying to learn about the JIT), but what would you expect it to generate? Something like this (or whatever the correct assembly is for adding 3):

add eax, 3

Or is that too much to expect?

@AndreyAkinshin
Member Author

@mikedn, @mattwarren

Here are the asm listings of the method

[MethodImpl(MethodImplOptions.NoInlining)]
public int Run()
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += i;
    return sum;
}

for different JIT versions:

LegacyJIT-x86:

00F33562  in          al,dx  
00F33563  xor         eax,eax  
00F33565  inc         eax  
00F33566  inc         eax  
00F33567  inc         eax  
00F33568  pop         ebp  
00F33569  ret  

LegacyJIT-x64:

00007FF914114470  mov         eax,3  
00007FF914114475  ret  

RyuJIT-x64 RC:

00007FF9140F4230  xor         eax,eax  
00007FF9140F4232  xor         edx,edx  
00007FF9140F4234  add         eax,edx  
00007FF9140F4236  inc         edx  
00007FF9140F4238  cmp         edx,3  
00007FF9140F423B  jl          00007FF9140F4234  
00007FF9140F423D  ret 

@mikedn
Contributor

mikedn commented May 13, 2015

@mattwarren Yes, add eax, 3 is what I'd expect for the code that I posted. The real version includes sum = 0, though, so it's really mov eax, 3, as in the LegacyJIT-x64 listing posted above by @AndreyAkinshin.

And in case you wonder how three increment instructions were produced: the loop got unrolled into sum += 0; sum += 1; sum += 2. The first addition was eliminated because it has no effect, and the last addition was emitted as inc eax, inc eax because that sequence is one byte shorter than add eax, 2.
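As a rough illustration of the steps described in this comment, the same method can be written in its rolled, fully unrolled, and constant-folded forms. The class and method names are hypothetical, and the stages are a hand-written sketch rather than actual JIT output.

```csharp
public static class SumSketch
{
    // The method as written in the issue.
    public static int Loop()
    {
        int sum = 0;
        for (int i = 0; i < 3; i++)
            sum += i;
        return sum;
    }

    // Fully unrolled: one statement per iteration, no loop counter left.
    public static int Unrolled()
    {
        int sum = 0;
        sum += 0; // has no effect, so a compiler can drop it
        sum += 1; // inc eax
        sum += 2; // inc eax, inc eax in the LegacyJIT-x86 listing
        return sum;
    }

    // With constant folding on top, only the final value remains,
    // which corresponds to mov eax, 3 in the LegacyJIT-x64 listing.
    public static int Folded() => 3;
}
```

On the size argument: in 32-bit x86, inc eax encodes as the single byte 40, while add eax, 2 encodes as the three bytes 83 C0 02, so two increments take two bytes instead of three.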

@AndreyAkinshin Your LegacyJIT-x86 listing starts with the wrong instruction, in al,dx. It's really a push ebp, but you see in al,dx because the VS disassembly window has a bug. Not that it matters here.

@AndreyAkinshin
Member Author

> It's really a push ebp but you see in al,dx because the VS disassembly window has a bug.

@mikedn, Thanks, it explains a lot!

@BruceForstall
Member

/cc @briansull @schellap

msftgits transferred this issue from dotnet/coreclr on Jan 30, 2020
msftgits added this to the Future milestone on Jan 30, 2020
@hez2010
Contributor

hez2010 commented May 29, 2020

An interesting sample:

public int Run()
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum = i;
    return sum;
}

LegacyJIT:

    L0000: push ebp
    L0001: mov ebp, esp
    L0003: mov eax, 0x7
    L0008: pop ebp
    L0009: ret

RyuJIT:

    L0000: push ebp
    L0001: mov ebp, esp
    L0003: xor eax, eax
    L0005: lea edx, [eax+1]
    L0008: cmp edx, 8
    L000b: jl short L000f
    L000d: pop ebp
    L000e: ret
    L000f: mov eax, edx
    L0011: jmp short L0005

Anyway, I hope loop unrolling and auto-vectorization can be implemented in RyuJIT ASAP :)
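One way to read the LegacyJIT result for this sample: once the loop is fully unrolled, every assignment to sum is overwritten by the next one, so only the last store survives. The sketch below spells that out by hand; the names are hypothetical and this is not a claim about how LegacyJIT reaches mov eax, 0x7 internally.

```csharp
public static class LastValueSketch
{
    // The method as posted: each iteration overwrites sum.
    public static int Loop()
    {
        int sum = 0;
        for (int i = 0; i < 8; i++)
            sum = i;
        return sum;
    }

    // After full unrolling, the stores sum = 0 ... sum = 6 are dead,
    // because each is overwritten before being read. Only sum = 7 remains.
    public static int Folded() => 7;
}
```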

@BruceForstall
Member

BruceForstall commented Oct 30, 2020

Only a few tests in the tree cause loop unrolling to kick in, since the current heuristic requires a counted loop whose constant bound is a SIMD vector element count:

JIT\HardwareIntrinsics\X86\Regression\GitHub_22815\GitHub_22815_ro\GitHub_22815_ro.cmd
JIT\Performance\CodeQuality\SIMD\SeekUnroll\SeekUnroll\SeekUnroll.cmd
JIT\Regression\JitBlue\GitHub_8231\GitHub_8231\GitHub_8231.cmd
JIT\SIMD\CreateGeneric_ro\CreateGeneric_ro.cmd
JIT\SIMD\CtorFromArray_ro\CtorFromArray_ro.cmd
JIT\SIMD\VectorAbs_ro\VectorAbs_ro.cmd
JIT\SIMD\VectorAdd_ro\VectorAdd_ro.cmd
JIT\SIMD\VectorArray_ro\VectorArray_ro.cmd
JIT\SIMD\VectorCeilFloor_ro\VectorCeilFloor_ro.cmd
JIT\SIMD\VectorDiv_ro\VectorDiv_ro.cmd
JIT\SIMD\VectorGet_ro\VectorGet_ro.cmd
JIT\SIMD\VectorHWAccel_ro\VectorHWAccel_ro.cmd
JIT\SIMD\VectorHWAccel2_ro\VectorHWAccel2_ro.cmd
JIT\SIMD\VectorMax_ro\VectorMax_ro.cmd
JIT\SIMD\VectorMin_ro\VectorMin_ro.cmd
JIT\SIMD\VectorMul_ro\VectorMul_ro.cmd
JIT\SIMD\VectorReturn_ro\VectorReturn_ro.cmd
JIT\SIMD\VectorSub_ro\VectorSub_ro.cmd

With

COMPlus_JitStressModeNames=STRESS_UNROLL_LOOPS

(and COMPlus_TieredCompilation=0), which allows unrolling of any counted loop (not just those with SIMD element count bounds), 331 tests unroll a loop, but many of them unroll the same function, such as System.SpanHelpers:LastIndexOf().
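For orientation, a loop bounded by a SIMD element count might look like the sketch below. The class and method names are hypothetical. Vector<int>.Count is a constant for a given machine, so the trip count is known at JIT time. Whether a particular runtime version actually unrolls this exact loop is an assumption and would need to be checked in a JIT dump.

```csharp
using System.Numerics;

public static class VectorLoopSketch
{
    // Sums the lanes of a vector one element at a time.
    // The loop bound is the SIMD element count, the shape of loop
    // that the comment above describes the heuristic as targeting.
    public static int SumElements(Vector<int> v)
    {
        int sum = 0;
        for (int i = 0; i < Vector<int>.Count; i++)
            sum += v[i];
        return sum;
    }
}
```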

6 participants