JIT: examples where loop cloning is not useful #8558

AndyAyersMS · 2017-07-15T01:17:16Z

In some cases loops are cloned with no performance benefit, and it would be better to not clone the loop at all, as the jit would run faster and produce smaller code.

For instance in benchi\iniarray, the two cloned copies are identical (in fact the array length is known and you can see there are no bounds checks):

G_M24601_IG03:
       33C9                 xor      ecx, ecx
       83780810             cmp      dword ptr [rax+8], 16
       7C15                 jl       SHORT G_M24601_IG05

G_M24601_IG04:
       4C63C1               movsxd   r8, ecx
       6642C74440102000     mov      word  ptr [rax+2*r8+16], 32
       FFC1                 inc      ecx
       83F910               cmp      ecx, 16
       7CEE                 jl       SHORT G_M24601_IG04
       EB14                 jmp      SHORT G_M24601_IG06

G_M24601_IG05:
       4C63C1               movsxd   r8, ecx
       6642C74440102000     mov      word  ptr [rax+2*r8+16], 32
       FFC1                 inc      ecx
       83F910               cmp      ecx, 16
       7CEE                 jl       SHORT G_M24601_IG05

G_M24601_IG06:
       FFC2                 inc      edx
       81FA80969800         cmp      edx, 0x989680
       7CC8                 jl       SHORT G_M24601_IG03

category:cq
theme:loop-opt
skill-level:expert
cost:medium

mikedn · 2017-07-15T05:39:06Z

Loop cloning has all kinds of issues, some are mentioned in #4922 (via #4929).

mikedn · 2017-12-18T16:59:24Z

Similar example, with a local variable instead of a constant:

static int Test(int[] a)
{
    int s = 0;
    int l = a.Length;
    for (int i = 0; i < l; i++)
        s += a[i];
    return s;
}

Picked up from actual corelib code:
https://github.com/dotnet/coreclr/blob/ff440aaee646319417522c11b34ab4fb668f4894/src/mscorlib/src/System/AppDomain.cs#L518-L526

AndyAyersMS · 2018-03-13T17:32:09Z

The bytemark ludcmp case is particularly instructive.

Loop cloning looks like it runs in lexical order of loops, which generally means preorder over the loop nesting tree. In ludcmp we have 9 loops total with the following sort of nesting structure:

     for1
        for2
     for3
        for4
           for5
        for6
           for7
        for8
        for9

because of top-down cloning we end up with 2 copies of each top level loop body, 3 copies of each nested loop, and 4 copies of each doubly-nested loop:

   choose1
     for1.c
       choose2
         for2.c
         for2.nc
     for1.nc
       for2.nc
   choose3     
     for3.c
       choose4
         for4.c
           choose5
             for5.c
             for5.nc
         for4.nc
           for5.nc
       choose6
         for6.c
           choose7
             for7.c
             for7.nc
         for6.nc
           for7.nc
       choose8
         for8.c
         for8.nc
       for9
     for3.nc
       for4.nc
         for5.nc
       for6.nc
          for7.nc
       for8.nc
          for9.nc

In this case there is a lot of imperfect nesting (outer loops have computation as well as invoking inner loops) and so a ton of code duplication for what is likely even in best case minimal gains. And in this case actual losses.

So potential code growth from cloning is apparently quadratic in nesting depth. It might be amusing to create a very deep nest just to see how insane this becomes.

If we think loop cloning is a good idea it should run bottom up as the inner loops are almost certainly hotter than the outer ones, and there should be some size/speed estimation in the model that would eventually halt cloning as it works its way up the loop nesting tree before it gets to this crazy level of duplication.

AndyAyersMS · 2018-07-27T19:04:43Z

Another example with some odd behavior. I'd expect F1 and F2 to produce similar code, but F1 has cloning and F2 doesn't (likewise for F4 and F5).

using System;
using System.Runtime.CompilerServices;

public struct S
{
    public int[] a;
    public int c;
}

public class X
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F0(ref S s)
    {
        int sum = 0;

        for (int i = 0; i < s.c; i++)
        {
            sum += s.a[i];
        }

        return sum;

    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F1(ref S s)
    {
        S sl = s;

        int sum = 0;

        for (int i = 0; i < sl.c; i++)
        {
            sum += sl.a[i];
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F2(ref S s)
    {
        S sl = s;
        int[] al = sl.a;

        int sum = 0;

        for (int i = 0; i < s.c; i++)
        {
            sum += al[i];
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F3(S s)
    {
        int sum = 0;

        for (int i = 0; i < s.c; i++)
        {
            sum += s.a[i];
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F4(S s)
    {
        S sl = s;

        int sum = 0;

        for (int i = 0; i < sl.c; i++)
        {
            sum += sl.a[i];
        }

        return sum;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int F5(S s)
    {
        S sl = s;
        int[] al = sl.a;

        int sum = 0;

        for (int i = 0; i < s.c; i++)
        {
            sum += al[i];
        }

        return sum;
    }

    public static int Main()
    {
        S s = new S();
        s.a = new int[] { 0, 1, 2, 3, 4, 5 };
        s.c = 4;
        
        int s0 = F0(ref s);
        int s1 = F1(ref s);
        int s2 = F2(ref s);
        int s3 = F3(s);
        int s4 = F4(s);
        int s5 = F5(s);

        return s0 + s1 + s2 + s3 + s4 + s5 + 64;
    }
}

Also thinking if we really can split the two versions as "won't throw / must throw" then the latter should be moved to the cold part of the method.

If it's "won't throw / may throw" then perhaps not...

EgorBo · 2020-08-20T16:19:11Z

@AndyAyersMS Proposal: lower limit for constant-size loops: EgorBo@0efefa1
Example:

public static int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += arr[i];
    return sum;
}

With loop-cloning:

; Method Program:SumOfFirstThreeElements(System.Int32[]):int
G_M22200_IG01:
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M22200_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       4885C9               test     rcx, rcx
       7417                 je       SHORT G_M22200_IG06
						;; bbWeight=1    PerfScore 1.75

G_M22200_IG03:
       83790803             cmp      dword ptr [rcx+8], 3
       7C11                 jl       SHORT G_M22200_IG06
						;; bbWeight=1    PerfScore 3.00

G_M22200_IG04:
       4C63C2               movsxd   r8, edx
       4203448110           add      eax, dword ptr [rcx+4*r8+16]
       FFC2                 inc      edx
       83FA03               cmp      edx, 3
       7CF1                 jl       SHORT G_M22200_IG04
						;; bbWeight=4    PerfScore 15.00

G_M22200_IG05:
       EB14                 jmp      SHORT G_M22200_IG07
						;; bbWeight=1    PerfScore 2.00

G_M22200_IG06:
       3B5108               cmp      edx, dword ptr [rcx+8]
       7314                 jae      SHORT G_M22200_IG08
       4C63C2               movsxd   r8, edx
       4203448110           add      eax, dword ptr [rcx+4*r8+16]
       FFC2                 inc      edx
       83FA03               cmp      edx, 3
       7CEC                 jl       SHORT G_M22200_IG06
						;; bbWeight=4    PerfScore 27.00

G_M22200_IG07:
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25

G_M22200_IG08:
       E81E78755F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 67

Without loop-cloning for small constant-size loops (by small I mean amount of iterations)

; Method Program:SumOfFirstThreeElements(System.Int32[]):int
G_M22200_IG01:
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25

G_M22200_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       448B4108             mov      r8d, dword ptr [rcx+8]
						;; bbWeight=1    PerfScore 2.50

G_M22200_IG03:
       413BD0               cmp      edx, r8d
       7314                 jae      SHORT G_M22200_IG05
       4C63CA               movsxd   r9, edx
       4203448910           add      eax, dword ptr [rcx+4*r9+16]
       FFC2                 inc      edx
       83FA03               cmp      edx, 3
       7CEC                 jl       SHORT G_M22200_IG03
						;; bbWeight=4    PerfScore 20.00

G_M22200_IG04:
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25

G_M22200_IG05:
       E8B679765F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 43

EgorBo · 2020-08-31T10:38:20Z

Suggestion: don't clone loops with unmanaged calls inside - it leads to a huge size increase, e.g.

[DllImport("Kernel32")]
//[SuppressGCTransition]
static extern int GetTickCount();

static void Test(int[] array)
{
    for (int i = 0; i < 100; i++) 
        DoWork(array[i] + GetTickCount());
}

[MethodImpl(MethodImplOptions.NoInlining)]
static void DoWork(int t) {}

; Method Program:Test(System.Int32[])
G_M38108_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       4155                 push     r13
       4154                 push     r12
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC68             sub      rsp, 104
       488DAC24A0000000     lea      rbp, [rsp+A0H]
       488BF1               mov      rsi, rcx
						;; bbWeight=1    PerfScore 9.00

G_M38108_IG02:
       488D4D88             lea      rcx, [rbp-78H]
       498BD2               mov      rdx, r10
       E8B9B8765F           call     CORINFO_HELP_INIT_PINVOKE_FRAME
       488BF8               mov      rdi, rax
       488BC4               mov      rax, rsp
       488945A8             mov      qword ptr [rbp-58H], rax
       488BC5               mov      rax, rbp
       488945B8             mov      qword ptr [rbp-48H], rax
       33DB                 xor      ebx, ebx
       4885F6               test     rsi, rsi
       746E                 je       SHORT G_M38108_IG09
						;; bbWeight=1    PerfScore 6.00

G_M38108_IG03:
       837E0864             cmp      dword ptr [rsi+8], 100
       7C68                 jl       SHORT G_M38108_IG09
						;; bbWeight=1    PerfScore 3.00

G_M38108_IG04:
       4863C3               movsxd   rax, ebx
       48897510             mov      gword ptr [rbp+10H], rsi
       448B748610           mov      r14d, dword ptr [rsi+4*rax+16]
       48B818545785FC7F0000 mov      rax, 0x7FFC85575418
       48894598             mov      qword ptr [rbp-68H], rax
       488D0516000000       lea      rax, G_M38108_IG06
       488945B0             mov      qword ptr [rbp-50H], rax
       488D4588             lea      rax, bword ptr [rbp-78H]
       48894710             mov      qword ptr [rdi+16], rax
       C6470C00             mov      byte  ptr [rdi+12], 0
						;; bbWeight=4    PerfScore 36.00

G_M38108_IG05:
       FF1504802400         call     [Program:GetTickCount():int]
						;; bbWeight=4    PerfScore 12.00

G_M38108_IG06:
       C6470C01             mov      byte  ptr [rdi+12], 1
       833D8998F75F00       cmp      dword ptr [(reloc)], 0
       7406                 je       SHORT G_M38108_IG07
       FF1591DBF55F         call     [CORINFO_HELP_STOP_FOR_GC]
						;; bbWeight=4    PerfScore 28.00

G_M38108_IG07:
       488B4D90             mov      rcx, bword ptr [rbp-70H]
       48894F10             mov      qword ptr [rdi+16], rcx
       428D0C30             lea      ecx, [rax+r14]
       E8C881FEFF           call     Program:DoWork(int)
       FFC3                 inc      ebx
       83FB64               cmp      ebx, 100
       488B7510             mov      rsi, gword ptr [rbp+10H]
       7C9A                 jl       SHORT G_M38108_IG04
						;; bbWeight=4    PerfScore 24.00

G_M38108_IG08:
       EB6B                 jmp      SHORT G_M38108_IG13
						;; bbWeight=1    PerfScore 2.00

G_M38108_IG09:
       3B5E08               cmp      ebx, dword ptr [rsi+8]
       7377                 jae      SHORT G_M38108_IG14
       4863C3               movsxd   rax, ebx
       48897510             mov      gword ptr [rbp+10H], rsi
       448B748610           mov      r14d, dword ptr [rsi+4*rax+16]
       48B818545785FC7F0000 mov      rax, 0x7FFC85575418
       48894598             mov      qword ptr [rbp-68H], rax
       488D0516000000       lea      rax, G_M38108_IG11
       488945B0             mov      qword ptr [rbp-50H], rax
       488D4588             lea      rax, bword ptr [rbp-78H]
       48894710             mov      qword ptr [rdi+16], rax
       C6470C00             mov      byte  ptr [rdi+12], 0
						;; bbWeight=4    PerfScore 48.00

G_M38108_IG10:
       FF15977F2400         call     [Program:GetTickCount():int]
						;; bbWeight=4    PerfScore 12.00

G_M38108_IG11:
       C6470C01             mov      byte  ptr [rdi+12], 1
       833D1C98F75F00       cmp      dword ptr [(reloc)], 0
       7406                 je       SHORT G_M38108_IG12
       FF1524DBF55F         call     [CORINFO_HELP_STOP_FOR_GC]
						;; bbWeight=4    PerfScore 28.00

G_M38108_IG12:
       488B4D90             mov      rcx, bword ptr [rbp-70H]
       48894F10             mov      qword ptr [rdi+16], rcx
       428D0C30             lea      ecx, [rax+r14]
       E85B81FEFF           call     Program:DoWork(int)
       FFC3                 inc      ebx
       83FB64               cmp      ebx, 100
       488B7510             mov      rsi, gword ptr [rbp+10H]
       7C95                 jl       SHORT G_M38108_IG09
						;; bbWeight=4    PerfScore 24.00

G_M38108_IG13:
       488D65C8             lea      rsp, [rbp-38H]
       5B                   pop      rbx
       5E                   pop      rsi
       5F                   pop      rdi
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      
						;; bbWeight=1    PerfScore 5.50

G_M38108_IG14:
       E83281765F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00
; Total bytes of code: 303

AndyAyersMS · 2020-10-21T16:29:59Z

Another small trip count example (from this stack overflow issue)

using System;

public class C {
    static void M1(int[] left, int[] right)
    {
        for (int i = 0; i < 5; i++)
        {
            left[i] = 1;
            right[i] = 1;
        }
    }  
    
    static void M2(int[] left, int[] right)
    {
        for (int i = 0; i < 10; i+=2)
        {
            left[i] = 1;
            right[i] = 1;
        }
    } 
}

We only clone the loop in M1. The preamble to get to the cloned loop is pretty costly, first we have to verify neither array is null, then that all array accesses are in bounds; as a result we end up with slower code.

G_M32658_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M32658_IG02:
       xor      eax, eax
       test     rcx, rcx
       setne    r8b
       movzx    r8, r8b
       test     rdx, rdx
       setne    r9b
       movzx    r9, r9b
       test     r8d, r9d
       je       SHORT G_M32658_IG06
						;; bbWeight=1    PerfScore 4.50
G_M32658_IG03:
       cmp      dword ptr [rcx+8], 5
       setge    r8b
       movzx    r8, r8b
       cmp      dword ptr [rdx+8], 5
       setge    r9b
       movzx    r9, r9b
       test     r8d, r9d
       je       SHORT G_M32658_IG06
						;; bbWeight=1    PerfScore 7.75
G_M32658_IG04:
       movsxd   r8, eax
       mov      dword ptr [rcx+4*r8+16], 1
       mov      dword ptr [rdx+4*r8+16], 1
       inc      eax
       cmp      eax, 5
       jl       SHORT G_M32658_IG04

Not sure why we materialize all those bools and then test them, seems like we should perhaps just branch after each test.

jakobbotsch · 2024-08-16T10:44:32Z

#96851 (comment) has another example in regex of a questionable cloned loop.

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

AndyAyersMS mentioned this issue Feb 19, 2020

[Perf][x86] Investigate performance regression in FannkuchRedux_2 and FannkuchRedux_5 #9836

Closed

AndyAyersMS mentioned this issue Aug 20, 2020

Loop Unrolling is not Enabled in Release Build #41063

Closed

BruceForstall mentioned this issue Oct 17, 2020

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Jan 15, 2021

BruceForstall modified the milestones: Future, 6.0.0 Jan 15, 2021

JulieLeeMSFT assigned BruceForstall Mar 23, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

BruceForstall removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 8, 2021

BruceForstall mentioned this issue May 13, 2021

Duplicated loop body #52586

Closed

BruceForstall modified the milestones: 6.0.0, 7.0.0 Jul 6, 2021

BruceForstall mentioned this issue Jul 6, 2021

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations #65342

Open

24 tasks

BruceForstall modified the milestones: 7.0.0, 8.0.0 May 26, 2022

BruceForstall modified the milestones: 8.0.0, 9.0.0 Jul 7, 2023

This was referenced Jan 5, 2024

JIT: Remove BBJ_RETURN loop cloning quirk #96555

Merged

JIT: Avoid cloning cold descendant loops quadratically #96580

Closed

JulieLeeMSFT added the Priority:2 Work that is important, but not critical for the release label May 7, 2024

JulieLeeMSFT removed the Priority:2 Work that is important, but not critical for the release label Jun 13, 2024

JulieLeeMSFT modified the milestones: 9.0.0, 10.0.0 Jul 25, 2024

jakobbotsch mentioned this issue Aug 16, 2024

[Perf] Linux/arm64: 3 Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock #96851

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JIT: examples where loop cloning is not useful #8558

JIT: examples where loop cloning is not useful #8558

AndyAyersMS commented Jul 15, 2017

mikedn commented Jul 15, 2017

mikedn commented Dec 18, 2017

AndyAyersMS commented Mar 13, 2018

AndyAyersMS commented Jul 27, 2018

EgorBo commented Aug 20, 2020 •

edited

Loading

EgorBo commented Aug 31, 2020 •

edited

Loading

AndyAyersMS commented Oct 21, 2020

jakobbotsch commented Aug 16, 2024

JIT: examples where loop cloning is not useful #8558

JIT: examples where loop cloning is not useful #8558

Comments

AndyAyersMS commented Jul 15, 2017

mikedn commented Jul 15, 2017

mikedn commented Dec 18, 2017

AndyAyersMS commented Mar 13, 2018

AndyAyersMS commented Jul 27, 2018

EgorBo commented Aug 20, 2020 • edited Loading

EgorBo commented Aug 31, 2020 • edited Loading

AndyAyersMS commented Oct 21, 2020

jakobbotsch commented Aug 16, 2024

EgorBo commented Aug 20, 2020 •

edited

Loading

EgorBo commented Aug 31, 2020 •

edited

Loading