
Enable promotion of structs passed in multiple registers with matching fields #37924

Closed
CarolEidt opened this issue Jun 15, 2020 · 6 comments · Fixed by #39326
Assignees: CarolEidt
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: 6.0.0

Comments

@CarolEidt
Contributor

CarolEidt commented Jun 15, 2020

With #36862 we have support for enregistering structs that are returned in multiple registers, but structs that are passed in multiple registers are still sometimes disabled from being enregistered, even if the number of fields matches the number of registers. This includes at least some HFAs. In particular, Compiler::StructPromotionHelper::ShouldPromoteStructVar() will only return true for a parameter if there are 1 or 2 fields, and will return false on x64/ux for some cases of 2 fields where the offset of the second field is not 8 bytes.
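For a concrete (hypothetical, not taken from the issue) illustration: on ARM64, a struct of up to four identical floating-point fields is an HFA and is passed in floating-point registers, so with promotion each field can stay in its own register:

```csharp
// Hypothetical example of an HFA (Homogeneous Floating-point Aggregate).
// On ARM64 this struct is passed in four float registers (s0-s3);
// if the JIT promotes it, each field can live in its own register
// instead of the argument being homed to the stack on entry.
public struct Vector4F
{
    public float X, Y, Z, W;
}

public static class HfaDemo
{
    // With promotion, this body can compile to a few register-to-register
    // floating-point adds; without it, each field is reloaded from memory.
    public static float Sum(Vector4F v) => v.X + v.Y + v.Z + v.W;
}
```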

category:cq
theme:structs
skill-level:expert
cost:medium

@CarolEidt CarolEidt added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jun 15, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jun 15, 2020
@CarolEidt CarolEidt removed the untriaged New issue has not been triaged by the area owner label Jun 15, 2020
@CarolEidt CarolEidt added this to the 5.0.0 milestone Jun 15, 2020
@CarolEidt CarolEidt self-assigned this Jun 18, 2020
@CarolEidt CarolEidt modified the milestones: 5.0.0, 6.0.0 Jul 15, 2020
CarolEidt added a commit to CarolEidt/runtime that referenced this issue Oct 6, 2020
Allow HFAs and other structs with matching fields and registers.

Contributes to dotnet#37924
CarolEidt added a commit that referenced this issue Oct 7, 2020
* Allow enregistering more structs args

Allow HFAs and other structs with matching fields and registers.

Contributes to #37924
@CarolEidt
Contributor Author

Although #39326 addressed some cases, on Arm64 additional changes are needed for non-HFA multireg args.
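A hypothetical example of the remaining Arm64 case: a 16-byte struct of integer fields is not an HFA, but is still passed in multiple registers:

```csharp
// Hypothetical non-HFA multi-register argument on Arm64: a 16-byte
// struct of integer fields is passed in two general-purpose registers
// (x0/x1), a shape not yet covered by the promotion work above.
public struct LongPair
{
    public long First;
    public long Second;
}

public static class NonHfaDemo
{
    public static long Sum(LongPair p) => p.First + p.Second;
}
```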

@CarolEidt CarolEidt reopened this Oct 7, 2020
@Sergio0694
Contributor

Hello, I've stumbled upon a problem that I think is related to this issue (and possibly also #13417). In my case, the codegen the JIT produces when using a struct enumerator is extremely slow, due to tons of moves back and forth from the struct instance instead of the various fields simply being enregistered. I'm commenting here to avoid opening a possible duplicate; let me know if I should create a separate issue instead and I'll be happy to move this comment there 🙂

I'm working on a Span2D<T> type (here), and I have the following enumerator type. Note that I'm using a Span<T> to hold a ref T and an int field (in this case representing the height), so the Length property of that Span<T> doesn't actually represent the total number of items to access; the enumerator traverses a 2D memory area with a given height, width and pitch:

public ref struct Enumerator // Defined within Span2D<T>
{
    private readonly Span<T> span; // target ref and height of the 2D area
    private readonly int width;
    private readonly int pitch; // distance between consecutive rows (0 if rows are contiguous)
    private int x; // initially set to -1 in constructor (omitted here)
    private int y;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool MoveNext()
    {
        int x = this.x + 1;

        if (x < this.width)
        {
            this.x = x;

            return true;
        }

        this.x = 0;

        return this.y++ < (this.span.Length - 1);
    }

    public readonly ref T Current
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            ref T r0 = ref MemoryMarshal.GetReference(this.span);
            nint index = ((nint)(uint)this.y * (nint)(uint)(this.width + this.pitch)) + (nint)(uint)this.x;

            return ref Unsafe.Add(ref r0, index);
        }
    }
}

I noticed using this type in a foreach loop was almost 3x slower than just manually doing two for loops and indexing the Span2D<T> type directly, so I wrote a small method taking the enumerator directly to inspect the codegen:

private static int SumForeachTest(Span2D<int>.Enumerator span)
{
    int sum = 0;

    while (span.MoveNext())
    {
        sum += span.Current;
    }

    return sum;
}

And got this result (added some comments next to the assembly as I was trying to figure out what was going on).
The JIT dump was produced with disasmo and a local checked build of CoreCLR from the master branch (from like 2 weeks ago):

; Assembly listing for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 14, 96   )   byref  ->  rcx         ld-addr-op
;  V01 loc0         [V01,T04] (  4,  6   )     int  ->  rax        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T06] (  2,  4   )    long  ->   r8         "Inline stloc first use temp"
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;* V05 tmp3         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V06 tmp4         [V06,T02] (  2,  8   )    bool  ->  rdx         "Inline return value spill temp"
;  V07 tmp5         [V07,T01] (  3, 20   )     int  ->  rdx         "Inline stloc first use temp"
;  V08 tmp6         [V08,T03] (  3, 12   )     int  ->  rdx         "Inline stloc first use temp"
;  V09 tmp7         [V09,T05] (  2,  4   )   byref  ->  rdx         V05._pointer(offs=0x00) P-INDEP "field V05._pointer (fldOffset=0x0)"
;* V10 tmp8         [V10    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;
; Lcl frame size = 0

G_M35757_IG01:
						;; bbWeight=1    PerfScore 0.00
START:
    xor      eax, eax                   ; eax = sum = 0
    jmp      SHORT MOVE_NEXT_COLUMN
						;; bbWeight=1    PerfScore 2.25

INDEXING:
    mov      rdx, bword ptr [rcx]       ; rdx = e.ByRef<T>
    mov      r8d, dword ptr [rcx+28]    ; r8d = e.y
    mov      r9d, dword ptr [rcx+16]    ; r9d = e.width
    add      r9d, dword ptr [rcx+20]    ; r9d += e.pitch
    mov      r9d, r9d                   ; (nint)(uint)r9d
    imul     r8, r9                     ; r8 = y * (e.width + e.pitch)
    mov      r9d, dword ptr [rcx+24]    ; r9d = e.x
    add      r8, r9                     ; r8 += r9
    add      eax, dword ptr [rdx+4*r8]  ; eax += *(e.ByRef<T> + r8)
						;; bbWeight=2    PerfScore 29.00
MOVE_NEXT_COLUMN:
    mov      edx, dword ptr [rcx+24]    ; edx = e.x
    inc      edx                        ; x++
    cmp      edx, dword ptr [rcx+16]    ; if (x >= e.width)
    jge      SHORT MOVE_NEXT_ROW
						;; bbWeight=8    PerfScore 42.00
MOVE_NEXT_COLUMN_OK:
    mov      dword ptr [rcx+24], edx    ; e.x = x
    jmp      SHORT INDEXING
						;; bbWeight=4    PerfScore 12.00
MOVE_NEXT_ROW:
    xor      edx, edx                   ; edx = 0
    mov      dword ptr [rcx+24], edx    ; e.x = edx
    mov      edx, dword ptr [rcx+28]    ; edx = e.y
    lea      r8d, [rdx+1]               ; r8d = edx + 1
    mov      dword ptr [rcx+28], r8d    ; e.y = r8d
    mov      r8d, dword ptr [rcx+8]     ; r8d = e.height
    dec      r8d                        ; r8d--
    cmp      edx, r8d                   ; edx < r8d ?
    setl     dl
    movzx    rdx, dl                    ; rdx = edx < r8d
    test     edx, edx                   ; if (rdx)
    jne      SHORT INDEXING
						;; bbWeight=4    PerfScore 39.00
RETURN:
    ret
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 89, prolog size 0, PerfScore 134.15, (MethodHash=327a7452) for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int

I've noticed that the resulting codegen is not enregistering the various fields; instead it reloads them from the input struct on every access (so it's reading from the stack a whole lot of times, which explains the huge slowdown), and, I guess for the same reason, it also writes values back to the input struct every time it modifies them. I would've expected the JIT to be able to just decompose the struct into the available registers and stop using the stack entirely for this method.

I saw that there's some work related to stack promotion planned for .NET 6, but reading the document, it seemed to be more about small structs that fit into one or two registers rather than much larger structs such as this one?

@AndyAyersMS
Member

You are running into two different limits in the current stack promotion logic:

  • only structs with <= 4 fields can be promoted
  • struct fields must (generally) be simple types and not other structs (that is, there is no recursive promotion)

The first one is easy to experiment with. Last time I looked, it seemed like if we increased the field count limit, we would also need to re-examine the heuristics, as overall there was quite a bit of code size growth to little benefit. Might be worth retrying this and seeing if anything's changed.

The second one seemingly requires more substantial changes. I've seen this pattern of including Span in a struct in a few places including some of the framework code, so it might be worth considering that as a special case.
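As a hypothetical sketch (type names are illustrative, not from the JIT sources), the two limits rule out shapes like these:

```csharp
using System;

// Hypothetical illustrations of the two promotion limits described above.

// (1) More than 4 fields: rejected by the field-count limit.
public struct FiveInts
{
    public int A, B, C, D, E;
}

// (2) A field that is itself a struct (here Span<int>): rejected
// because there is no recursive promotion of nested struct fields.
public ref struct HoldsSpan
{
    public Span<int> Data; // nested struct field blocks promotion
    public int Index;
}
```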

@CarolEidt is looking over struct-related issues and can comment further on likely .NET 6 work.

@Sergio0694
Contributor

Hey @AndyAyersMS, thanks for the info!
Glad to hear the hard limit on struct field count is potentially the easier one to tackle - I'm only using Span<T> here to cheat and store a custom ByReference<T> + int pair of values. Once .NET 6 and C# 10 drop (#32060), I'll just rewrite that type like this:

public ref struct Enumerator
{
    private readonly ref T reference;
    private readonly int height;
    private readonly int width;
    private readonly int pitch;
    private int x;
    private int y;

    // Enumeration code...
}

So in theory, if (1) the field-count limit for struct promotion were raised to at least 6, and (2) ref T fields were supported too, the whole thing should work just fine in this case even without any support for recursive struct promotion 🚀

Just my two cents here, but with custom struct enumerators becoming more and more widespread, I think it would be worth lifting some of these limitations to ensure that consumers relying on them for the nice foreach syntax won't have to pay such a high price for the abstraction and syntactic sugar. Ideally this should produce codegen about on par with a manual pair of for loops, which are much more verbose to write and make the code more error prone as well 😊

@CarolEidt
Contributor Author

Fixed with #43870

@Sergio0694
Contributor

Sergio0694 commented Dec 2, 2020

@CarolEidt I just tried again (now that #43870 is merged) with the snippet I previously posted in my comment above (this one), but the codegen is the same in this case, I imagine due to the other limitations Andy mentioned. Since both this issue and the other linked one (#13417) are now closed, I was wondering whether it would be useful for me to copy that comment into another open, related issue, or to open a new issue for it specifically? That way it would be easier to track when that particular thing is fixed; right now it would be hard to see, since this issue is already closed and that comment was not specifically related to it.

Of course, as mentioned above, assuming .NET 6 gets ref T fields, that struct will drop the Span<T> field and just use separate ref T and int fields, which should make promotion easier since there won't be nested structs for the JIT to handle. But I was still curious whether there are plans to handle cases such as this anyway. Thanks! 😊

EDIT: as a test, to exclude the Span<T> complication, I changed the type to just have a T* field and an int field. This gives me the exact same codegen, so I imagine #43870 wasn't meant to address structs this large?

As in, this type (both in the T* test, and with proper ref T fields in the future), would have 6 fields:

public unsafe ref struct Enumerator<T>
    where T : unmanaged
{
    private readonly T* ptr;
    private readonly int height;
    private readonly int width;
    private readonly int pitch;
    private int x;
    private int y;
}

@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021