
Enable promotion of structs passed in multiple registers with matching fields #37924

Closed
CarolEidt opened this issue Jun 15, 2020 · 6 comments · Fixed by #39326
Assignees: CarolEidt
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: 6.0.0

Comments

@CarolEidt
Contributor

CarolEidt commented Jun 15, 2020

With #36862 we have support for enregistering structs that are returned in multiple registers, but structs that are passed in multiple registers are still sometimes disabled from being enregistered, even if the number of fields matches the number of registers. This includes at least some HFAs. In particular, Compiler::StructPromotionHelper::ShouldPromoteStructVar() will only return true for a parameter if there are 1 or 2 fields, and will return false on x64/ux for some cases of 2 fields where the offset of the second field is not 8 bytes.
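For a concrete (hypothetical, not taken from the issue) illustration: on ARM64, a struct of up to four identical floating-point fields is an HFA and is passed in floating-point registers, so with promotion each field can stay in its own register:

```csharp
// Hypothetical example of an HFA (Homogeneous Floating-point Aggregate).
// On ARM64 this struct is passed in four float registers (s0-s3);
// if the JIT promotes it, each field can live in its own register
// instead of the argument being homed to the stack on entry.
public struct Vector4F
{
    public float X, Y, Z, W;
}

public static class HfaDemo
{
    // With promotion, this body can compile to a few register-to-register
    // floating-point adds; without it, each field is reloaded from memory.
    public static float Sum(Vector4F v) => v.X + v.Y + v.Z + v.W;
}
```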

category:cq
theme:structs
skill-level:expert
cost:medium

@CarolEidt CarolEidt added tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jun 15, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jun 15, 2020
@CarolEidt CarolEidt removed the untriaged New issue has not been triaged by the area owner label Jun 15, 2020
@CarolEidt CarolEidt added this to the 5.0.0 milestone Jun 15, 2020
@CarolEidt CarolEidt self-assigned this Jun 18, 2020
@CarolEidt CarolEidt modified the milestones: 5.0.0, 6.0.0 Jul 15, 2020
CarolEidt added a commit to CarolEidt/runtime that referenced this issue Oct 6, 2020
Allow HFAs and other structs with matching fields and registers.

Contributes to dotnet#37924
CarolEidt added a commit that referenced this issue Oct 7, 2020
* Allow enregistering more structs args

Allow HFAs and other structs with matching fields and registers.

Contributes to #37924
@CarolEidt
Contributor Author

Although #39326 addressed some cases, on Arm64 additional changes are needed for non-HFA multireg args.
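A hypothetical example of the remaining Arm64 case: a 16-byte struct of integer fields is not an HFA, but is still passed in multiple registers:

```csharp
// Hypothetical non-HFA multi-register argument on Arm64: a 16-byte
// struct of integer fields is passed in two general-purpose registers
// (x0/x1), a shape not yet covered by the promotion work above.
public struct LongPair
{
    public long First;
    public long Second;
}

public static class NonHfaDemo
{
    public static long Sum(LongPair p) => p.First + p.Second;
}
```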

@CarolEidt CarolEidt reopened this Oct 7, 2020
@Sergio0694
Contributor

Hello, I've stumbled upon a problem that I think is related to this issue (and possibly also #13417). In my case, the codegen the JIT produces when using a struct enumerator is extremely slow, due to tons of moves back and forth from the struct instance instead of the various fields simply being enregistered. I'm commenting here to avoid opening a possible duplicate; let me know if I should create a separate issue instead and I'll be happy to move this comment there 🙂

I'm working on a Span2D<T> type (here), and I have the following enumerator type. Note that I'm using a Span<T> to hold a ref T and an int field (in this case representing the height), so the Length property of that Span<T> doesn't actually represent the total number of items to access; the enumerator traverses a 2D memory area with a given height, width and pitch:

public ref struct Enumerator // Defined within Span2D<T>
{
    private readonly Span<T> span; // target ref and height of the 2D area
    private readonly int width;
    private readonly int pitch; // distance between consecutive rows (0 if rows are contiguous)
    private int x; // initially set to -1 in constructor (omitted here)
    private int y;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public bool MoveNext()
    {
        int x = this.x + 1;

        if (x < this.width)
        {
            this.x = x;

            return true;
        }

        this.x = 0;

        return this.y++ < (this.span.Length - 1);
    }

    public readonly ref T Current
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            ref T r0 = ref MemoryMarshal.GetReference(this.span);
            nint index = ((nint)(uint)this.y * (nint)(uint)(this.width + this.pitch)) + (nint)(uint)this.x;

            return ref Unsafe.Add(ref r0, index);
        }
    }
}

I noticed using this type in a foreach loop was almost 3x slower than just manually doing two for loops and indexing the Span2D<T> type directly, so I wrote a small method taking the enumerator directly to inspect the codegen:

private static int SumForeachTest(Span2D<int>.Enumerator span)
{
    int sum = 0;

    while (span.MoveNext())
    {
        sum += span.Current;
    }

    return sum;
}

And got this result (added some comments next to the assembly as I was trying to figure out what was going on).
The JIT dump was produced with disasmo and a local checked build of CoreCLR from the master branch (from like 2 weeks ago):

; Assembly listing for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] ( 14, 96   )   byref  ->  rcx         ld-addr-op
;  V01 loc0         [V01,T04] (  4,  6   )     int  ->  rax        
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V03 tmp1         [V03,T06] (  2,  4   )    long  ->   r8         "Inline stloc first use temp"
;* V04 tmp2         [V04    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;* V05 tmp3         [V05    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V06 tmp4         [V06,T02] (  2,  8   )    bool  ->  rdx         "Inline return value spill temp"
;  V07 tmp5         [V07,T01] (  3, 20   )     int  ->  rdx         "Inline stloc first use temp"
;  V08 tmp6         [V08,T03] (  3, 12   )     int  ->  rdx         "Inline stloc first use temp"
;  V09 tmp7         [V09,T05] (  2,  4   )   byref  ->  rdx         V05._pointer(offs=0x00) P-INDEP "field V05._pointer (fldOffset=0x0)"
;* V10 tmp8         [V10    ] (  0,  0   )     int  ->  zero-ref    V05._length(offs=0x08) P-INDEP "field V05._length (fldOffset=0x8)"
;
; Lcl frame size = 0

G_M35757_IG01:
						;; bbWeight=1    PerfScore 0.00
START:
    xor      eax, eax                   ; eax = sum = 0
    jmp      SHORT MOVE_NEXT_COLUMN
						;; bbWeight=1    PerfScore 2.25

INDEXING:
    mov      rdx, bword ptr [rcx]       ; rdx = e.ByRef<T>
    mov      r8d, dword ptr [rcx+28]    ; r8d = e.y
    mov      r9d, dword ptr [rcx+16]    ; r9d = e.width
    add      r9d, dword ptr [rcx+20]    ; r9d += e.pitch
    mov      r9d, r9d                   ; (nint)(uint)r9d
    imul     r8, r9                     ; r8 = y * (e.width + e.pitch)
    mov      r9d, dword ptr [rcx+24]    ; r9d = e.x
    add      r8, r9                     ; r8 += r9
    add      eax, dword ptr [rdx+4*r8]  ; eax += *(e.ByRef<T> + r8)
						;; bbWeight=2    PerfScore 29.00
MOVE_NEXT_COLUMN:
    mov      edx, dword ptr [rcx+24]    ; edx = e.x
    inc      edx                        ; x++
    cmp      edx, dword ptr [rcx+16]    ; if (x >= e.width)
    jge      SHORT MOVE_NEXT_ROW
						;; bbWeight=8    PerfScore 42.00
MOVE_NEXT_COLUMN_OK:
    mov      dword ptr [rcx+24], edx    ; e.x = x
    jmp      SHORT INDEXING
						;; bbWeight=4    PerfScore 12.00
MOVE_NEXT_ROW:
    xor      edx, edx                   ; edx = 0
    mov      dword ptr [rcx+24], edx    ; e.x = edx
    mov      edx, dword ptr [rcx+28]    ; edx = e.y
    lea      r8d, [rdx+1]               ; r8d = edx + 1
    mov      dword ptr [rcx+28], r8d    ; e.y = r8d
    mov      r8d, dword ptr [rcx+8]     ; r8d = e.height
    dec      r8d                        ; r8d--
    cmp      edx, r8d                   ; edx < r8d ?
    setl     dl
    movzx    rdx, dl                    ; rdx = edx < r8d
    test     edx, edx                   ; if (rdx)
    jne      SHORT INDEXING
						;; bbWeight=4    PerfScore 39.00
RETURN:
    ret
						;; bbWeight=1    PerfScore 1.00

; Total bytes of code 89, prolog size 0, PerfScore 134.15, (MethodHash=327a7452) for method Span2DBenchmark.Benchmark:SumForeachTest(Enumerator[Int32]):int

I've noticed that the resulting codegen is not enregistering the various fields; instead it reloads them from the input struct on every access (so it's reading from the stack a whole lot of times, which explains the huge slowdown), and, I guess for the same reason, it also writes values back to the input struct every time it modifies them. I would've expected the JIT to be able to just decompose the struct into the available registers and stop using the stack entirely for this method.

I saw that there's some work related to stack promotion planned for .NET 6, but reading the document, it seemed to be more about small structs that fit into one or two registers rather than much larger structs such as this one?

@AndyAyersMS
Member

You are running into two different limits in the current stack promotion logic:

  • only structs with <= 4 fields can be promoted
  • struct fields must (generally) be simple types and not other structs (that is, there is no recursive promotion)

The first one is easy to experiment with. Last time I looked, it seemed like if we increased the field count limit, we would also need to re-examine the heuristics, as overall there was quite a bit of code size growth to little benefit. Might be worth retrying this and seeing if anything's changed.

The second one seemingly requires more substantial changes. I've seen this pattern of including Span in a struct in a few places including some of the framework code, so it might be worth considering that as a special case.
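As a hypothetical sketch (type names are illustrative, not from the JIT sources), the two limits rule out shapes like these:

```csharp
using System;

// Hypothetical illustrations of the two promotion limits described above.

// (1) More than 4 fields: rejected by the field-count limit.
public struct FiveInts
{
    public int A, B, C, D, E;
}

// (2) A field that is itself a struct (here Span<int>): rejected
// because there is no recursive promotion of nested struct fields.
public ref struct HoldsSpan
{
    public Span<int> Data; // nested struct field blocks promotion
    public int Index;
}
```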

@CarolEidt is looking over struct-related issues and can comment further on likely .NET 6 work.

@Sergio0694
Contributor

Hey @AndyAyersMS, thanks for the info!
Glad to hear the hard limit on struct field count is potentially the easier one to tackle - I'm only using Span<T> here to cheat and store a custom ByReference<T> + int pair of values. Once .NET 6 and C# 10 drop (#32060), I'll just rewrite that type like this:

public ref struct Enumerator
{
    private readonly ref T reference;
    private readonly int height;
    private readonly int width;
    private readonly int pitch;
    private int x;
    private int y;

    // Enumeration code...
}

So in theory, if (1) the field-count limit for struct promotion were raised to at least 6, and (2) ref T fields were supported too, the whole thing should work just fine in this case even without any support for recursive struct promotion 🚀

Just my two cents here, but with custom struct enumerators becoming more and more widespread, I think it would be worth lifting some of these limitations to ensure that consumers relying on them for the nice foreach syntax won't have to pay such a high price for the abstraction and syntactic sugar. Ideally this should produce codegen about on par with a manual pair of for loops, which are much more verbose to write and make the code more error prone as well 😊

@CarolEidt
Contributor Author

Fixed with #43870

@Sergio0694
Contributor

Sergio0694 commented Dec 2, 2020

@CarolEidt I just tried again (now that #43870 is merged) with the snippet I previously posted in my comment above (this one), but the codegen is the same in this case, I imagine due to the other limitations Andy mentioned. Since both this issue and the other linked one (#13417) are now closed, I was wondering whether it would be useful for me to copy that comment into another open, related issue, or to open a new issue for it specifically? That way it would be easier to track when that particular thing is fixed; right now it would be hard to see, since this issue is already closed and that comment was not specifically related to it.

Of course, as mentioned above, assuming .NET 6 gets ref T fields, that struct will drop the Span<T> field and just use separate ref T and int fields, which should make promotion easier since there won't be nested structs for the JIT to handle. But I was still curious whether there are plans to handle cases such as this anyway. Thanks! 😊

EDIT: as a test, to exclude the Span<T> complication, I changed the type to just have a T* field and an int field. This gives me the exact same codegen, so I imagine #43870 wasn't meant to address structs this large?

As in, this type (both in the T* test, and with proper ref T fields in the future), would have 6 fields:

public unsafe ref struct Enumerator<T>
    where T : unmanaged
{
    private readonly T* ptr;
    private readonly int height;
    private readonly int width;
    private readonly int pitch;
    private int x;
    private int y;
}

@ghost ghost locked as resolved and limited conversation to collaborators Jan 1, 2021