[Arm64] Implement ASIMD Extract Insert ExtractVector64 ExtractVector128 #35030

@echesakov echesakov commented Apr 15, 2020

  • This implements Extract, Insert, ExtractVector64 and ExtractVector128 intrinsics.

  • This also implements a way to generate a fallback mechanism for intrinsics accepting an immediate operand when the operand is not constant.

  • This renames NoContainment flag to SupportsContainment on Arm64 (presumably, there should be fewer intrinsics supporting containment analysis so it makes more sense to have NoContainment as default)

  • This removes ival column from hwintrinsiclistarm64.h table and the corresponding field in HWIntrinsicInfo struct.

  • The functionality of Insert and Extract for Vector64<double>, Vector64<long> and Vector64<ulong> will be implemented by CreateScalar() and ToScalar() methods so I removed those from the API surface.

Fixes #34228 and fixes #24588, contributes to #24794 (ExtractVector64 and ExtractVector128)

Below are some examples of the code generated for the fallback "switch" table.

ExtractVector64(Vector64<byte>, Vector64<byte>, ubyte)

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector64(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )   simd8  ->  [fp+0x28]   HFA(double)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M23204_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]
                                                ;; bbWeight=1    PerfScore 3.50
G_M23204_IG02:
        FD4017A0          ldr     d0, [fp,#40]
        FD400FB0          ldr     d16, [fp,#24]
        53001C00          uxtb    w0, w0
        7100201F          cmp     w0, #8
        540002A2          bhs     G_M23204_IG12
        10000061          adr     x1, [G_M23204_IG03]
        8B000C21          add     x1, x1, x0, LSL #3
        D61F0020          br      x1
                                                ;; bbWeight=1    PerfScore 8.50
G_M23204_IG03:
        2E100000          ext     v0.8b, v0.8b, v16.8b, #0
        1400000E          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG04:
        2E100800          ext     v0.8b, v0.8b, v16.8b, #1
        1400000C          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG05:
        2E101000          ext     v0.8b, v0.8b, v16.8b, #2
        1400000A          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG06:
        2E101800          ext     v0.8b, v0.8b, v16.8b, #3
        14000008          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG07:
        2E102000          ext     v0.8b, v0.8b, v16.8b, #4
        14000006          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG08:
        2E102800          ext     v0.8b, v0.8b, v16.8b, #5
        14000004          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG09:
        2E103000          ext     v0.8b, v0.8b, v16.8b, #6
        14000002          b       G_M23204_IG11
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG10:
        2E103800          ext     v0.8b, v0.8b, v16.8b, #7
                                                ;; bbWeight=1    PerfScore 1.00
G_M23204_IG11:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M23204_IG12:
        97FE5A87          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt
                                                ;; bbWeight=0    PerfScore 0.00

; Total bytes of code 124, prolog size 8, PerfScore 41.40, (MethodHash=aa85a55b) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector64(System.Runtime.Intrinsics.Vector64`1[Byte],System.Runtime.Intrinsics.Vector64`1[Byte],ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; ============================================================
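
The dispatch sequence in IG02 above (`adr`, `add ..., LSL #3`, `br`) relies on every case body except the last being exactly two 4-byte instructions (`ext` followed by `b`). The last case (IG10) only needs the `ext` because it falls through to the epilog. A minimal sketch of that address arithmetic follows; the names `tryCaseAddress`, `kInstrSizeBytes` and `kCaseSizeBytes` are made up for illustration and are not part of the JIT.

```cpp
#include <cassert>
#include <cstdint>

// A64 instructions are fixed size; each non-final case in the listing is
// `ext ...` followed by `b <epilog>`. These constants are illustrative only.
constexpr std::uint64_t kInstrSizeBytes = 4;
constexpr std::uint64_t kCaseSizeBytes  = 2 * kInstrSizeBytes; // 8, hence LSL #3

// Models `cmp w0, #caseCount; bhs <throw>; adr x1, <table>; add x1, x1, x0, LSL #3`.
// Returns false where the generated code would branch to the throw helper.
bool tryCaseAddress(std::uint64_t tableBase, std::uint32_t imm, std::uint32_t caseCount, std::uint64_t* target)
{
    assert(target != nullptr);
    if (imm >= caseCount)
    {
        return false; // out of range: ArgumentOutOfRangeException path
    }
    *target = tableBase + imm * kCaseSizeBytes;
    return true;
}
```

With a hypothetical table base of 0x1000 and eight cases, immediate 7 maps to 0x1038, which matches the 56-byte distance between IG03 and IG10 in the listing.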

ExtractVector64(Vector64<float>, Vector64<float>, ubyte)

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector64(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],ubyte):System.Runtime.Intrinsics.Vector64`1[Single]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )   simd8  ->  [fp+0x28]   HFA(double)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M6420_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        FD0017A0          str     d0, [fp,#40]
        FD000FA1          str     d1, [fp,#24]
                                                ;; bbWeight=1    PerfScore 3.50
G_M6420_IG02:
        FD4017A0          ldr     d0, [fp,#40]
        FD400FB0          ldr     d16, [fp,#24]
        53001C00          uxtb    w0, w0
        7100081F          cmp     w0, #2
        540000E2          bhs     G_M6420_IG06
        35000060          cbnz    w0, G_M6420_IG04
                                                ;; bbWeight=1    PerfScore 7.00
G_M6420_IG03:
        2E100000          ext     v0.8b, v0.8b, v16.8b, #0
        14000002          b       G_M6420_IG05
                                                ;; bbWeight=1    PerfScore 2.00
G_M6420_IG04:
        2E102000          ext     v0.8b, v0.8b, v16.8b, #4
                                                ;; bbWeight=1    PerfScore 1.00
G_M6420_IG05:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M6420_IG06:
        97FE57CD          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt
                                                ;; bbWeight=0    PerfScore 0.00

; Total bytes of code 68, prolog size 8, PerfScore 22.30, (MethodHash=821fe6eb) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector64(System.Runtime.Intrinsics.Vector64`1[Single],System.Runtime.Intrinsics.Vector64`1[Single],ubyte):System.Runtime.Intrinsics.Vector64`1[Single]
; ============================================================

ExtractVector128(Vector128<double>, Vector128<double>, ubyte)

; Assembly listing for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00    ] (  3,  3   )  simd16  ->  [fp+0x20]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V01 arg1         [V01    ] (  3,  3   )  simd16  ->  [fp+0x10]   HFA(simd16)  do-not-enreg[XS] addr-exposed
;  V02 arg2         [V02,T00] (  3,  3   )   ubyte  ->   x0
;# V03 OutArgs      [V03    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T01] (  3,  3   )     int  ->   x0         "CSE - aggressive"
;
; Lcl frame size = 32

G_M55355_IG01:
        A9BD7BFD          stp     fp, lr, [sp,#-48]!
        910003FD          mov     fp, sp
        3D800BA0          str     q0, [fp,#32]
        3D8007A1          str     q1, [fp,#16]
                                                ;; bbWeight=1    PerfScore 3.50
G_M55355_IG02:
        3DC00BB0          ldr     q16, [fp,#32]
        3DC007B1          ldr     q17, [fp,#16]
        53001C00          uxtb    w0, w0
        7100081F          cmp     w0, #2
        54000102          bhs     G_M55355_IG07
        35000060          cbnz    w0, G_M55355_IG04
                                                ;; bbWeight=1    PerfScore 7.00
G_M55355_IG03:
        6E110210          ext     v16.16b, v16.16b, v17.16b, #0
        14000002          b       G_M55355_IG05
                                                ;; bbWeight=1    PerfScore 2.00
G_M55355_IG04:
        6E114210          ext     v16.16b, v16.16b, v17.16b, #8
                                                ;; bbWeight=1    PerfScore 1.00
G_M55355_IG05:
        4EB01E00          mov     v0.16b, v16.16b
                                                ;; bbWeight=1    PerfScore 0.50
G_M55355_IG06:
        A8C37BFD          ldp     fp, lr, [sp],#48
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M55355_IG07:
        97FE5F66          bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
        D43E0000          bkpt
                                                ;; bbWeight=0    PerfScore 0.00

; Total bytes of code 72, prolog size 8, PerfScore 23.20, (MethodHash=721027c4) for method System.Runtime.Intrinsics.Arm.AdvSimd:ExtractVector128(System.Runtime.Intrinsics.Vector128`1[Double],System.Runtime.Intrinsics.Vector128`1[Double],ubyte):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================
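
Across the three listings the element index is range-checked against the element count (8 for bytes in a 64-bit vector, 2 for floats in a 64-bit vector, 2 for doubles in a 128-bit vector) and then scaled by the element size to form the byte immediate of `ext` (#0..#7, #0/#4, #0/#8). The sketch below is a software model of that behavior, not JIT code; `extBytes` and `extByteIndex` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Software model of the A64 `ext` instruction on N-byte vectors: take N
// consecutive bytes from the concatenation of lower and upper, starting at
// byteIndex within lower.
template <std::size_t N>
std::array<std::uint8_t, N> extBytes(const std::array<std::uint8_t, N>& lower,
                                     const std::array<std::uint8_t, N>& upper,
                                     std::size_t byteIndex)
{
    assert(byteIndex < N);
    std::array<std::uint8_t, N> result{};
    for (std::size_t i = 0; i < N; ++i)
    {
        const std::size_t src = byteIndex + i;
        result[i] = (src < N) ? lower[src] : upper[src - N];
    }
    return result;
}

// Hypothetical helper: range-check an element index the way `cmp`/`bhs` does,
// then scale it to the byte immediate that `ext` expects.
template <std::size_t VectorBytes, std::size_t ElemBytes>
std::size_t extByteIndex(std::size_t elemIndex)
{
    constexpr std::size_t elemCount = VectorBytes / ElemBytes;
    if (elemIndex >= elemCount)
    {
        throw std::out_of_range("element index out of range");
    }
    return elemIndex * ElemBytes;
}
```

For example, `extByteIndex<8, 4>(1)` yields 4 and `extByteIndex<16, 8>(1)` yields 8, the same immediates that appear in the second and third listings.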

…h hwintrinsiclistarm64.h namedintrinsiclist.h valuenumfuncs.h
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 15, 2020
// NoContainment
// the intrinsic cannot be handled by containment,
// all the intrinsics that have explicit memory load/store semantics should have this flag
HW_Flag_NoContainment = 0x40,
Member

This is misordered with respect to the rest of the flags.

Contributor Author

Yep, I meant to update this before marking this PR as ready for review.

{
HWIntrinsicImmOpHelper helper(this, intrin.op2, node);

for (helper.EmitAtFirst(); !helper.Done(); helper.EmitAfterCase())
Member

Could I get a brief explanation of why the ARM64 jmp table is so much more involved than the x86 one?

For x86, we just needed a small helper method that took in the intrinsic, registers, and a lambda that emitted the contents of each case statement: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp#L1068-L1120

Contributor Author

I wouldn't call this approach more involved - here instead you need a helper class and no lambda at all. This is basically transforming .Select(Action<int> func) to foreach (int imm in Immediates()) { /* do your action on imm */ }
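
The difference between the two styles can be sketched in isolation. This is a simplified stand-in, not the JIT's `HWIntrinsicImmOpHelper`: `forEachImmediate`, `ImmCaseCursor`, `emitWithCallback` and `emitWithLoop` are hypothetical names, and "emitting" is modeled as appending strings.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Callback style: the helper owns the loop and invokes the caller's lambda
// once per possible immediate value.
void forEachImmediate(int upperBound, const std::function<void(int)>& emitCase)
{
    for (int imm = 0; imm < upperBound; ++imm)
    {
        emitCase(imm);
    }
}

// Loop style: the caller owns the loop; the helper only tracks which
// immediate value is current.
class ImmCaseCursor
{
public:
    explicit ImmCaseCursor(int upperBound) : m_upperBound(upperBound), m_imm(0)
    {
        assert(upperBound > 0);
    }
    bool Done() const { return m_imm >= m_upperBound; }
    void Next() { ++m_imm; }
    int  Value() const { return m_imm; }

private:
    int m_upperBound;
    int m_imm;
};

std::vector<std::string> emitWithCallback(int upperBound)
{
    std::vector<std::string> out;
    forEachImmediate(upperBound, [&out](int imm) { out.push_back("ext #" + std::to_string(imm)); });
    return out;
}

std::vector<std::string> emitWithLoop(int upperBound)
{
    std::vector<std::string> out;
    for (ImmCaseCursor cur(upperBound); !cur.Done(); cur.Next())
    {
        out.push_back("ext #" + std::to_string(cur.Value()));
    }
    return out;
}
```

Both functions produce the same sequence of cases; what changes is whether the per-case logic lives in a lambda or in the body of a `for` loop.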

Below are my ideas why I did it this way.

First, branching on arm64 could potentially be optimized in many different ways (e.g. because all instructions are fixed size).
Also, branching on non-zero (when imm can only be 0 or 1) is a special case that doesn't require an additional general-purpose register, and I thought it would be nice to generate more optimal code in this case.
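
As a rough illustration of that special case, the choice between the two dispatch shapes seen in the listings (`cbnz` for two cases, `adr`/`add`/`br` through `x1` for eight) could be modeled as below. `ImmDispatch`, `DispatchPlan` and `planDispatch` are hypothetical names, and the threshold is an assumption read off the listings rather than the JIT's actual rule.

```cpp
#include <cassert>

// Two dispatch shapes visible in the generated code above.
enum class ImmDispatch
{
    CompareAndBranch, // single `cbnz`, no extra register
    JumpTable         // computed branch through a scratch register
};

struct DispatchPlan
{
    ImmDispatch kind;
    bool        needsScratchRegister;
};

// Assumption for this sketch: with at most two cases (imm is 0 or 1) a single
// branch-on-non-zero is enough; otherwise a computed branch is used.
DispatchPlan planDispatch(int caseCount)
{
    assert(caseCount > 0);
    if (caseCount <= 2)
    {
        return DispatchPlan{ImmDispatch::CompareAndBranch, false};
    }
    return DispatchPlan{ImmDispatch::JumpTable, true};
}
```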

Second, having a lambda instead leads to repetitive code when you first need to check if immOp is const then call the lambda with ival. Otherwise (if it's not const), you call the helper. While with this approach you define and use the code generation logic only once - in a loop - hiding all the details behind HWIntrinsicImmOpHelper. Actually, I implemented the approach with template function first but I didn't like how the code looked - especially for AdvSimd_Insert case - since we need different emitter functions depending on the base type.

I don't like the fact that we declare but NOT define template function in codegen.h. IMO, template functions if used should be defined in a header file.

Member

First, branching on arm64 could potentially be optimized in many different ways (e.g. because all instructions are fixed size).
Also, branching on non-zero (when imm can only be 0 or 1) is a special case that doesn't require an additional general-purpose register, and I thought it would be nice to generate more optimal code in this case.

For these two bits, I don't think the optimization is that important. This is a fallback case meant for debuggers and reflection invocation.
Likewise on x86, the branching is already "optimal" as every case is exactly the same number of bytes (so it's just a simple baseAddress + index * caseSize followed by a jump).

Second, having a lambda instead leads to repetitive code when you first need to check if immOp is const then call the lambda with ival. Otherwise (if it's not const), you call the helper. While with this approach you define and use the code generation logic only once - in a loop - hiding all the details behind HWIntrinsicImmOpHelper. Actually, I implemented the approach with template function first but I didn't like how the code looked - especially for AdvSimd_Insert case - since we need different emitter functions depending on the base type.

I think this is just a trade-off between declaring if (const) { lambda } else { emitJumpTable(lambda) } and declaring for () { similarLogicToLambda }. You still have to redeclare some logic in all the same places; it's just what is redeclared that differs.

I don't like the fact that we declare but NOT define template function in codegen.h. IMO, template functions if used should be defined in a header file.

It was just placed in the cpp file since that is the only place it will ever be used from, similar to many other functions that aren't meant to be generally reusable (or aren't yet).

Contributor

Just to add my two cents - personally, I find lambdas problematic both to understanding and readability of the code, but also to debugging (though the latter will presumably improve over time). I find the approach that Egor has taken here to be quite understandable and readable, though I might make minor changes to the names.

Member

I'm actually the opposite and find the new code more complex, but it's not terribly so, and I was mainly interested in why we are differing.
Ideally we'd share these types of constructs as much as possible, rather than having the ARM and x86 code paths drastically differ.

… to avoid ifdef-s at places where this function is used in hwintrinsic.h hwintrinsicarm64.cpp hwintrinsicxarch.cpp
@CarolEidt CarolEidt left a comment

Some comments, questions and suggestions.

@@ -541,7 +537,7 @@ GenTree* Compiler::getArgForHWIntrinsic(var_types argType, CORINFO_CLASS_HANDLE
// add a GT_HW_INTRINSIC_CHK node for non-full-range imm-intrinsic, which would throw ArgumentOutOfRangeException
// when the imm-argument is not in the valid range
//
-GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand)
+GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand, int immUpperBound)
Contributor

The new argument should be documented in the header comment.

@@ -541,7 +537,7 @@ GenTree* Compiler::getArgForHWIntrinsic(var_types argType, CORINFO_CLASS_HANDLE
// add a GT_HW_INTRINSIC_CHK node for non-full-range imm-intrinsic, which would throw ArgumentOutOfRangeException
// when the imm-argument is not in the valid range
//
-GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand)
+GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand, int immUpperBound)
Contributor

The header comment needs to be updated for this additional argument.

@@ -315,6 +335,102 @@ struct HWIntrinsicInfo
}
};

#ifdef TARGET_ARM64

struct HWIntrinsic final
Contributor

Can you explain why this wrapper struct is needed and how it is used? It doesn't seem necessary, and (to me) just obfuscates the creation logic. In any case, comments are needed to explain what this is for.

Contributor Author

The wrapper is used when we want to access the operands of GenTreeHWIntrinsic node.

Otherwise, the code that does lookup:

        op1 = node->gtGetOp1();
        op2 = node->gtGetOp2();

        assert(op1 != nullptr);

        if (op1->OperIsList())
        {
            assert(op2 == nullptr);

            GenTreeArgList* list = op1->AsArgList();
            op1                  = list->Current();
            list                 = list->Rest();
            op2                  = list->Current();
            list                 = list->Rest();
            op3                  = list->Current();

            assert(list->Rest() == nullptr);

            numOperands = 3;
        }
        else if (op2 != nullptr)
        {
            numOperands = 2;
        }
        else
        {
            numOperands = 1;
        }

would need to be repeated in Lower, LSRA, CodeGen and, perhaps, other places.

I had this wrapper in CodeGen originally but here I decided to extend its use to the other places.

As an alternative, I can place this code directly in GenTreeHWIntrinsic (or even GenTreeJitIntrinsic).
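
The idea of decoding the operands once and sharing the result between phases can be sketched with simplified stand-in types. `MockNode` and `OperandView` below are hypothetical and much simpler than `GenTreeHWIntrinsic` and the `HWIntrinsic` wrapper in this PR; in particular the operands are held in a plain vector rather than an argument list.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for a JIT tree node; NOT the real GenTree type.
struct MockNode
{
    int                          id;
    std::vector<const MockNode*> operands; // one to three operands in this sketch
};

// Hypothetical wrapper: decodes the operands once at construction so that
// every consumer sees the same op1/op2/op3/numOperands view.
struct OperandView
{
    const MockNode* op1         = nullptr;
    const MockNode* op2         = nullptr;
    const MockNode* op3         = nullptr;
    int             numOperands = 0;

    explicit OperandView(const MockNode& node)
    {
        numOperands = static_cast<int>(node.operands.size());
        assert(numOperands >= 1 && numOperands <= 3);

        op1 = node.operands[0];
        if (numOperands >= 2)
        {
            op2 = node.operands[1];
        }
        if (numOperands == 3)
        {
            op3 = node.operands[2];
        }
    }
};
```

Each phase that needs the operands would then construct an `OperandView` instead of repeating the lookup logic.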

Contributor

I think it might be cleaner to put it on one of the GenTree nodes.

Contributor Author

Okay, I can try this. Would you object to me doing this as a separate PR and leaving the wrapper as is here?

Contributor

No objection.

@echesakov echesakov closed this Apr 21, 2020
@echesakov echesakov reopened this Apr 21, 2020
@echesakov echesakov marked this pull request as ready for review April 21, 2020 20:22
@echesakov
Contributor Author

@CarolEidt @tannergooding I believe I addressed all your comments and suggestions (except the one about the wrapper struct - I asked if this could be part of a separate PR). Can you please take a look when you have time?

@CarolEidt CarolEidt left a comment

LGTM - thanks!

@echesakov echesakov merged commit 32dd7d4 into dotnet:master Apr 22, 2020
@echesakov echesakov deleted the Arm64-ASIMD-Extract-Insert-ExtractVector64-ExtractVector128 branch April 22, 2020 18:01
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation