[Arm64] Implement Vector64/128.CreateScalar() using AdvSimd.Insert #35300

echesakov
Contributor

No description provided.

@ghost

ghost commented Apr 22, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@@ -1369,6 +1370,11 @@ public static unsafe Vector128<ulong> Create(Vector64<ulong> lower, Vector64<ulo
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector128<byte> CreateScalar(byte value)
{
if (AdvSimd.IsSupported)
{
return AdvSimd.Insert(Vector128<byte>.Zero, 0, value);
Member

We'll need to special-case CreateScalarUnsafe since the upper bits don't have to be zeroed for it.
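
To illustrate the distinction raised in this comment, here is a minimal sketch that contrasts the two factory methods. It assumes only the public `Vector128.CreateScalar` and `Vector128.CreateScalarUnsafe` APIs; the class and helper names (`CreateScalarDemo`, `UpperElementsAreZero`) are hypothetical and not part of the PR.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical demo class, not code from the PR.
static class CreateScalarDemo
{
    // Returns true when every element above index 0 is zero.
    public static bool UpperElementsAreZero(Vector128<byte> v)
    {
        for (int i = 1; i < Vector128<byte>.Count; i++)
        {
            if (v.GetElement(i) != 0)
            {
                return false;
            }
        }
        return true;
    }

    public static void Run()
    {
        // CreateScalar: element 0 holds the value, all other elements are zeroed.
        Vector128<byte> zeroed = Vector128.CreateScalar((byte)42);
        Console.WriteLine(zeroed.ToScalar());
        Console.WriteLine(UpperElementsAreZero(zeroed));

        // CreateScalarUnsafe: only element 0 is defined. The upper elements are
        // left uninitialized, so this sketch deliberately does not read them.
        Vector128<byte> loose = Vector128.CreateScalarUnsafe((byte)42);
        Console.WriteLine(loose.ToScalar());
    }
}
```

Because `CreateScalarUnsafe` makes no promise about the upper elements, an implementation is free to skip the zeroing step, which is why it would need a separate code path from the `AdvSimd.Insert` into `Zero` shown in the diff.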

Contributor Author

Yep, this work is tracked by #34485

@echesakov
Contributor Author

I would still like to collect the JitDisasm output for this - I remember seeing something weird for Vector64.CreateScalar() - will do it later.

@echesakov
Contributor Author

echesakov commented Apr 30, 2020

; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   ubyte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M19699_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M19699_IG02:
        4E010FF0          dup     v16.16b, wzr
        53001C00          uxtb    w0, w0
        4E011C10          ins     v16.b[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 4.00
G_M19699_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=a381b30c) for method System.Runtime.Intrinsics.Vector128:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(double):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  double  ->   d0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M6886_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M6886_IG02:
        4E080FF0          dup     v16.2d, xzr
        6E080410          ins     v16.d[0], v0.d[0]
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M6886_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=17a3e519) for method System.Runtime.Intrinsics.Vector128:CreateScalar(double):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(short):System.Runtime.Intrinsics.Vector128`1[Int16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   short  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M37120_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M37120_IG02:
        4E020FF0          dup     v16.8h, wzr
        13003C00          sxth    w0, w0
        4E021C10          ins     v16.h[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 4.00
G_M37120_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=87f66eff) for method System.Runtime.Intrinsics.Vector128:CreateScalar(short):System.Runtime.Intrinsics.Vector128`1[Int16]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[Int32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M42503_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M42503_IG02:
        4E040FF0          dup     v16.4s, wzr
        4E041C10          ins     v16.s[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M42503_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=56ed59f8) for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[Int32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[Int64]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M9853_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M9853_IG02:
        4E080FF0          dup     v16.2d, xzr
        4E081C10          ins     v16.d[0], x0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M9853_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=99c7d982) for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[Int64]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(byte):System.Runtime.Intrinsics.Vector128`1[SByte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    byte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M64149_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M64149_IG02:
        4E010FF0          dup     v16.16b, wzr
        13001C00          sxtb    w0, w0
        4E011C10          ins     v16.b[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 4.00
G_M64149_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=3757056a) for method System.Runtime.Intrinsics.Vector128:CreateScalar(byte):System.Runtime.Intrinsics.Vector128`1[SByte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(float):System.Runtime.Intrinsics.Vector128`1[Single]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->   d0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M28940_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M28940_IG02:
        4E040FF0          dup     v16.4s, wzr
        6E040410          ins     v16.s[0], v0.s[0]
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M28940_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=910d8ef3) for method System.Runtime.Intrinsics.Vector128:CreateScalar(float):System.Runtime.Intrinsics.Vector128`1[Single]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(ushort):System.Runtime.Intrinsics.Vector128`1[UInt16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  ushort  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M480_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M480_IG02:
        4E020FF0          dup     v16.8h, wzr
        53003C00          uxth    w0, w0
        4E021C10          ins     v16.h[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 4.00
G_M480_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=5a77fe1f) for method System.Runtime.Intrinsics.Vector128:CreateScalar(ushort):System.Runtime.Intrinsics.Vector128`1[UInt16]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[UInt32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M21746_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M21746_IG02:
        4E040FF0          dup     v16.4s, wzr
        4E041C10          ins     v16.s[0], w0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M21746_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=4a35ab0d) for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[UInt32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V03 tmp2         [V03,T01] (  2,  2   )  simd16  ->  d16         HFA(simd16)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0

G_M2664_IG01:
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M2664_IG02:
        4E080FF0          dup     v16.2d, xzr
        4E081C10          ins     v16.d[0], x0
        4EB01E00          mov     v0.16b, v16.16b
						;; bbWeight=1    PerfScore 3.50
G_M2664_IG03:
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=714bf597) for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[Int32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M25863_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M25863_IG02:
        0E040FE0          dup     v0.2s, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        4E041C00          ins     v0.s[0], w0
						;; bbWeight=1    PerfScore 6.00
G_M25863_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 12.70, (MethodHash=80a89af8) for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[Int32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(byte):System.Runtime.Intrinsics.Vector64`1[SByte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    byte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M12309_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M12309_IG02:
        0E010FE0          dup     v0.8b, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        13001C00          sxtb    w0, w0
        4E011C00          ins     v0.b[0], w0
						;; bbWeight=1    PerfScore 6.50
G_M12309_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=1802cfea) for method System.Runtime.Intrinsics.Vector64:CreateScalar(byte):System.Runtime.Intrinsics.Vector64`1[SByte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(float):System.Runtime.Intrinsics.Vector64`1[Single]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->   d0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->  d16         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M44268_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M44268_IG02:
        0E040FF0          dup     v16.2s, wzr
        FD000FB0          str     d16, [fp,#24]
        FD400FB0          ldr     d16, [fp,#24]
        6E040410          ins     v16.s[0], v0.s[0]
        1E604200          fmov    d0, d16
						;; bbWeight=1    PerfScore 6.50
G_M44268_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=b5c65313) for method System.Runtime.Intrinsics.Vector64:CreateScalar(float):System.Runtime.Intrinsics.Vector64`1[Single]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(ushort):System.Runtime.Intrinsics.Vector64`1[UInt16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  ushort  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M37504_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M37504_IG02:
        0E020FE0          dup     v0.4h, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        53003C00          uxth    w0, w0
        4E021C00          ins     v0.h[0], w0
						;; bbWeight=1    PerfScore 6.50
G_M37504_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=68536d7f) for method System.Runtime.Intrinsics.Vector64:CreateScalar(ushort):System.Runtime.Intrinsics.Vector64`1[UInt16]
; ============================================================

Collected JIT disassemblies with the changes rebased on top of latest master

; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[UInt32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M62450_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M62450_IG02:
        0E040FE0          dup     v0.2s, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        4E041C00          ins     v0.s[0], w0
						;; bbWeight=1    PerfScore 6.00
G_M62450_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 32, prolog size 8, PerfScore 12.70, (MethodHash=ab590c0d) for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[UInt32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   ubyte  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M20083_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M20083_IG02:
        0E010FE0          dup     v0.8b, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        53001C00          uxtb    w0, w0
        4E011C00          ins     v0.b[0], w0
						;; bbWeight=1    PerfScore 6.50
G_M20083_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=cedeb18c) for method System.Runtime.Intrinsics.Vector64:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(short):System.Runtime.Intrinsics.Vector64`1[Int16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   short  ->   x0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [sp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   simd8  ->  [fp+0x18]   HFA(double)  do-not-enreg[SF] "struct address for call/obj"
;  V03 tmp2         [V03,T02] (  2,  2   )   simd8  ->   d0         HFA(double)  ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16

G_M58336_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
						;; bbWeight=1    PerfScore 1.50
G_M58336_IG02:
        0E020FE0          dup     v0.4h, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        13003C00          sxth    w0, w0
        4E021C00          ins     v0.h[0], w0
						;; bbWeight=1    PerfScore 6.50
G_M58336_IG03:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
						;; bbWeight=1    PerfScore 2.00

; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=e95c1c1f) for method System.Runtime.Intrinsics.Vector64:CreateScalar(short):System.Runtime.Intrinsics.Vector64`1[Int16]
; ============================================================

There are multiple issues here:

  1. Redundant str/ldr-s with a SIMD register - this appears only in Vector64.CreateScalar():
str     d0, [fp,#24]
ldr     d0, [fp,#24]

The code is the worst for Vector64<float>.CreateScalar()

        0E040FF0          dup     v16.2s, wzr
        FD000FB0          str     d16, [fp,#24]
        FD400FB0          ldr     d16, [fp,#24]
        6E040410          ins     v16.s[0], v0.s[0]
        1E604200          fmov    d0, d16

or Vector64<ushort>.CreateScalar()

        0E020FE0          dup     v0.4h, wzr
        FD000FA0          str     d0, [fp,#24]
        FD400FA0          ldr     d0, [fp,#24]
        53003C00          uxth    w0, w0
        4E021C00          ins     v0.h[0], w0
  2. Unnecessary sign-/zero-extensions with byte, ubyte, short, ushort (the same as seen in "ARM64 intrinsic support for Vector64.Create() and Vector128.Create()" #35590):
uxtb    w0, w0
uxth    w0, w0
sxtb    w0, w0
sxth    w0, w0
  3. dup Vd.T, wzr seems to be used for code generation of Vector64/128.Zero, which I thought was fixed with "Implement Vector{Size}<T>.AllBitsSet" #33924 (cc @Gnbrkm41). I will follow up on this.
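
The reason the extensions in the second item are redundant can be sketched with a small model. This is only an illustration under stated assumptions: the class `LaneInsertModel` and its methods are hypothetical, and they model the byte-lane form of `ins`, `uxtb`, and `sxtb` in plain C# rather than invoking the instructions themselves.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical model, not code from the PR.
static class LaneInsertModel
{
    // Models `ins v.b[0], w`: only the low 8 bits of the 32-bit source reach the lane.
    public static Vector128<byte> InsertLowByte(uint w)
        => Vector128<byte>.Zero.WithElement(0, unchecked((byte)w));

    // Models `uxtb w, w`: keep the low 8 bits, clear the rest.
    public static uint ZeroExtendByte(uint w) => w & 0xFF;

    // Models `sxtb w, w`: replicate bit 7 into the upper 24 bits.
    public static uint SignExtendByte(uint w) => unchecked((uint)(int)(sbyte)w);

    // True when extending first makes no difference to the inserted lane.
    public static bool ExtensionIsRedundant(uint w)
    {
        Vector128<byte> direct = InsertLowByte(w);
        return direct.Equals(InsertLowByte(ZeroExtendByte(w)))
            && direct.Equals(InsertLowByte(SignExtendByte(w)));
    }
}
```

Both extensions change only bits 8 and above, and the lane insert discards those bits, so in this model the result is the same with or without the extension.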

cc @kunalspathak @BruceForstall

@echesakov echesakov marked this pull request as ready for review April 30, 2020 19:35
return AdvSimd.Insert(Vector64<byte>.Zero, 0, value);
}

return SoftwareFallback(value);
Member

Curious about wrapping the SoftwareFallback() in a static method. What is the advantage of doing it?

Contributor Author

Not sure, actually. I did this to be consistent with existing Vector128/256 implementations.
@tannergooding Do you know why?

Member

With the logic for all the various paths, the method is too large for the normal inlining heuristics to work, so we need to mark it AggressiveInlining (since the accelerated paths will generally be pretty small).
However, we don't necessarily want the SoftwareFallback to be inlined as that may not be beneficial.
Putting it in its own method prevents it from being inlined in the normal case and allows the JIT to decide if it is "too large or not" by itself.
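
The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `ScalarFactory` and `CreateScalarSketch` are hypothetical names, and the fallback body uses `WithElement` purely for brevity, which is an assumption about how a fallback could be written.

```csharp
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical class, not code from the PR.
static class ScalarFactory
{
    // The outer method is force-inlined so the small accelerated path
    // ends up directly at the call site.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector64<byte> CreateScalarSketch(byte value)
    {
        if (AdvSimd.IsSupported)
        {
            // Accelerated path: insert the value into element 0 of a zero vector.
            return AdvSimd.Insert(Vector64<byte>.Zero, 0, value);
        }

        return SoftwareFallback(value);

        // The fallback is a separate local function without AggressiveInlining,
        // so the JIT applies its normal heuristics when deciding whether to inline it.
        static Vector64<byte> SoftwareFallback(byte scalar)
        {
            return Vector64<byte>.Zero.WithElement(0, scalar);
        }
    }
}
```

On hardware without AdvSimd, the `IsSupported` check is false and only the call to the fallback remains, so the sketch behaves the same on any platform.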

Member

That makes sense.

Member

@kunalspathak kunalspathak left a comment

:shipit:

@echesakov echesakov merged commit 670bf21 into dotnet:master Apr 30, 2020
@echesakov echesakov deleted the Arm64-ASIMD-Vector64-Vector128-CreateScalar-Use-AdvSimd-Insert branch April 30, 2020 23:35
@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020