[Arm64] Implement Vector64/128.CreateScalar() using AdvSimd.Insert #35300
Tagging subscribers to this area: @tannergooding
@@ -1369,6 +1370,11 @@ public static unsafe Vector128<ulong> Create(Vector64<ulong> lower, Vector64<ulo
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe Vector128<byte> CreateScalar(byte value)
{
    if (AdvSimd.IsSupported)
    {
        return AdvSimd.Insert(Vector128<byte>.Zero, 0, value);
We'll need to special-case CreateScalarUnsafe
since the upper bits don't have to be zeroed for it.
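To make the distinction concrete, here is a minimal sketch of the two public APIs side by side. The class and method names `ScalarCreationSketch`, `Zeroed`, and `UpperUndefined` are hypothetical and exist only for illustration; `Vector128.CreateScalar` and `Vector128.CreateScalarUnsafe` are the real APIs under discussion.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ScalarCreationSketch
{
    // CreateScalar: element 0 holds the value and every remaining element
    // is guaranteed to be zero, so the upper bits must be cleared.
    public static Vector128<int> Zeroed(int value) => Vector128.CreateScalar(value);

    // CreateScalarUnsafe: only element 0 is defined. The remaining elements
    // are left uninitialized, so callers must not read them, and no zeroing
    // instruction is required.
    public static Vector128<int> UpperUndefined(int value) => Vector128.CreateScalarUnsafe(value);
}
```

Only element 0 of the `UpperUndefined` result should be read, for example through `ToScalar()`.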
Yep, this work is tracked by #34485
I would still like to collect the JitDisasm listings for this. I remember seeing something odd for Vector64.CreateScalar(); will do it later.
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ubyte -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M19699_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M19699_IG02:
4E010FF0 dup v16.16b, wzr
53001C00 uxtb w0, w0
4E011C10 ins v16.b[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 4.00
G_M19699_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=a381b30c) for method System.Runtime.Intrinsics.Vector128:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector128`1[Byte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(double):System.Runtime.Intrinsics.Vector128`1[Double]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) double -> d0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M6886_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M6886_IG02:
4E080FF0 dup v16.2d, xzr
6E080410 ins v16.d[0], v0.d[0]
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M6886_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=17a3e519) for method System.Runtime.Intrinsics.Vector128:CreateScalar(double):System.Runtime.Intrinsics.Vector128`1[Double]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(short):System.Runtime.Intrinsics.Vector128`1[Int16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) short -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M37120_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M37120_IG02:
4E020FF0 dup v16.8h, wzr
13003C00 sxth w0, w0
4E021C10 ins v16.h[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 4.00
G_M37120_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=87f66eff) for method System.Runtime.Intrinsics.Vector128:CreateScalar(short):System.Runtime.Intrinsics.Vector128`1[Int16]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[Int32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M42503_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M42503_IG02:
4E040FF0 dup v16.4s, wzr
4E041C10 ins v16.s[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M42503_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=56ed59f8) for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[Int32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[Int64]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) long -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M9853_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M9853_IG02:
4E080FF0 dup v16.2d, xzr
4E081C10 ins v16.d[0], x0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M9853_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=99c7d982) for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[Int64]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(byte):System.Runtime.Intrinsics.Vector128`1[SByte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) byte -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M64149_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M64149_IG02:
4E010FF0 dup v16.16b, wzr
13001C00 sxtb w0, w0
4E011C10 ins v16.b[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 4.00
G_M64149_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=3757056a) for method System.Runtime.Intrinsics.Vector128:CreateScalar(byte):System.Runtime.Intrinsics.Vector128`1[SByte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(float):System.Runtime.Intrinsics.Vector128`1[Single]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> d0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M28940_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M28940_IG02:
4E040FF0 dup v16.4s, wzr
6E040410 ins v16.s[0], v0.s[0]
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M28940_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=910d8ef3) for method System.Runtime.Intrinsics.Vector128:CreateScalar(float):System.Runtime.Intrinsics.Vector128`1[Single]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(ushort):System.Runtime.Intrinsics.Vector128`1[UInt16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ushort -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M480_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M480_IG02:
4E020FF0 dup v16.8h, wzr
53003C00 uxth w0, w0
4E021C10 ins v16.h[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 4.00
G_M480_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 10.70, (MethodHash=5a77fe1f) for method System.Runtime.Intrinsics.Vector128:CreateScalar(ushort):System.Runtime.Intrinsics.Vector128`1[UInt16]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[UInt32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M21746_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M21746_IG02:
4E040FF0 dup v16.4s, wzr
4E041C10 ins v16.s[0], w0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M21746_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=4a35ab0d) for method System.Runtime.Intrinsics.Vector128:CreateScalar(int):System.Runtime.Intrinsics.Vector128`1[UInt32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) long -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;* V02 tmp1 [V02 ] ( 0, 0 ) simd16 -> zero-ref HFA(simd16) "struct address for call/obj"
; V03 tmp2 [V03,T01] ( 2, 2 ) simd16 -> d16 HFA(simd16) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 0
G_M2664_IG01:
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M2664_IG02:
4E080FF0 dup v16.2d, xzr
4E081C10 ins v16.d[0], x0
4EB01E00 mov v0.16b, v16.16b
;; bbWeight=1 PerfScore 3.50
G_M2664_IG03:
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 28, prolog size 8, PerfScore 9.80, (MethodHash=714bf597) for method System.Runtime.Intrinsics.Vector128:CreateScalar(long):System.Runtime.Intrinsics.Vector128`1[UInt64]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[Int32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M25863_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M25863_IG02:
0E040FE0 dup v0.2s, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
4E041C00 ins v0.s[0], w0
;; bbWeight=1 PerfScore 6.00
G_M25863_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 12.70, (MethodHash=80a89af8) for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[Int32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(byte):System.Runtime.Intrinsics.Vector64`1[SByte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) byte -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M12309_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M12309_IG02:
0E010FE0 dup v0.8b, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
13001C00 sxtb w0, w0
4E011C00 ins v0.b[0], w0
;; bbWeight=1 PerfScore 6.50
G_M12309_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=1802cfea) for method System.Runtime.Intrinsics.Vector64:CreateScalar(byte):System.Runtime.Intrinsics.Vector64`1[SByte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(float):System.Runtime.Intrinsics.Vector64`1[Single]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> d0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d16 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M44268_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M44268_IG02:
0E040FF0 dup v16.2s, wzr
FD000FB0 str d16, [fp,#24]
FD400FB0 ldr d16, [fp,#24]
6E040410 ins v16.s[0], v0.s[0]
1E604200 fmov d0, d16
;; bbWeight=1 PerfScore 6.50
G_M44268_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=b5c65313) for method System.Runtime.Intrinsics.Vector64:CreateScalar(float):System.Runtime.Intrinsics.Vector64`1[Single]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(ushort):System.Runtime.Intrinsics.Vector64`1[UInt16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ushort -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M37504_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M37504_IG02:
0E020FE0 dup v0.4h, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
53003C00 uxth w0, w0
4E021C00 ins v0.h[0], w0
;; bbWeight=1 PerfScore 6.50
G_M37504_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=68536d7f) for method System.Runtime.Intrinsics.Vector64:CreateScalar(ushort):System.Runtime.Intrinsics.Vector64`1[UInt16]
; ============================================================
Collected JIT disassemblies with the changes rebased on top of latest master
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[UInt32]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M62450_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M62450_IG02:
0E040FE0 dup v0.2s, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
4E041C00 ins v0.s[0], w0
;; bbWeight=1 PerfScore 6.00
G_M62450_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 32, prolog size 8, PerfScore 12.70, (MethodHash=ab590c0d) for method System.Runtime.Intrinsics.Vector64:CreateScalar(int):System.Runtime.Intrinsics.Vector64`1[UInt32]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) ubyte -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M20083_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M20083_IG02:
0E010FE0 dup v0.8b, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
53001C00 uxtb w0, w0
4E011C00 ins v0.b[0], w0
;; bbWeight=1 PerfScore 6.50
G_M20083_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=cedeb18c) for method System.Runtime.Intrinsics.Vector64:CreateScalar(ubyte):System.Runtime.Intrinsics.Vector64`1[Byte]
; ============================================================
; Assembly listing for method System.Runtime.Intrinsics.Vector64:CreateScalar(short):System.Runtime.Intrinsics.Vector64`1[Int16]
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) short -> x0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T01] ( 2, 4 ) simd8 -> [fp+0x18] HFA(double) do-not-enreg[SF] "struct address for call/obj"
; V03 tmp2 [V03,T02] ( 2, 2 ) simd8 -> d0 HFA(double) ld-addr-op "Inline ldloca(s) first use temp"
;
; Lcl frame size = 16
G_M58336_IG01:
A9BE7BFD stp fp, lr, [sp,#-32]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M58336_IG02:
0E020FE0 dup v0.4h, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
13003C00 sxth w0, w0
4E021C00 ins v0.h[0], w0
;; bbWeight=1 PerfScore 6.50
G_M58336_IG03:
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; bbWeight=1 PerfScore 2.00
; Total bytes of code 36, prolog size 8, PerfScore 13.60, (MethodHash=e95c1c1f) for method System.Runtime.Intrinsics.Vector64:CreateScalar(short):System.Runtime.Intrinsics.Vector64`1[Int16]
; ============================================================
There are multiple issues here:
str d0, [fp,#24]
ldr d0, [fp,#24]
The code is the worst for Vector64<float>.CreateScalar()
0E040FF0 dup v16.2s, wzr
FD000FB0 str d16, [fp,#24]
FD400FB0 ldr d16, [fp,#24]
6E040410 ins v16.s[0], v0.s[0]
1E604200 fmov d0, d16
or Vector64<ushort>.CreateScalar()
0E020FE0 dup v0.4h, wzr
FD000FA0 str d0, [fp,#24]
FD400FA0 ldr d0, [fp,#24]
53003C00 uxth w0, w0
4E021C00 ins v0.h[0], w0
uxtb w0, w0
uxth w0, w0
sxtb w0, w0
sxth w0, w0
        return AdvSimd.Insert(Vector64<byte>.Zero, 0, value);
    }

    return SoftwareFallback(value);
Curious about wrapping the SoftwareFallback()
in a static method. What is the advantage of doing it?
Not sure, actually. I did this to be consistent with existing Vector128/256 implementations.
@tannergooding Do you know why?
With the logic for all the various paths, the method is too large for the normal inlining heuristics to work, so we need to mark it AggressiveInlining (since the accelerated paths will generally be pretty small).
However, we don't necessarily want the SoftwareFallback to be inlined, as that may not be beneficial.
Putting it in its own method prevents it from being inlined in the normal case and lets the JIT decide by itself whether it is too large.
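A minimal sketch of that pattern, assuming a hypothetical wrapper class `CreateScalarSketch`. The fallback body here uses `WithElement` purely for brevity and is not the runtime's actual implementation.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class CreateScalarSketch
{
    // The outer method is force-inlined: once the JIT resolves the
    // IsSupported check, the accelerated path is only a few instructions.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> CreateScalar(byte value)
    {
        if (AdvSimd.IsSupported)
        {
            return AdvSimd.Insert(Vector128<byte>.Zero, 0, value);
        }

        return SoftwareFallback(value);

        // The fallback is a separate static local function with no inlining
        // attribute, so the JIT applies its normal heuristics to it.
        static Vector128<byte> SoftwareFallback(byte scalar)
        {
            // Illustrative body only (an assumption, not the runtime's code).
            return Vector128<byte>.Zero.WithElement(0, scalar);
        }
    }
}
```

On hardware without AdvSimd, the call falls through to `SoftwareFallback`, and the result is the same either way: element 0 holds the value and the rest are zero.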
That makes sense.