-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arm64] Implement MultiplyHigh #47362
Conversation
Note regarding the This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change. |
Tagging subscribers to this area: @tannergooding Issue DetailsCloses #43106
|
08168be
to
5c1a848
Compare
5c1a848
to
16cd135
Compare
@dotnet/jit-contrib PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@echesakovMSFT out of curiosity what does the JIT generate for public static long BigMul(int a, int b)
{
return ((long)a) * b;
} that indicates the multiplication has to be done as |
@@ -151,6 +151,11 @@ public static unsafe ulong BigMul(ulong a, ulong b, out ulong low) | |||
low = tmp; | |||
return high; | |||
} | |||
else if (ArmBase.Arm64.IsSupported) | |||
{ | |||
low = a * b; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably a code pattern that the JIT should handle and generate UMULL
and SMULL
This has been previously blocked due to codegen issues, but I think the work Carol did around multi-reg returns unblocked that, in which case these should forward to an intrinsic (long, long) BigMul(long, long)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding I am not sure I understand you comment about handling and generating umull
/smull
in context of BigMul(long, long)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My mistake, I misread and thought UMULL
had a variant that multiplies two 64-bit and returns the 128-bit result in 2x register like IMUL
and MUL
support on x86/x64.
It looks like it's just 32x32=64
and 64x64=upper 64
(the latter via UMULH
). I don't see a variant that also returns the lower 64
with the same computation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually looks like you might be able to achieve it using PMULL
and/or PMULL2
, but it isn't clear if that's "always a win"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe PMULL{2}
can be used at this context - results of the instructions (polynomial multiplication) does't correspond to the result of integer multiplications.
But there is a definitely room for improving 32-bit integer multiplication (see #47490) that would result in more efficient code for BigMul(int, int)
@@ -151,6 +151,11 @@ public static unsafe ulong BigMul(ulong a, ulong b, out ulong low) | |||
low = tmp; | |||
return high; | |||
} | |||
else if (ArmBase.Arm64.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mono will need support for these. Alternatively, you'll need to make Mono return false for IsSupported. Otherwise, any use of BigMul
on arm64 with LLVM JIT/AOT will make Mono crash with a stack overflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @echesakovMSFT better just put 0
to
EMIT_NEW_ICONST (cfg, ins, supported ? 1 : 0); |
For reference, here is the IR we need to emit: https://godbolt.org/z/64rvjP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if mono's support could be table driven like it is in RyuJIT.
Then it would (ideally) be relatively trivial most of the time to just say Arm64.MultiplyHigh
maps to llvm.intrinsic....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding While I agree in general, I don't think there is an intrinsic for this particular thing, as you can see from my godbolt link you will have to emit at least 5 different statements for this operation so a table won't help here 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. ARM64 actually doesn't define a C++ intrinsic for this: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics?search=UMULH, so LLVM won't have a corresponding llvm.intrinsic
entry, 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave it a try and implemented MultiplyHigh
in mono.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
runtime/src/tests/issues.targets
Line 2629 in 1765d03
<ExcludeList Include = "$(XunitTestBinBase)/JIT/HardwareIntrinsics/Arm/ArmBase.Arm64/ArmBase.Arm64_ro/**"> |
disabled on purpose with mono? If so, what would be an alternative way to verify that my changes are correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if they work now. Presumably they failed to pass although nothing was never recorded. They should (and will) be reenabled. That work is being tracked here: #43842
You could try reenabling this specific test and seeing what happens. You could also build Mono as an arm64 AOT cross compiler if you don't have immediate access to any arm64 hardware. Here are some instructions for doing this: https://gist.github.com/imhameed/e4b246dbf0d0b247155fc9e3326194b6
This is not ideal, but this will get better soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried your branch on an arm64 machine.
Given:
[MethodImpl(MethodImplOptions.NoInlining)]
public static ulong test_mulhi(ulong x, ulong y) {
return ArmBase.Arm64.MultiplyHigh(x, y);
}
it yields:
0:» 9bc17c00 » umulh» x0, x0, x1
4:» d65f03c0 » ret
and given:
[MethodImpl(MethodImplOptions.NoInlining)]
public static long test_mulhi(long x, long y) {
return ArmBase.Arm64.MultiplyHigh(x, y);
}
it yields:
0:» 9b417c00 » smulh» x0, x0, x1
4:» d65f03c0 » ret
And the intermediate mini IR and LLVM IR all looks reasonable. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried your branch on an arm64 machine.
@imhameed Thanks a lot for validating the change!
@TamarChristinaArm For ; Assembly listing for method System.Math:BigMul(int,int):long
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; ReadyToRun compilation
; optimized code
; fp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) int -> x0
; V01 arg1 [V01,T01] ( 3, 3 ) int -> x1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [sp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M36318_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
;; bbWeight=1 PerfScore 1.50
G_M36318_IG02: ;; offset=0008H
93407C00 sxtw x0, w0
93407C21 sxtw x1, w1
9B017C00 mul x0, x0, x1
;; bbWeight=1 PerfScore 3.00
G_M36318_IG03: ;; offset=0014H
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr Opened #47490 to track this |
Closes #43106
In addition to implementing the intrinsics I have updated
System.Math:BigMul(long,long,byref):long
implementation in System.Private.CoreLib. The following is the codegen of the methods: