-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arm64] Implement ASIMD Extract Insert ExtractVector64 ExtractVector128 #35030
[Arm64] Implement ASIMD Extract Insert ExtractVector64 ExtractVector128 #35030
Conversation
…imd.PlatformNotSupported.cs
…h hwintrinsiclistarm64.h namedintrinsiclist.h valuenumfuncs.h
…wintrinsic.h hwintrinsiclistarm64.h
…8 in lowerarmarch.cpp
src/coreclr/src/jit/hwintrinsic.h
Outdated
// NoContainment | ||
// the intrinsic cannot be handled by comtainment, | ||
// all the intrinsic that have explicit memory load/store semantics should have this flag | ||
HW_Flag_NoContainment = 0x40, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is misordered with respect to the rest of the flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I meant to update this before marking this PR as ready for review.
{ | ||
HWIntrinsicImmOpHelper helper(this, intrin.op2, node); | ||
|
||
for (helper.EmitAtFirst(); !helper.Done(); helper.EmitAfterCase()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could I get a brief explanation of why the ARM64 jmp table is so much more involved than the x86 one?
For x86, we just needed a small helper method that took in the intrinsic, registers, and a lambda that emitted the contents of each case statement: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiccodegenxarch.cpp#L1068-L1120
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call this approach more involved - here instead you need a helper class and no lambda at all. This is basically transforming .Select(Action<int> func)
to foreach (int imm in Immediates()) { /* do your action on imm */ }
Below are my ideas why I did it this way.
First, branching on arm64 could be potentially optimized in many different ways (e.g. due to the fact that all the instruction are fixed size).
Also branching at non zero (when imm can only be 0 or 1) is a special case that doesn't require an additional general-purpose register and I though it would be nice to generate more optimal code in this case.
Second, having a lambda instead leads to repetitive code when you first need to check if immOp is const then call the lambda with ival. Otherwise (if it's not const), you call the helper. While with this approach you define and use the code generation logic only once - in a loop - hiding all the details behind HWIntrinsicImmOpHelper. Actually, I implemented the approach with template function first but I didn't like how the code looked - especially for AdvSimd_Insert case - since we need different emitter functions depending on the base type.
I don't like the fact that we declare but NOT define template function in codegen.h. IMO, template functions if used should be defined in a header file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, branching on arm64 could be potentially optimized in many different ways (e.g. due to the fact that all the instruction are fixed size).
Also branching at non zero (when imm can only be 0 or 1) is a special case that doesn't require an additional general-purpose register and I though it would be nice to generate more optimal code in this case.
For these two bits, I don't think the optimization is that important. This is a fallback case meant for debuggers and reflection invocation.
Likewise on x86, the branching is already "optimal" as every case is exactly the same number of bytes (so its just a simple baseAddress + index * caseSize
then jump call)
Second, having a lambda instead leads to repetitive code when you first need to check if immOp is const then call the lambda with ival. Otherwise (if it's not const), you call the helper. While with this approach you define and use the code generation logic only once - in a loop - hiding all the details behind HWIntrinsicImmOpHelper. Actually, I implemented the approach with template function first but I didn't like how the code looked - especially for AdvSimd_Insert case - since we need different emitter functions depending on the base type.
I think this is just a trade-off of do you declare if (const) { lambda } else { emitJumpTable(lambda) }
or declare for () { similarLogicToLambda }
. You still have to redeclare some logic in all the same places, its just what is redeclared that differs
I don't like the fact that we declare but NOT define template function in codegen.h. IMO, template functions if used should be defined in a header file.
It was just placed in the cpp file since that is the only place it will ever be used from, similar to many other functions that aren't meant to be generally reusable (or aren't yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add my two cents - personally, I find lambdas problematic both to understanding and readability of the code, but also to debugging (though the latter will presumably improve over time). I find the approach that Egor has taken here to be quite understandable and readable, though I might make minor changes to the names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually the opposite and find the new code more complex, but its not terribly so and I was mainly interested in why we are differing.
Ideally we'd share these types of constructs as much as possible, rather than having the ARM and x86 code paths drastically differ.
… to avoid ifdef-s at places where this function is used in hwintrinsic.h hwintrinsicarm64.cpp hwintrinsicxarch.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, questions and suggestions.
@@ -541,7 +537,7 @@ GenTree* Compiler::getArgForHWIntrinsic(var_types argType, CORINFO_CLASS_HANDLE | |||
// add a GT_HW_INTRINSIC_CHK node for non-full-range imm-intrinsic, which would throw ArgumentOutOfRangeException | |||
// when the imm-argument is not in the valid range | |||
// | |||
GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand) | |||
GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand, int immUpperBound) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new argument should be documented in the header comment.
@@ -541,7 +537,7 @@ GenTree* Compiler::getArgForHWIntrinsic(var_types argType, CORINFO_CLASS_HANDLE | |||
// add a GT_HW_INTRINSIC_CHK node for non-full-range imm-intrinsic, which would throw ArgumentOutOfRangeException | |||
// when the imm-argument is not in the valid range | |||
// | |||
GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand) | |||
GenTree* Compiler::addRangeCheckIfNeeded(NamedIntrinsic intrinsic, GenTree* immOp, bool mustExpand, int immUpperBound) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The header comment needs to be updated for this additional argument.
@@ -315,6 +335,102 @@ struct HWIntrinsicInfo | |||
} | |||
}; | |||
|
|||
#ifdef TARGET_ARM64 | |||
|
|||
struct HWIntrinsic final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why this wrapper struct is needed and how it is used? It doesn't seem necessarily, and (to me) just obfuscates the creation logic. In any case, comments are needed to explain what this is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wrapper is used when we want to access the operands of GenTreeHWIntrinsic node.
Otherwise, the code that does lookup:
op1 = node->gtGetOp1();
op2 = node->gtGetOp2();
assert(op1 != nullptr);
if (op1->OperIsList())
{
assert(op2 == nullptr);
GenTreeArgList* list = op1->AsArgList();
op1 = list->Current();
list = list->Rest();
op2 = list->Current();
list = list->Rest();
op3 = list->Current();
assert(list->Rest() == nullptr);
numOperands = 3;
}
else if (op2 != nullptr)
{
numOperands = 2;
}
else
{
numOperands = 1;
}
would need to be repeated in Lower, LSRA, CodeGen and, perhaps, other places.
I had this wrapper in CodeGen originally but here I decided to extend its use to the other places.
As an alternative, I can place this code directly in GenTreeHWIntrinsic (or even GenTreeJitIntrinsic).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be cleaner to put it on one of the GenTree
nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I can try this. Would you object me doing this as a separate PR and leave the wrapper as is here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objection.
{ | ||
HWIntrinsicImmOpHelper helper(this, intrin.op2, node); | ||
|
||
for (helper.EmitAtFirst(); !helper.Done(); helper.EmitAfterCase()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add my two cents - personally, I find lambdas problematic both to understanding and readability of the code, but also to debugging (though the latter will presumably improve over time). I find the approach that Egor has taken here to be quite understandable and readable, though I might make minor changes to the names.
…Simd.PlatformNotSupported.cs
…imd.PlatformNotSupported.cs
…mmUpperBound in hwintrinsicarm64.cpp
…en::HWIntrinsicImmOpHelper in hwintrinsiccodegenarm64.cpp
…OpHelper in hwintrinsiccodegenarm64.cpp
…degen.h and hwintrinsiccodegenarm64.cpp
…m with update their values in hwintrinsic.h
…-ExtractVector128
@CarolEidt @tannergooding I believe I addressed all you comments and suggestions (except the one about wrapper struct - I asked if this could be a part of a separate PR). Can you please take a look when you have time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
This implements Extract, Insert, ExtractVector64 and ExtractVector128 intrinsics.
This also implements a way to generate a fallback mechanism for intrinsics accepting an immediate operand when the operand is not constant.
This renames NoContainment flag to SupportsContainment on Arm64 (presumably, there should be fewer intrinsics supporting containment analysis so it makes more sense to have NoContainment as default)
This removes ival column from hwintrinsiclistarm64.h table and the corresponding field in HWIntrinsicInfo struct.
The functionality of Insert and Extract for Vector64<double>, Vector64<long> and Vector64<ulong> will be implemented by CreateScalar() and ToScalar() methods so I removed those from the API surface.
Fixes #34228 and fixes #24588, contributes to #24794 (ExtractVector64 and ExtractVector128)
I put below some examples of the generated code for a fallback "switch" table.
ExtractVector64(Vector64, Vector64, ubyte)
ExtractVector64(Vector64, Vector64, ubyte)
ExtractVector128(Vector128, Vector128, ubyte)