Add one-sided widening intrinsics. #6967
Conversation
Thank you for this PR @rootjalex. As explained on gitter, I am still out of the office (back on Monday 8/29). @aankit-ca @sksarda - can one of you please test this PR and make sure that the removal of the second pattern in particular doesn't regress? I think for us conv3x3 might be a good test, no? Or our impl. of gaussian5x5. Anything that is accumulating widening multiplies. Definitely check simd_op_check_hvx |
Thanks, and sorry for bothering you on OOTO time!
Fwiw, I did change one check in that test, but I believe it was an improvement |
There appear to be failures in simd_op_check_hvx |
I am investigating - it's quite weird, the STMT is exactly the same for the failing tests (so the lifting is not causing it), but I can't tell how the lowering would impact a shift right narrow expression |
Ah, it was because of a missing '!', whoops. |
@rootjalex @pranavb-ca @sksarda Aren't there chances of overflows with using |
@aankit-ca Does HVX implement wrap-around for multiplication overflow? If so, |
That being said, there are some failing tests, so perhaps I am wrong. I will investigate those tomorrow morning. |
I don't think so. The documentation for |
@aankit-ca Does Hexagon use different multiplication implementations for signed versus unsigned multiplication? I expected it, like all other processors that I know of, to use a single multiply implementation per bitwidth |
@rootjalex Thanks for the explanation. Your change seems right. The same instruction should work for both signed and unsigned numbers. |
@aankit-ca great, thanks for confirming! @abadams Both of the performance failures look like flakes (those two tests have been pretty flaky for me on previous PRs). Think this is good to go? Also @steven-johnson would you be able to check if this PR causes any google failures? |
Will do. |
Thank you! |
It appears this may be injecting some breakage in the C++ backend -- we are missing some overloads for |
if (b.type().code() != narrow_a.type().code()) {
    // Need to do a safe reinterpret.
    Type t = b.type().with_code(code);
    result = widen_right_mul(reinterpret(t, b), narrow_a);
(This is a follow-up to another comment from Steven regarding failures in the C++ backend.)
I am a bit confused by the reinterpret here; do I understand correctly that it might turn wild_i32x * u32(wild_u16x) into widen_right_mul(wild_u32x, wild_u16x)?
The problem we are seeing is in the Xtensa backend (not exactly a C++ backend, but it's derived from it; also, it lives in a separate branch, so it's a bit harder to keep track of). We currently don't handle this intrinsic, so at the code generation stage we will call lower_intrinsic once we encounter widen_right_mul. As a result, the following will happen:
- we start with an expression like wild_i32x * widen(wild_u16x)
- it gets transformed into widen_right_mul(wild_u32x, wild_u16x)
- it gets lowered back to wild_u32x * widen(wild_u32x) [notice that the left operand became unsigned]
If that's correct, then the input of step 1 is not equivalent to the output of step 3, which seems a bit problematic. Is this transformation correct from a numerical point of view (I guess it depends on the actual implementation)? The specific problem we see in the Xtensa backend is that Xtensa doesn't seem to have an intrinsic for multiplication of two wild_u32x vectors, but it does have intrinsics for wild_i32x * wild_i32x and wild_i32x * wild_u32x (I know it's a bit weird and I can try to find out more details about it, but it may take some time).
Numerically, the expressions are still equivalent, integer multiplication is the same for signed or unsigned arguments.
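For example, here is a small standalone check (plain C++, not Halide code; the values are arbitrary) illustrating that the low 32 bits of a product do not depend on whether the wide operand is interpreted as signed or unsigned:

#include <cassert>
#include <cstdint>

int main() {
    int32_t a = -1000;   // the i32 operand that gets reinterpreted to u32
    uint16_t b = 40000;  // the narrow u16 operand that gets widened

    // Product of the expression as written: i32 * widen(u16), computed in
    // 64 bits and then truncated to the low 32 bits.
    uint32_t as_signed = (uint32_t)((int64_t)a * (int64_t)b);

    // Product after the reinterpret: u32 * widen(u16), wrapping on overflow.
    uint32_t as_unsigned = (uint32_t)a * (uint32_t)b;

    // Multiplication modulo 2^32 is sign-agnostic, so the results agree.
    assert(as_signed == as_unsigned);
    return 0;
}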
I see your issue though, and I see two possible solutions:
1) in lowering intrinsics, specifically look for the pattern reinterpret(widen_right_op(reinterpret(a), b)), and lower it without the reinterprets (this seems messy)
2) Xtensa should specifically pattern match widen_right_mul(reinterpret(i32), u16) and use the wild_i32 * wild_u32 op that you mentioned
I believe that the latter is better; what do you think? Feasibly both could be implemented.
I guess we could also change the design of the intrinsics to be
a op cast(a.type(), b)
I think there was some reason that we chose not to do that initially. @abadams do you remember why?
Yes, 1) would be better in my opinion, but it seems difficult to implement due to the outer reinterpret(), so it's probably not worth the effort or complexity.
I certainly can do 2); this should be pretty straightforward. I was concerned that expressions are not equivalent after find_intrinsics -> lower_intrinsic, but it sounds like it should be fine numerically.
I think, if this is the only issue we see in Google testing then it should be fine to merge in and I can address the issue before updating Halide in Google.
I'd prefer we address the Xtensa issue first, so that we can complete a test of this change in Google before landing. (Currently there are a lot of false-positive failures in the test due to this.)
That's cool with me, I'll look into it.
@vksnk I realize now that I should have asked more about the Xtensa multiplication intrinsics - because multiplication is not parameterized by sign, shouldn't the intrinsic you mentioned for i32 x i32 multiplication be usable for any 32-bit integer multiplication?
That being said, I’m a little unsure what the i32 x u32 multiplication is used for. Is that a widening multiply by chance?
And to address your concern about find_intrinsics -> lower_intrinsics not matching perfectly - unfortunately, that is already the case for most (possibly all?) of the intrinsics, though I believe these are the only intrinsics that will add reinterprets
Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what? |
Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again). |
Yes, sorry for the delay, it's blocked on me. I needed to finish something else first, but working on the fix for this now. |
No worries, just checking :-) |
Should be good now; I have a fix for the Xtensa issue and will update the branch once this is submitted. |
LGTM pending the necessary changes from @vksnk landing first.
I can't land my changes because this PR (where new intrinsics are introduced) needs to be merged in first. |
UPDATE: @vksnk says he's already done this, so, LGTM :-) |
* implement widen_right_ ops
* update HVX patterns with one-sided widening intrinsics
* remove unused HVX pattern flags
* strengthen logic for finding rounding shifts

Co-authored-by: Steven Johnson <[email protected]>
This PR adds three one-sided widening intrinsics:
These intrinsics are not intended to be used in the front end (i.e. widen_right_add(x_u16, y_u8) is exactly equivalent to the terser x_u16 + y_u8 due to implicit casting), but are useful for a number of peephole optimizations on ARM and HVX (to come in later PRs).
The addition of widen_right_mul actually simplifies some of the HexagonOptimize code considerably, and we can remove some of the pattern flags there. This PR also made me realize the semantic equivalence of two of the peephole optimizations that were in HexagonOptimize's OptimizePatternsMul visitor: these both perform multiplication with zxt applied to the second argument, and the first generates a shorter code sequence, so the latter has been removed. Would love for someone from Qualcomm to confirm that this is indeed correct, @pranavb-ca or @aankit-ca?
Given the significant issues with merging #6900 into Google, it might be ideal for Qualcomm/Adobe/Google/others to confirm that this PR does not break anything before merging it in. Other than the HVX optimization, there should be no other differences in codegen, but I have other PRs in the works that use these patterns for improved pattern matching.
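For context, here is a minimal front-end sketch (buffer names and shapes are invented for illustration) of the kind of mixed-width expression these intrinsics describe; the intrinsics themselves are internal and are never written directly by users:

#include "Halide.h"
using namespace Halide;

int main() {
    ImageParam in_a(UInt(16), 1);  // hypothetical u16 input
    ImageParam in_b(UInt(8), 1);   // hypothetical u8 input
    Var x("x");

    Func f("f");
    // The front end just writes the mixed-width addition; the u8 operand is
    // implicitly widened to u16. This is the computation that an internal
    // widen_right_add-style intrinsic represents.
    f(x) = in_a(x) + in_b(x);

    f.compile_jit();
    return 0;
}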