
Add one-sided widening intrinsics. #6967

Merged: 10 commits merged into main on Sep 13, 2022
Conversation

rootjalex
Member

This PR adds three one-sided widening intrinsics:

widen_right_add(a, b) = a + widen(b)
widen_right_sub(a, b) = a - widen(b)
widen_right_mul(a, b) = a * widen(b)

These intrinsics are not intended to be used in the front end (i.e. widen_right_add(x_u16, y_u8) is exactly equivalent to the terser x_u16 + y_u8 due to implicit casting), but are useful for a number of peephole optimizations on ARM and HVX (to come in later PRs).
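As a sanity check on the semantics above, here is a minimal Python model of the three intrinsics for the u16/u8 case (the helper names and the explicit 16-bit wrap are illustrative, not Halide's actual implementation):

```python
U16_MASK = 0xFFFF

def widen_u8_to_u16(b):
    # Widening an unsigned value is always value-preserving.
    assert 0 <= b <= 0xFF
    return b

def widen_right_add(a, b):
    # a + widen(b), wrapping at 16 bits
    return (a + widen_u8_to_u16(b)) & U16_MASK

def widen_right_sub(a, b):
    # a - widen(b), wrapping at 16 bits
    return (a - widen_u8_to_u16(b)) & U16_MASK

def widen_right_mul(a, b):
    # a * widen(b), wrapping at 16 bits
    return (a * widen_u8_to_u16(b)) & U16_MASK

# Equivalent to the implicit-cast front-end form x_u16 + y_u8:
assert widen_right_add(0xFFF0, 0x20) == 0x0010  # wraps like u16 addition
assert widen_right_sub(0x0000, 0x01) == 0xFFFF
assert widen_right_mul(0x1000, 0x10) == 0x0000
```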

The addition of widen_right_mul actually simplifies some of the HexagonOptimize code considerably, and we can remove some of the pattern flags there. This PR also made me realize the semantic equivalence of two of the peephole optimizations that were in HexagonOptimize's OptimizePatterns Mul visitor:

{"halide.hexagon.mul.vw.vuh", wild_i32x * wild_i32x, Pattern::ReinterleaveOp0 | Pattern::NarrowUnsignedOp1},
{"halide.hexagon.mul.vuw.vuh", wild_u32x * wild_u32x, Pattern::ReinterleaveOp0 | Pattern::NarrowUnsignedOp1},

these both perform multiplication with zxt applied to the second argument, and the first generates a shorter code sequence, so the latter has been removed. Would love for someone from Qualcomm to confirm that this is indeed correct, @pranavb-ca or @aankit-ca ?

Given the significant issues with merging #6900 into Google, it might be ideal for Qualcomm/Adobe/Google/others to confirm that this PR does not break anything before merging it in. Other than the HVX optimization, there should be no other differences in codegen, but I have other PRs in the works that use these patterns for improved pattern matching.

@pranavb-ca
Contributor

Thank you for this PR @rootjalex. As explained on gitter, I am still out of the office (back on Monday 8/29). @aankit-ca @sksarda - can one of you please test this PR and make sure that the removal of the second pattern in particular doesn't cause a regression? I think conv3x3 might be a good test for us, no? Or our implementation of gaussian5x5; anything that accumulates widening multiplies. Definitely check simd_op_check_hvx.

@rootjalex
Member Author

Thanks, and sorry for bothering you on OOTO time!

> Definitely check simd_op_check_hvx

FWIW, I did change one check in that test, but I believe it was an improvement.

@steven-johnson
Contributor

There appear to be failures in simd_op_check_hvx

@rootjalex
Member Author

> There appear to be failures in simd_op_check_hvx

I am investigating. It's quite weird: the Stmt is exactly the same for the failing tests (so the lifting is not causing it), but I can't tell how the lowering would impact a shift-right-narrow expression.

@rootjalex
Member Author

Ah, it was because of a missing '!', whoops.

@aankit-ca
Contributor

@rootjalex @pranavb-ca @sksarda Aren't there chances of overflow when using halide.hexagon.mul.vw.vuh for widen_right_mul(wild_u32x, wild_u16x)? E.g. INT32_MAX * 2. Using vmpyiewuh might not yield correct results.

@rootjalex
Member Author

@aankit-ca Does HVX implement wrap-around for multiplication overflow? If so, widen_right_mul((uint32)reinterpret(INT32_MAX), (uint16)2), widen_right_mul((int32)INT32_MAX, (int16)2), and (int32)INT32_MAX * int32((int16)2) should all be exactly equivalent, and should all overflow in the same way.
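The wrap-around claim can be checked with a small Python sketch (a bit-level model of 32-bit two's-complement arithmetic, not HVX itself): reinterpreting the signed operand as unsigned leaves the low 32 bits of the product unchanged.

```python
INT32_MAX = 2**31 - 1
M32 = 2**32

def as_u32(x):
    # reinterpret the low 32 bits as unsigned
    return x & (M32 - 1)

def as_i32(x):
    # reinterpret the low 32 bits as signed two's complement
    x &= M32 - 1
    return x - M32 if x >= 2**31 else x

# (int32)INT32_MAX * 2 with wrap-around:
signed_product = as_i32(INT32_MAX * 2)
# widen_right_mul((uint32)reinterpret(INT32_MAX), (uint16)2), reinterpreted back:
unsigned_product = as_i32(as_u32(INT32_MAX) * 2)

assert signed_product == unsigned_product == -2
```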

@rootjalex
Member Author

That being said, there are some failing tests, so perhaps I am wrong. I will investigate those tomorrow morning.

@aankit-ca
Contributor

> @aankit-ca Does HVX implement wrap-around for multiplication overflow? If so, widen_right_mul((uint32)reinterpret(INT32_MAX), (uint16)2), widen_right_mul((int32)INT32_MAX, (int16)2), and (int32)INT32_MAX * int32((int16)2) should all be exactly equivalent, and should all overflow in the same way.

I don't think so. The documentation for vmpyiewuh does not specify the semantics in case of overflows.

@rootjalex
Member Author

@aankit-ca Does Hexagon use different multiplication implementations for signed versus unsigned multiplication? I expected it, like all other processors that I know of, to use a single multiply implementation per bitwidth.

@aankit-ca
Contributor

@rootjalex Thanks for the explanation. Your change seems right. The same instruction should work for both signed and unsigned numbers.

@rootjalex
Member Author

@aankit-ca great, thanks for confirming!

@abadams Both of the performance failures look like flakes (those two tests have been pretty flaky for me on previous PRs). Think this is good to go?

Also @steven-johnson would you be able to check if this PR causes any google failures?

@steven-johnson
Contributor

> Also @steven-johnson would you be able to check if this PR causes any google failures?

Will do.

@rootjalex
Member Author

> > Also @steven-johnson would you be able to check if this PR causes any google failures?
>
> Will do.

Thank you!

@steven-johnson
Contributor

It appears this may be injecting some breakage in the C++ backend -- we are missing some overloads of operator* for some vector types. Not 100% sure if it's this PR or not, though; investigating.

if (b.type().code() != narrow_a.type().code()) {
// Need to do a safe reinterpret.
Type t = b.type().with_code(code);
result = widen_right_mul(reinterpret(t, b), narrow_a);
Member


(This is a follow-up to another comment from Steven regarding failures in the C++ backend.)

I am a bit confused by the reinterpret here; do I understand correctly that it might turn wild_i32x * u32(wild_u16x) into widen_right_mul(wild_u32x, wild_u16x)?

The problem we are seeing is in the Xtensa backend (not exactly a C++ backend, but derived from it; also, it lives in a separate branch, so it's a bit harder to keep track of). We currently don't handle this intrinsic, so at the code generation stage we call lower_intrinsic once we encounter widen_right_mul. As a result, the following happens:

  1. we start with an expression like wild_i32x * widen(wild_u16x)
  2. it gets transformed into widen_right_mul(wild_u32x, wild_u16x)
  3. it gets lowered back to wild_u32x * widen(wild_u32x) [notice that the left operand became unsigned]

If that's correct, then the input of step 1 is not equivalent to the output of step 3, which seems a bit problematic? Is this transformation correct from a numerical point of view (I guess it depends on the actual implementation)? The specific problem we see in the Xtensa backend is that Xtensa doesn't seem to have an intrinsic for multiplying two wild_u32x vectors, but it does have intrinsics for wild_i32x * wild_i32x and wild_i32x * wild_u32x (I know it's a bit weird, and I can try to find out more details, but it may take some time).
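The round trip in steps 1-3 can be checked numerically with a Python sketch (a model of 32-bit two's-complement arithmetic, not the Xtensa codegen): because (a mod 2^32) * b mod 2^32 equals a * b mod 2^32, the reinterpreted unsigned multiply produces the same 32-bit result as the original signed multiply.

```python
M32 = 2**32

def as_u32(x):
    return x & (M32 - 1)  # reinterpret bits as u32

def as_i32(x):
    x &= M32 - 1
    return x - M32 if x >= 2**31 else x  # reinterpret bits as i32

a_i32, b_u16 = -7, 40000  # arbitrary in-range operands

# Step 1: wild_i32x * widen(wild_u16x)
before = as_i32(a_i32 * b_u16)
# Steps 2-3: operands reinterpreted to unsigned, multiplied, reinterpreted back
after = as_i32(as_u32(a_i32) * b_u16)

assert before == after == -280000
```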

Member Author


Numerically, the expressions are still equivalent: integer multiplication is the same for signed and unsigned arguments.

I see your issue though, and I see two possible solutions:

  1. in intrinsic lowering, specifically look for the pattern reinterpret(widen_right_op(reinterpret(a), b)) and lower it without the reinterprets (this seems messy)

  2. Xtensa should specifically pattern-match widen_right_mul(reinterpret(i32), u16) and use the wild_i32 * wild_u32 op that you mentioned

I believe the latter is better; what do you think? Feasibly both could be implemented.

Member Author


I guess we could also change the design of the intrinsics to be
a op cast(a.type(), b)
I think there was some reason we chose not to do that initially. @abadams, do you remember why?

Member


Yes, option 1 would be better in my opinion, but it seems difficult to implement due to the outer reinterpret(), so probably not worth the effort or complexity.

I certainly can do option 2; this should be pretty straightforward. I was concerned that expressions are not equivalent after find_intrinsics -> lower_intrinsic, but it sounds like it should be fine numerically.

I think if this is the only issue we see in Google testing, then it should be fine to merge, and I can address the issue before updating Halide in Google.

Contributor


I'd prefer we address the Xtensa issue first, so that we can complete a test of this change in Google before landing. (Currently there are a lot of false-positive failures in the test due to this.)

Member


That's cool with me, I'll look into it.

Member Author


@vksnk I realize now that I should have asked more about the Xtensa multiplication intrinsics: because multiplication is not parameterized by sign, shouldn't the intrinsic you mentioned being used for i32 x i32 multiplication work for any 32-bit integer multiplication?

That being said, I'm a little unsure what the i32 x u32 multiplication is used for. Is that a widening multiply, by chance?

Member Author


And to address your concern about find_intrinsics -> lower_intrinsics not matching perfectly: unfortunately, that is already the case for most (possibly all?) of the intrinsics, though I believe these are the only intrinsics that will add reinterprets.

@steven-johnson
Contributor

Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

@rootjalex
Member Author

> Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?

Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).

@vksnk
Member

vksnk commented Sep 9, 2022

> > Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?
>
> Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).

Yes, sorry for the delay, it's blocked on me. I needed to finish something else first, but working on the fix for this now.

@steven-johnson
Contributor

> Yes, sorry for the delay, it's blocked on me.

No worries, just checking :-)

@vksnk
Member

vksnk commented Sep 12, 2022

> > > Where do we stand on this PR -- are we awaiting the Xtensa fixes, or what?
> >
> > Yep, just waiting on Xtensa fixes. Otherwise I think this PR is good to go (or at least, good to be reviewed again).
>
> Yes, sorry for the delay, it's blocked on me. I needed to finish something else first, but working on the fix for this now.

Should be good now: I have a fix for the Xtensa issue and will update the branch once this is submitted.

Contributor

@steven-johnson left a comment


LGTM pending the necessary changes from @vksnk landing first.

@vksnk
Member

vksnk commented Sep 13, 2022

> LGTM pending the necessary changes from @vksnk landing first.

I can't land my changes because this PR (where new intrinsics are introduced) needs to be merged in first.

@steven-johnson
Contributor

steven-johnson commented Sep 13, 2022

> I can't land my changes because this PR (where new intrinsics are introduced) needs to be merged in first.

Ahh, right. I guess I should make an experimental branch with both this and your change and do a test integration first (since the last test turned up breakage). Please point me at your relevant change(s) via Chat and I'll get it cranking.

UPDATE: @vksnk says he's already done this, so, LGTM :-)

@rootjalex rootjalex merged commit 27b8a7d into main Sep 13, 2022
@rootjalex rootjalex deleted the rootjalex/extend-intrinsics branch September 13, 2022 18:14
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
* implement widen_right_ ops

* update HVX patterns with one-sided widening intrinsics

* remove unused HVX pattern flags

* strengthen logic for finding rounding shifts

Co-authored-by: Steven Johnson <[email protected]>