Add support for wasm-simd saturating-narrow ops. #5854
Conversation
Also, some drive-by clarifications to other wasm-simd instructions in simd_op_check -- some of the yet-to-be-implemented ones are of dubious use in Halide and may not be worth implementing.
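For context, here is a hedged sketch of the kind of Halide expressions these patterns are aimed at -- saturating narrowing casts, which wasm-simd provides directly via its narrow instructions. The function and variable names below are illustrative, not the PR's actual test code:

    #include "Halide.h"
    using namespace Halide;
    using namespace Halide::ConciseCasts;

    void saturating_narrow_examples() {
        Var x("x");
        ImageParam i16_in(Int(16), 1), i32_in(Int(32), 1);
        Func f("f"), g("g");
        f(x) = u8_sat(i16_in(x));   // signed 16 -> unsigned 8, saturating
        g(x) = i16_sat(i32_in(x));  // signed 32 -> signed 16, saturating
        f.vectorize(x, 16);         // a candidate for i8x16.narrow_i16x8_u
        g.vectorize(x, 8);          // a candidate for i16x8.narrow_i32x4_s
    }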
test/correctness/simd_op_check.h (Outdated)

    // Occasionally useful for debugging: stop at the first failure.
    // std::cerr << error_msg.str();
    // exit(-1);
Intended to submit this?
BTW, this is what I use HL_SIMD_OP_CHECK_FILTER for.
Yeah, I did intend to -- the issue here is that even with HL_SIMD_OP_CHECK_FILTER you can run several tests (different vectorize widths, different factors, etc.). I'll just remove these if it's a nuisance, but I thought it was a handy reminder of where to insert these checks.
src/runtime/wasm_math.ll (Outdated)

    ret <8 x i16> %3
    }

    ; The wasm-simd ops always treat inputs as signed; clear the high bit for unsigned inputs to get the right answer.
I don't think this is correct with 2's complement arithmetic?
I'm not sure it's faster than just not using a pattern. It's just a min followed by a narrow, vs. this implementation is a bitwise op followed by a narrow. Unless we know something like this is faster, it's better to just not do anything special for this.
> I don't think this is correct with 2's complement arithmetic?
Yeah, you're right -- I was thinking that anything with the high bit set should be clamped to the max value (true!) but this would be wrong for the case of 0x8001, which should clamp to 0xff, but would become 0x01 with this logic.
> I'm not sure it's faster than just not using a pattern. It's just a min followed by a narrow, vs. this implementation is a bitwise op followed by a narrow. Unless we know something like this is faster, it's better to just not do anything special for this.
Leaving it without a specialization causes scalarization of the operation, which is pretty terrible, even if this is a rarely-used operation. Replacing the bitmask with a vector-min operation is perhaps the right thing here.
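To make the 0x8001 case above concrete, here is a minimal scalar sketch (plain C++, not the PR's code) of why clearing the high bit is wrong, and why clamping with a min is right, for an unsigned 16 -> unsigned 8 saturating narrow:

    #include <algorithm>
    #include <cassert>
    #include <cstdint>

    int main() {
        uint16_t v = 0x8001;  // 32769 as unsigned: should saturate to 0xff

        // Clearing the high bit before the (signed-input) narrow turns 0x8001
        // into 0x0001, which then narrows to 0x01 -- the wrong answer.
        uint8_t wrong = static_cast<uint8_t>(v & 0x7fff);

        // Clamping to the u8 range first, then narrowing, gives the right answer.
        uint8_t right = static_cast<uint8_t>(std::min<uint16_t>(v, 0xff));

        assert(wrong == 0x01);
        assert(right == 0xff);
        return 0;
    }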
Why would this scalarize? If we don't pattern match this, it should just be a vector min followed by a (non-saturating) vector narrowing.
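In Halide terms, that un-pattern-matched lowering would look roughly like the following sketch (illustrative names, not the actual codegen):

    #include "Halide.h"
    using namespace Halide;

    // An unsigned saturating narrow expressed as a vector min followed by a
    // plain (non-saturating) narrowing cast.
    Expr narrow_u16_to_u8_sat(const Expr &v) {
        // v is assumed to be a UInt(16), possibly vector, expression.
        Expr clamped = min(v, cast(v.type(), 255));
        return cast(UInt(8, v.type().lanes()), clamped);
    }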
Because the LLVM wasm backend doesn't (yet) have the logic to intelligently vectorize any of these (which is why we're special casing them in the first place). Maybe it will eventually, but we don't need to wait.
Does this mean that simple unsaturated narrowing casts are scalarizing on wasm?
Last time I checked, yes. (implementing those directly is next on my list; I didn't actually realize that these instructions were saturating until I read the implementation details.)
Checking this morning, the codegen for u16(u32_1) with vectorize = 8 or higher (in recent LLVM 13) ends up with lots of extract_lane and replace_lane ops, which is distressingly suboptimal when a plain old shuffle will do the trick. I'll offer a PR to do non-saturating narrows before this, since it will make the remaining saturating cases trivial.
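A hedged, standalone reproduction of that narrowing cast (the u32_1 name mirrors simd_op_check's convention; the output filename and target string below are illustrative):

    #include "Halide.h"
    using namespace Halide;
    using namespace Halide::ConciseCasts;

    int main() {
        ImageParam u32_1(UInt(32), 1);
        Var x("x");
        Func f("f");
        f(x) = u16(u32_1(x));  // plain, non-saturating narrowing cast
        f.vectorize(x, 8);     // widths >= 8 showed the extract_lane/replace_lane codegen
        f.compile_to_assembly("narrow_u32_to_u16.s", {u32_1},
                              Target("wasm-32-wasmrt-wasm_simd128"));
        return 0;
    }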
Update: it appears that the current LLVM wasm backend really wants to emit these sorts of operations as extract/replace-lane pairs, even if I try to get it to emit (say) shuffle operations directly. Not sure if this is a bug in my code, a bug in the LLVM backend, or maybe a feature in the LLVM backend (i.e. perhaps current wasm VMs generate better code for those operations via that sequence?). I've contacted the relevant folks to get clarity, but until I do, I'm going to just skip the unsigned->unsigned specialization attempt entirely, so we can land the customization for the existing ops.
(Cannot land until #5853 lands.)