
Add support for wasm-simd saturating-narrow ops. #5854

Merged 21 commits into master on Mar 30, 2021

Conversation

steven-johnson (Contributor):

(Cannot land until #5853 lands.)

Also, some drive-by clarifications to other wasm-simd instructions in simd_op_check -- some of the yet-to-be-implemented ones are of dubious use in Halide and may not be worth implementing.

// Occasionally useful for debugging: stop at the first failure.
// std::cerr << error_msg.str();
// exit(-1);
Contributor:
Intended to submit this?

BTW, this is what I use HL_SIMD_OP_CHECK_FILTER for.

Contributor Author:

Yeah, I did intend to -- the issue here is that even with HL_SIMD_OP_CHECK_FILTER you can run several tests (different vectorize widths, different factors, etc.). I'll just remove these if it's a nuisance, but I thought it was a handy reminder of where to insert these checks.

ret <8 x i16> %3
}

; The wasm-simd ops always treat inputs as signed; clear the high bit for unsigned inputs to get the right answer.
Contributor:

I don't think this is correct with 2's complement arithmetic?

I'm not sure it's faster than just not using a pattern. It's just a min followed by a narrow, vs. this implementation is a bitwise op followed by a narrow. Unless we know something like this is faster, it's better to just not do anything special for this.

Contributor Author:

> I don't think this is correct with 2's complement arithmetic?

Yeah, you're right -- I was thinking that anything with the high bit set should be clamped to the max value (true!), but this would be wrong for the case of 0x8001, which should clamp to 0xff but would become 0x01 with this logic.

> I'm not sure it's faster than just not using a pattern. It's just a min followed by a narrow, vs. this implementation is a bitwise op followed by a narrow. Unless we know something like this is faster, it's better to just not do anything special for this.

Leaving it without a specialization causes scalarization of the operation, which is pretty terrible, even if this is a rarely-used operation. Replacing the bitmask with a vector-min operation is perhaps the right thing here.

Contributor:

Why would this scalarize? If we don't pattern match this, it should just be a vector min followed by a (non-saturating) vector narrowing.

Contributor Author:

Because the LLVM wasm backend doesn't (yet) have the logic to intelligently vectorize any of these (which is why we're special casing them in the first place). Maybe it will eventually, but we don't need to wait.

Contributor:

Does this mean that simple unsaturated narrowing casts are scalarizing on wasm?

Contributor Author:

Last time I checked, yes. (Implementing those directly is next on my list; I didn't actually realize these instructions were saturating until I read the implementation details.)

Contributor Author:

Checking this morning, the codegen for u16(u32_1) with vectorize = 8 or higher (in recent LLVM 13) ends up with lots of extract_lane and replace_lane ops, which is distressingly suboptimal when a plain old shuffle would do the trick. I'll offer a PR to do non-saturating narrows before this one, since it will make the remaining saturating cases trivial.

Contributor Author:

Update: it appears that the current LLVM wasm backend really wants to emit these sorts of operations as extract/replace-lane pairs, even if I try to get it to emit (say) shuffle operations directly. I'm not sure whether this is a bug in my code, a bug in the LLVM backend, or maybe a feature of the LLVM backend (i.e. perhaps current wasm VMs generate better code for those operations via that sequence). I've contacted the relevant folks for clarity, but until I hear back, I'm going to skip the unsigned->unsigned specialization attempt entirely, so we can land the customization for the existing ops.

Base automatically changed from srj/wasm-q15mulr to master on March 30, 2021 at 03:01.
steven-johnson merged commit b7bc8e2 into master on Mar 30, 2021.
steven-johnson deleted the srj/wasm-narrowing branch on March 30, 2021 at 16:48.
alexreinking added this to the v12.0.0 milestone on May 19, 2021.