I think this might be an accidental omission - I didn't actually know that SSE2 supports this natively, and I didn't have use cases for 64-bit masks. So it didn't occur to me to propose that because I thought SSE2 lowering would have to scalarize.
As proposed in WebAssembly/simd#368. Differential Revision: https://reviews.llvm.org/D90514
This is implemented in LLVM (but not Binaryen) as |
I'd like to propose an alternative Arm64 mapping:
The main advantage of this sequence is that the middle 2 instructions are independent, so they can execute in parallel. In fact, the essentially scalarized version might execute even faster (and require only 1 temporary register), but is 1 instruction longer:
Forgot to say that this is prototyped in v8 (x64): https://chromium.googlesource.com/v8/v8/+/ceee7cfe7260152fd90c66657b8476b9d3a8b915
@akirilov-arm would you suggest a similar mapping for ARMv7 as well?
@ngzhian Something like this (
Note that I haven't tested the sequence and I am also not sure about its performance characteristics - extra latency may crop up due to the SIMD & FP register overlapping rules in AArch32.
- i64x2.eq (WebAssembly/simd#381) - i64x2 widens (WebAssembly/simd#290) - i64x2.bitmask (WebAssembly/simd#368) - signselect ops (WebAssembly/simd#124)
Prototyped on arm64 as well.
proposals/simd/BinarySIMD.md
@@ -30,6 +30,7 @@ In the description below, `ImmLaneIdx{I}` indicates the maximum value of the byt
For example, `ImmLaneIdx16` is a byte with values in the range 0-15 (inclusive).
<<<<<<< HEAD
merge marker
Fixed
This was accepted into this proposal in WebAssembly#410.
Introduction
This is a proposal to add a new variant of the existing `bitmask` instruction. The new variant extracts the highest bit of each of the two 64-bit lanes in a SIMD vector into a 32-bit integer. This variant was left out of #201 without any discussion (maybe @zeux knows why), but would be useful both for orthogonality of the instruction set and for efficiency: x86 has supported this instruction natively since SSE2, and on ARM it can be emulated more efficiently than the other `bitmask` variants.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
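As a reference point for the lowerings in this section, the intended semantics can be sketched in scalar C. This is only an illustration, not normative text: the `v128_i64x2` type and the function names are hypothetical, and the sketch assumes bit 0 of the result comes from lane 0 and bit 1 from lane 1, with all higher bits zero, as in the existing `bitmask` variants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of a 128-bit vector as two 64-bit lanes (lane[0] is the low lane). */
typedef struct { uint64_t lane[2]; } v128_i64x2;

/* Scalar model: collect the top (sign) bit of each lane into bits 0 and 1. */
static uint32_t i64x2_bitmask_ref(v128_i64x2 x) {
    uint32_t bit0 = (uint32_t)(x.lane[0] >> 63);
    uint32_t bit1 = (uint32_t)(x.lane[1] >> 63);
    return bit0 | (bit1 << 1);
}

/* A few sample inputs; only bit 63 of each lane affects the result. */
static void i64x2_bitmask_ref_examples(void) {
    v128_i64x2 none       = {{ 0, 1 }};
    v128_i64x2 both       = {{ UINT64_MAX, 0x8000000000000000ULL }};
    v128_i64x2 only_lane1 = {{ 0x7FFFFFFFFFFFFFFFULL, UINT64_MAX }};
    assert(i64x2_bitmask_ref(none) == 0);
    assert(i64x2_bitmask_ref(both) == 3);
    assert(i64x2_bitmask_ref(only_lane1) == 2);
}
```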
x86/x86-64 processors with AVX instruction set
`y = i64x2.bitmask(x)` is lowered to `VMOVMSKPD reg_y, xmm_x`
x86/x86-64 processors with SSE2 instruction set
`y = i64x2.bitmask(x)` is lowered to `MOVMSKPD reg_y, xmm_x`
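`MOVMSKPD` is defined on double-precision lanes, but it only reads the sign bit of each 64-bit element, which is why it can serve integer lanes unchanged. A minimal sketch with the SSE2 intrinsic `_mm_movemask_pd` follows. The function names are hypothetical, and the scalar fallback is our addition so that the example also builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical wrapper: builds the vector from two lanes and applies MOVMSKPD. */
static uint32_t i64x2_bitmask_sse2(uint64_t lane0, uint64_t lane1) {
    /* _mm_set_epi64x takes the high lane first, then the low lane. */
    __m128i v = _mm_set_epi64x((long long)lane1, (long long)lane0);
    /* The cast only reinterprets the bits as two doubles; no conversion happens. */
    return (uint32_t)_mm_movemask_pd(_mm_castsi128_pd(v));
}
#else
/* Scalar fallback for targets without SSE2 (an assumption of this sketch). */
static uint32_t i64x2_bitmask_sse2(uint64_t lane0, uint64_t lane1) {
    return (uint32_t)(lane0 >> 63) | ((uint32_t)(lane1 >> 63) << 1);
}
#endif

static void i64x2_bitmask_sse2_examples(void) {
    assert(i64x2_bitmask_sse2(0, 0) == 0);
    assert(i64x2_bitmask_sse2(UINT64_MAX, 0) == 1);
    assert(i64x2_bitmask_sse2(0, 0x8000000000000000ULL) == 2);
    assert(i64x2_bitmask_sse2(UINT64_MAX, UINT64_MAX) == 3);
}
```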
ARM64 processors
`y = i64x2.bitmask(x)` is lowered to:
- `SQXTN Vtmp.2S, Vx.2D`
- `USHR Vtmp.2S, Vtmp.2S, 31`
- `USRA Dtmp, Dtmp, 31`
- `FMOV Wy, Stmp`
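To see why this four-instruction sequence yields the two sign bits, the arithmetic can be modelled step by step in scalar C. This is a sketch of the data flow only, not of how an engine would emit code, and the helper names are hypothetical. The saturating narrow matters here: a plain truncation would take bit 31 of each lane rather than its sign.

```c
#include <assert.h>
#include <stdint.h>

/* Models SQXTN on one lane: signed saturating narrow from 64 to 32 bits. */
static uint32_t sqxtn_lane(uint64_t lane) {
    int64_t s = (int64_t)lane;
    if (s > INT32_MAX) s = INT32_MAX;
    if (s < INT32_MIN) s = INT32_MIN;
    return (uint32_t)(int32_t)s;  /* sign of the 64-bit lane is preserved */
}

static uint32_t i64x2_bitmask_arm64_model(uint64_t lane0, uint64_t lane1) {
    /* SQXTN Vtmp.2S, Vx.2D: two 32-bit lanes that keep the original signs. */
    uint32_t t0 = sqxtn_lane(lane0);
    uint32_t t1 = sqxtn_lane(lane1);
    /* USHR Vtmp.2S, Vtmp.2S, 31: each 32-bit lane becomes 0 or 1. */
    t0 >>= 31;
    t1 >>= 31;
    /* USRA Dtmp, Dtmp, 31: treat both lanes as one 64-bit value d and
       compute d += d >> 31, which moves lane 1's bit down to bit 1. */
    uint64_t d = (uint64_t)t0 | ((uint64_t)t1 << 32);
    d += d >> 31;
    /* FMOV Wy, Stmp: take the low 32 bits. */
    return (uint32_t)d;
}

static void i64x2_bitmask_arm64_model_examples(void) {
    assert(i64x2_bitmask_arm64_model(0, 0) == 0);
    assert(i64x2_bitmask_arm64_model(UINT64_MAX, 0) == 1);
    assert(i64x2_bitmask_arm64_model(0, 0x8000000000000000ULL) == 2);
    assert(i64x2_bitmask_arm64_model(UINT64_MAX, UINT64_MAX) == 3);
    /* Positive lanes with bit 31 set must not count as negative. */
    assert(i64x2_bitmask_arm64_model(0xFFFFFFFFULL, 0x80000000ULL) == 0);
}
```

The ARMv7 NEON sequence uses the same four steps with the AArch32 spellings of these instructions.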
ARMv7 processors with NEON instruction set
`y = i64x2.bitmask(x)` is lowered to:
- `VQMOVN.S64 Dtmp, Qx`
- `VSHR.U32 Dtmp, Dtmp, 31`
- `VSRA.U64 Dtmp, Dtmp, 31`
- `VMOV.32 Ry, Dtmp[0]`