i64x2.bitmask instruction #368

Maratyszcza · 2020-10-01T21:37:42Z

Introduction

This is a proposal to add a new variant of the existing bitmask instruction. The new variant extracts the highest bit of each of the two 64-bit lanes in a SIMD vector into a 32-bit integer. This variant was left out of #201 without any discussion (maybe @zeux knows why), but it would be useful both for orthogonality of the instruction set and for efficiency: x86 has supported this operation natively since SSE2, and on ARM it can be emulated more efficiently than the other bitmask variants.
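For readers who want the semantics spelled out, here is a minimal scalar sketch in C of what the paragraph above describes. The `v128_i64x2` type and the `i64x2_bitmask_ref` name are hypothetical and used only for illustration; they are not part of the proposal text.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit vector viewed as two 64-bit lanes. */
typedef struct { int64_t lane[2]; } v128_i64x2;

/* Bit i of the result is the highest (sign) bit of lane i.
   All other bits of the 32-bit result are zero. */
uint32_t i64x2_bitmask_ref(v128_i64x2 x) {
    uint32_t mask = 0;
    for (int i = 0; i < 2; ++i) {
        /* Cast to unsigned first so the shift is a logical shift. */
        mask |= (uint32_t)((uint64_t)x.lane[i] >> 63) << i;
    }
    return mask;
}
```

Under this model, a vector whose lane 0 is negative and whose lane 1 is non-negative yields 1, and the reverse yields 2.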

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instruction can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to VMOVMSKPD reg_y, xmm_x

x86/x86-64 processors with SSE2 instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to MOVMSKPD reg_y, xmm_x
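As a rough illustration of the x86 lowering, the C intrinsic `_mm_movemask_pd` corresponds to MOVMSKPD and reads the sign bit of each 64-bit lane regardless of whether the lanes hold doubles or integers. The sketch below assumes a compiler that provides `<emmintrin.h>`; the function name `bitmask_i64x2_sse2` and the scalar fallback for non-SSE2 targets are additions for illustration only.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns bit 0 = sign of lane0, bit 1 = sign of lane1. */
uint32_t bitmask_i64x2_sse2(int64_t lane0, int64_t lane1) {
#if defined(__SSE2__)
    /* _mm_set_epi64x takes the high lane first, then the low lane. */
    __m128i x = _mm_set_epi64x(lane1, lane0);
    /* Reinterpret as two doubles (no conversion) and collect sign bits. */
    return (uint32_t)_mm_movemask_pd(_mm_castsi128_pd(x));
#else
    /* Scalar fallback so the sketch still builds on other targets. */
    return (uint32_t)(((uint64_t)lane0 >> 63) |
                      (((uint64_t)lane1 >> 63) << 1));
#endif
}
```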

ARM64 processors

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to:
  - SQXTN Vtmp.2S, Vx.2D
  - USHR Vtmp.2S, Vtmp.2S, 31
  - USRA Dtmp, Dtmp, 31
  - FMOV Wy, Stmp
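To see why the four-instruction ARM64 sequence above produces the mask, it can help to model each step on plain integers. The following C sketch is only a scalar emulation under that reading of the instructions; the helper names `sqxtn_lane` and `bitmask_via_narrowing` are hypothetical.

```c
#include <stdint.h>

/* Models SQXTN on one lane: signed saturating narrow from 64 to 32 bits.
   Saturation keeps the sign of the original value. */
int32_t sqxtn_lane(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

uint32_t bitmask_via_narrowing(int64_t lane0, int64_t lane1) {
    /* SQXTN Vtmp.2S, Vx.2D */
    uint32_t t0 = (uint32_t)sqxtn_lane(lane0);
    uint32_t t1 = (uint32_t)sqxtn_lane(lane1);
    /* USHR Vtmp.2S, Vtmp.2S, 31: each 32-bit lane becomes 0 or 1. */
    t0 >>= 31;
    t1 >>= 31;
    /* The two 32-bit lanes, viewed together as one 64-bit D register. */
    uint64_t d = ((uint64_t)t1 << 32) | t0;
    /* USRA Dtmp, Dtmp, 31: d += d >> 31, which adds lane 1's bit at bit 1. */
    d += d >> 31;
    /* FMOV Wy, Stmp: take the low 32 bits. */
    return (uint32_t)d;
}
```

The saturating narrow matters here: a plain truncation would keep bit 31 of each lane rather than bit 63, so values such as `1 << 40` and `-(1 << 40)` would both map to 0.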

ARMv7 processors with NEON instruction set

i64x2.bitmask
- y = i64x2.bitmask(x) is lowered to:
  - VQMOVN.S64 Dtmp, Qx
  - VSHR.U32 Dtmp, Dtmp, 31
  - VSRA.U64 Dtmp, Dtmp, 31
  - VMOV.32 Ry, Dtmp[0]

zeux · 2020-10-01T22:08:45Z

I think this might be an accidental omission - I didn't actually know that SSE2 supports this natively, and I didn't have use cases for 64-bit masks. So it didn't occur to me to propose that because I thought SSE2 lowering would have to scalarize.


tlively · 2020-10-31T00:25:25Z

This is implemented in LLVM (but not Binaryen) as __builtin_wasm_bitmask_i64x2 and will be available in Emscripten in a few hours. I don't know if we need benchmarking for this instruction or not, though.

akirilov-arm · 2020-11-05T23:06:27Z

I'd like to propose an alternative Arm64 mapping:

ushr Vtmp.2d, Vx.2d, #63
mov Xy, Vtmp.d[0]
mov Xtmp, Vtmp.d[1]
add Wy, Wy, Wtmp, lsl #1

The main advantage of this sequence is that the middle 2 instructions are independent, so they can execute in parallel. In fact, the essentially scalarized version might execute even faster (and require only 1 temporary register), but is 1 instruction longer:

mov Xy, Vx.d[0]
mov Xtmp, Vx.d[1]
lsr Xy, Xy, #63
lsr Xtmp, Xtmp, #63
add Wy, Wy, Wtmp, lsl #1
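Both Arm64 sequences above can be modelled on scalar integers to check that they agree. This is only a sketch of the arithmetic, not a statement about timing; the function names `bitmask_vector_shift` and `bitmask_scalarized` are hypothetical, and the lanes are passed as unsigned values so that `>>` is a logical shift.

```c
#include <stdint.h>

/* Model of the 4-instruction sequence: one vector shift, two lane moves,
   then an add with a shifted operand. */
uint32_t bitmask_vector_shift(uint64_t lane0, uint64_t lane1) {
    uint64_t tmp0 = lane0 >> 63;      /* ushr Vtmp.2d, Vx.2d, #63 (lane 0) */
    uint64_t tmp1 = lane1 >> 63;      /* same instruction, lane 1 */
    uint32_t wy = (uint32_t)tmp0;     /* mov Xy, Vtmp.d[0] */
    uint32_t wtmp = (uint32_t)tmp1;   /* mov Xtmp, Vtmp.d[1] */
    return wy + (wtmp << 1);          /* add Wy, Wy, Wtmp, lsl #1 */
}

/* Model of the 5-instruction scalarized sequence: move both lanes to
   general-purpose registers first, then shift each one. */
uint32_t bitmask_scalarized(uint64_t lane0, uint64_t lane1) {
    uint64_t xy = lane0;              /* mov Xy, Vx.d[0] */
    uint64_t xtmp = lane1;            /* mov Xtmp, Vx.d[1] */
    xy >>= 63;                        /* lsr Xy, Xy, #63 */
    xtmp >>= 63;                      /* lsr Xtmp, Xtmp, #63 */
    return (uint32_t)xy + ((uint32_t)xtmp << 1);  /* add with lsl #1 */
}
```

In both models the two lane moves do not depend on each other, which is the property the comment above points to.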

ngzhian · 2020-11-06T00:21:31Z

Forgot to say that this is prototyped in v8 (x64) https://chromium.googlesource.com/v8/v8/+/ceee7cfe7260152fd90c66657b8476b9d3a8b915

ngzhian · 2020-11-23T08:10:29Z

@akirilov-arm would you suggest a similar mapping for ARMv7 as well?

akirilov-arm · 2020-11-24T14:17:57Z

@ngzhian Something like this (tmp2 = tmp * 2, tmp3 = tmp * 2 + 1):

vshr.u64 Qtmp, Qx, #63
vmov.32 Ry, Dtmp2[0]
vmov.32 Rtmp, Dtmp3[0]
add Ry, Ry, Rtmp, lsl #1

Note that I haven't tested the sequence and I am also not sure about its performance characteristics - extra latency may crop up due to the SIMD & FP register overlapping rules in AArch32.
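The register naming in this ARMv7 sequence relies on each Q register aliasing two consecutive D registers (Q number n covers D numbers 2n and 2n + 1). A small C model of that layout, with a hypothetical `q_reg` type and function name, might look like the following; like the sequence itself, it says nothing about performance.

```c
#include <stdint.h>

/* Hypothetical model of a Q register: d[0] is the low D half (D 2n),
   d[1] is the high D half (D 2n+1). */
typedef struct { uint64_t d[2]; } q_reg;

uint32_t bitmask_armv7_model(q_reg qx) {
    q_reg qtmp;
    /* First instruction: shift both 64-bit lanes right by 63. */
    qtmp.d[0] = qx.d[0] >> 63;
    qtmp.d[1] = qx.d[1] >> 63;
    /* vmov.32 Ry, Dtmp2[0]: low 32 bits of the low D half. */
    uint32_t ry = (uint32_t)qtmp.d[0];
    /* vmov.32 Rtmp, Dtmp3[0]: low 32 bits of the high D half. */
    uint32_t rtmp = (uint32_t)qtmp.d[1];
    /* add Ry, Ry, Rtmp, lsl #1 */
    return ry + (rtmp << 1);
}
```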


ngzhian · 2020-12-23T02:34:45Z

Prototyped on arm64 as well

ngzhian · 2021-01-11T06:15:51Z

proposals/simd/BinarySIMD.md

@@ -30,6 +30,7 @@ In the description below, `ImmLaneIdx{I}` indicates the maximum value of the byt
 For example, `ImmLaneIdx16` is a byte with values in the range 0-15 (inclusive).


+<<<<<<< HEAD


merge marker

This was accepted into this proposal in #410.


tlively added a commit to llvm/llvm-project that referenced this pull request Oct 31, 2020

[WebAssembly] Prototype i64x2.bitmask

a787e09

As proposed in WebAssembly/simd#368. Differential Revision: https://reviews.llvm.org/D90514

tlively mentioned this pull request Dec 11, 2020

Prototype SIMD instructions implemented in LLVM WebAssembly/binaryen#3440

Merged

Maratyszcza force-pushed the bitmask-64bit branch from 6073601 to ff9a549 Compare December 23, 2020 20:06

This was referenced Jan 8, 2021

Agenda for sync meeting 1/8/2021 #410

Closed

Tracking instructions with unassigned opcodes #421

Closed

ngzhian reviewed Jan 11, 2021


ngzhian added a commit to ngzhian/simd that referenced this pull request Jan 11, 2021

Implement i64x2.bitmask (WebAssembly#368)

961e583

This was accepted into this proposal in WebAssembly#410.

i64x2.bitmask instruction

58883fa

Maratyszcza force-pushed the bitmask-64bit branch from ff9a549 to 58883fa Compare January 11, 2021 17:48

tlively approved these changes Jan 11, 2021


dtig approved these changes Jan 11, 2021


dtig merged commit dc1646a into WebAssembly:master Jan 11, 2021

abrown mentioned this pull request Jan 11, 2021

i64x2.all_true instructions #415

Merged

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021

Implement i64x2.bitmask (WebAssembly#368)

8ba1f36

This was accepted into this proposal in WebAssembly#410.

ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021

Implement i64x2.bitmask (WebAssembly#368)

2abcc9f

This was accepted into this proposal in WebAssembly#410.

ngzhian added a commit that referenced this pull request Feb 4, 2021

Implement i64x2.bitmask (#368)

ef06db0

This was accepted into this proposal in #410.

arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021

[WebAssembly] Prototype i64x2.bitmask

75c6d24

As proposed in WebAssembly/simd#368. Differential Revision: https://reviews.llvm.org/D90514

nemequ mentioned this pull request May 17, 2021

Add optimized implementations to WASM SIMD128 simd-everywhere/simde#776

Open
