This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.bitmask instruction #368

Merged 1 commit on Jan 11, 2021

Contributor

@Maratyszcza Maratyszcza commented Oct 1, 2020

Introduction

This is a proposal to add a new variant of the existing bitmask instruction. The new variant extracts the highest bit of each of the two 64-bit lanes in a SIMD vector into a 32-bit integer. This variant was left out of #201 without any discussion (maybe @zeux knows why), but would be useful both for orthogonality of the instruction set and for efficiency: x86 has supported this instruction natively since SSE2, and on ARM it can be emulated more efficiently than other bitmask variants.
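
As a rough illustration of the semantics described above, here is a scalar Python sketch. The function name `i64x2_bitmask` and the representation of the vector as a list of two integers are assumptions made for this example; they are not taken from the proposal text.

```python
MASK64 = (1 << 64) - 1

def i64x2_bitmask(lanes):
    """Scalar model: bit i of the result is the top bit of 64-bit lane i."""
    if len(lanes) != 2:
        raise ValueError("expected exactly two 64-bit lanes")
    result = 0
    for i, lane in enumerate(lanes):
        lane &= MASK64               # treat the lane as an unsigned 64-bit pattern
        result |= (lane >> 63) << i  # move the sign bit into result bit i
    return result
```

Under this model only result bits 0 and 1 can ever be set, so the 32-bit result is always in the range 0 to 3.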

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • i64x2.bitmask
    • y = i64x2.bitmask(x) is lowered to VMOVMSKPD reg_y, xmm_x

x86/x86-64 processors with SSE2 instruction set

  • i64x2.bitmask
    • y = i64x2.bitmask(x) is lowered to MOVMSKPD reg_y, xmm_x

ARM64 processors

  • i64x2.bitmask
    • y = i64x2.bitmask(x) is lowered to:
      • SQXTN Vtmp.2S, Vx.2D
      • USHR Vtmp.2S, Vtmp.2S, 31
      • USRA Dtmp, Dtmp, 31
      • FMOV Wy, Stmp
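
To make the ARM64 sequence above easier to follow, here is a scalar Python sketch that mimics each instruction on plain integers. The helper names (`to_signed64`, `sqxtn_2s`, `arm64_bitmask`) are hypothetical, and the model only approximates register behaviour; it is not output from any compiler or engine.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def to_signed64(x):
    """Reinterpret a 64-bit pattern as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def sqxtn_2s(lanes64):
    """Model of SQXTN: saturate each signed 64-bit lane to signed 32 bits."""
    out = []
    for lane in lanes64:
        v = to_signed64(lane)
        v = max(-(1 << 31), min((1 << 31) - 1, v))
        out.append(v & MASK32)
    return out

def arm64_bitmask(lanes64):
    lo, hi = sqxtn_2s(lanes64)      # SQXTN Vtmp.2S, Vx.2D
    lo, hi = lo >> 31, hi >> 31     # USHR Vtmp.2S, Vtmp.2S, 31
    d = (hi << 32) | lo             # view both 32-bit lanes as one 64-bit Dtmp
    d = (d + (d >> 31)) & MASK64    # USRA Dtmp, Dtmp, 31
    return d & MASK32               # FMOV Wy, Stmp
```

The saturating narrow matters here: it keeps the sign of each 64-bit lane in bit 31 of the narrowed lane, whereas a plain truncation would keep bit 31 of the original value instead. The final shift-and-accumulate then folds the upper lane's bit (at position 32) down to position 1.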

ARMv7 processors with NEON instruction set

  • i64x2.bitmask
    • y = i64x2.bitmask(x) is lowered to:
      • VQMOVN.S64 Dtmp, Qx
      • VSHR.U32 Dtmp, Dtmp, 31
      • VSRA.U64 Dtmp, Dtmp, 31
      • VMOV.32 Ry, Dtmp[0]

@zeux
Contributor

zeux commented Oct 1, 2020

I think this might be an accidental omission - I didn't actually know that SSE2 supports this natively, and I didn't have use cases for 64-bit masks. So it didn't occur to me to propose that because I thought SSE2 lowering would have to scalarize.

tlively added a commit to llvm/llvm-project that referenced this pull request Oct 31, 2020
@tlively
Member

tlively commented Oct 31, 2020

This is implemented in LLVM (but not Binaryen) as __builtin_wasm_bitmask_i64x2 and will be available in Emscripten in a few hours. I don't know if we need benchmarking for this instruction or not, though.

@akirilov-arm

I'd like to propose an alternative Arm64 mapping:

ushr Vtmp.2d, Vx.2d, #63
mov Xy, Vtmp.d[0]
mov Xtmp, Vtmp.d[1]
add Wy, Wy, Wtmp, lsl #1

The main advantage of this sequence is that the middle 2 instructions are independent, so they can execute in parallel. In fact, the essentially scalarized version might execute even faster (and require only 1 temporary register), but is 1 instruction longer:

mov Xy, Vx.d[0]
mov Xtmp, Vx.d[1]
lsr Xy, Xy, #63
lsr Xtmp, Xtmp, #63
add Wy, Wy, Wtmp, lsl #1
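
Both Arm64 sequences above can be modelled in a few lines of Python to check that they compute the same mask. This is only a sketch on plain integers; the function names `vector_shift_variant` and `scalarized_variant` are invented for this example.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def vector_shift_variant(lanes64):
    # ushr Vtmp.2d, Vx.2d, #63: each 64-bit lane becomes 0 or 1
    tmp = [(lane & MASK64) >> 63 for lane in lanes64]
    y = tmp[0]                      # mov Xy, Vtmp.d[0]
    t = tmp[1]                      # mov Xtmp, Vtmp.d[1]
    return (y + (t << 1)) & MASK32  # add Wy, Wy, Wtmp, lsl #1

def scalarized_variant(lanes64):
    y = lanes64[0] & MASK64         # mov Xy, Vx.d[0]
    t = lanes64[1] & MASK64         # mov Xtmp, Vx.d[1]
    y >>= 63                        # lsr Xy, Xy, #63
    t >>= 63                        # lsr Xtmp, Xtmp, #63
    return (y + (t << 1)) & MASK32  # add Wy, Wy, Wtmp, lsl #1
```

In both variants the two lane extractions do not depend on each other, which is the property the comment above points to; the model says nothing about actual latency or throughput on real hardware.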

@ngzhian
Member

ngzhian commented Nov 6, 2020

Forgot to say that this is prototyped in v8 (x64) https://chromium.googlesource.com/v8/v8/+/ceee7cfe7260152fd90c66657b8476b9d3a8b915

@ngzhian
Member

ngzhian commented Nov 23, 2020

@akirilov-arm would you suggest a similar mapping for ARMv7 as well?

@akirilov-arm

@ngzhian Something like this (tmp2 = tmp * 2, tmp3 = tmp * 2 + 1):

vshr.u64 Qtmp, Qx, #63
vmov.32 Ry, Dtmp2[0]
vmov.32 Rtmp, Dtmp3[0]
add Ry, Ry, Rtmp, lsl #1

Note that I haven't tested the sequence and I am also not sure about its performance characteristics - extra latency may crop up due to the SIMD & FP register overlapping rules in AArch32.
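
The register-numbering note in the comment above (tmp2 = tmp * 2, tmp3 = tmp * 2 + 1) reflects how each AArch32 NEON Q register aliases a pair of D registers. A small Python sketch of that mapping and of the proposed sequence follows; the function names are hypothetical and, like the assembly it mirrors, the model says nothing about performance.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def d_registers_of(q_index):
    """Qn aliases the D-register pair D(2n), D(2n+1)."""
    return 2 * q_index, 2 * q_index + 1

def armv7_bitmask(lanes64):
    # 64-bit unsigned right shift by 63 on both halves of Qtmp:
    # each half (one D register) becomes 0 or 1
    d_lo, d_hi = [(lane & MASK64) >> 63 for lane in lanes64]
    ry = d_lo & MASK32                  # vmov.32 Ry, Dtmp2[0]
    rtmp = d_hi & MASK32                # vmov.32 Rtmp, Dtmp3[0]
    return (ry + (rtmp << 1)) & MASK32  # add Ry, Ry, Rtmp, lsl #1
```

For example, if the temporary were allocated to Q3, the two moves would read from D6 and D7.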

tlively added a commit to tlively/binaryen that referenced this pull request Dec 11, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Dec 12, 2020
@ngzhian
Member

ngzhian commented Dec 23, 2020

Prototyped on arm64 as well

@@ -30,6 +30,7 @@ In the description below, `ImmLaneIdx{I}` indicates the maximum value of the byt
For example, `ImmLaneIdx16` is a byte with values in the range 0-15 (inclusive).


<<<<<<< HEAD
Member

merge marker

Contributor Author

Fixed

ngzhian added a commit to ngzhian/simd that referenced this pull request Jan 11, 2021
This was accepted into this proposal in WebAssembly#410.
@dtig dtig merged commit dc1646a into WebAssembly:master Jan 11, 2021
ngzhian added a commit to ngzhian/simd that referenced this pull request Feb 3, 2021
This was accepted into this proposal in WebAssembly#410.
ngzhian added a commit that referenced this pull request Feb 4, 2021
This was accepted into this proposal in #410.
arichardson pushed a commit to arichardson/llvm-project that referenced this pull request Mar 25, 2021