This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.gt_u, i64x2.lt_u, i64x2.ge_u, and i64x2.le_u instructions #414

Closed

Conversation

Maratyszcza commented Dec 29, 2020

Introduction

This is a proposal to add 64-bit variants of the existing gt_u, lt_u, ge_u, and le_u instructions. ARM64 and x86-64 with the XOP extension support these comparisons natively, but on other instruction sets they need to be emulated. On SSE4.2 the emulation costs 5-6 instructions, while on older SSE extensions and on ARMv7 NEON the emulation cost is more significant.
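
For reference, the proposed instructions follow the semantics of the existing comparison instructions: each 64-bit lane of the result is all ones when the predicate holds for that lane and all zeros otherwise. A minimal scalar sketch in C (the type and helper name are illustrative, not part of the proposal):

```c
#include <stdint.h>

typedef struct { uint64_t lanes[2]; } v128_u64;   /* illustrative view of a v128 value */

/* i64x2.gt_u reference semantics: all-ones lane when a > b (unsigned), else all zeros. */
static v128_u64 i64x2_gt_u(v128_u64 a, v128_u64 b) {
  v128_u64 y;
  for (int i = 0; i < 2; i++) {
    y.lanes[i] = (a.lanes[i] > b.lanes[i]) ? UINT64_MAX : 0;
  }
  return y;
}

/* lt_u, ge_u, and le_u differ only in the comparison operator. */
```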

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code generation.

x86/x86-64 processors with AVX512F, AVX512DQ, and AVX512VL instruction sets

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 6
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 1
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 5
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 2
      • VPMOVM2Q xmm_y, k_tmp
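
For illustration, the AVX-512 lowering above corresponds to the following intrinsics (a sketch assuming AVX512F, AVX512DQ, and AVX512VL are all available; the wrapper name is hypothetical):

```c
#include <immintrin.h>

/* i64x2.gt_u on AVX-512: unsigned 64-bit compare into a mask register (VPCMPUQ),
   then expand the mask into an all-ones/all-zeros vector (VPMOVM2Q). */
static inline __m128i i64x2_gt_u_avx512(__m128i a, __m128i b) {
  __mmask8 k = _mm_cmpgt_epu64_mask(a, b);
  return _mm_movm_epi64(k);
}

/* lt_u, ge_u, and le_u use _mm_cmplt_epu64_mask, _mm_cmpge_epu64_mask,
   and _mm_cmple_epu64_mask in the same pattern. */
```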

x86/x86-64 processors with XOP instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to VPCOMGTUQ xmm_y, xmm_a, xmm_b
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to VPCOMLTUQ xmm_y, xmm_a, xmm_b
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to VPCOMGEUQ xmm_y, xmm_a, xmm_b
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to VPCOMLEUQ xmm_y, xmm_a, xmm_b

x86/x86-64 processors with AVX instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_tmp, xmm_y
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_tmp, xmm_y
      • VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
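
Expressed with intrinsics, the sign-bias trick above looks as follows (a sketch; the wrapper name is hypothetical, and the compiler picks the VEX-encoded forms shown above when AVX is enabled):

```c
#include <immintrin.h>

/* i64x2.gt_u without a native unsigned 64-bit compare: flip the sign bit of both
   operands, then reuse the signed compare PCMPGTQ/VPCMPGTQ. */
static inline __m128i i64x2_gt_u_avx(__m128i a, __m128i b) {
  const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
  return _mm_cmpgt_epi64(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* lt_u swaps the compare operands; ge_u and le_u additionally complement the
   resulting mask. */
```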

x86/x86-64 processors with SSE4.2 instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_a
      • PXOR xmm_tmp, xmm_b
      • PCMPGTQ xmm_y, xmm_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_b
      • PXOR xmm_tmp, xmm_a
      • PCMPGTQ xmm_y, xmm_tmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_b
      • PXOR xmm_tmp, xmm_a
      • PCMPGTQ xmm_y, xmm_tmp
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_a
      • PXOR xmm_tmp, xmm_b
      • PCMPGTQ xmm_y, xmm_tmp
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
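
The ge_u and le_u variants add one final XOR with an all-ones vector to complement the mask; in intrinsics (a sketch compiled with SSE4.2 enabled; the wrapper name is hypothetical):

```c
#include <immintrin.h>

/* i64x2.ge_u on SSE4.2: compute a < b via the sign-bias trick, then invert the mask. */
static inline __m128i i64x2_ge_u_sse42(__m128i a, __m128i b) {
  const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
  __m128i lt = _mm_cmpgt_epi64(_mm_xor_si128(b, bias), _mm_xor_si128(a, bias));
  return _mm_xor_si128(lt, _mm_set1_epi32(-1));   /* NOT(a < b) == a >= b */
}
```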

x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_a
      • MOVDQA xmm_y, xmm_a
      • PSUBQ xmm_tmp, xmm_b
      • PXOR xmm_y, xmm_b
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_a
      • PANDN xmm_tmp, xmm_b
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_a
      • MOVDQA xmm_y, xmm_a
      • PSUBQ xmm_tmp, xmm_b
      • PXOR xmm_y, xmm_b
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_a
      • PANDN xmm_tmp, xmm_b
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
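
The same SSE2 sequence in intrinsics (a sketch; the extra register copies in the listing above are left to the compiler, and the wrapper name is hypothetical):

```c
#include <emmintrin.h>

/* i64x2.gt_u with SSE2 only, following aqrit's construction: when the operands'
   high bits differ, (a & ~b) decides the result; when they agree, the sign bit
   of (b - a) decides it. The deciding sign bit is then broadcast across the lane. */
static inline __m128i i64x2_gt_u_sse2(__m128i a, __m128i b) {
  __m128i diff  = _mm_sub_epi64(b, a);                          /* b - a */
  __m128i mixed = _mm_andnot_si128(_mm_xor_si128(a, b), diff);  /* ~(a ^ b) & (b - a) */
  __m128i high  = _mm_andnot_si128(b, a);                       /* a & ~b */
  __m128i sign  = _mm_or_si128(mixed, high);
  sign = _mm_srai_epi32(sign, 31);                              /* sign of each 32-bit half */
  return _mm_shuffle_epi32(sign, _MM_SHUFFLE(3, 3, 1, 1));      /* replicate the high halves */
}

/* lt_u swaps a and b; ge_u and le_u complement the final mask. */
```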

ARM64 processors

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to CMHI Vy.2D, Va.2D, Vb.2D
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to CMHI Vy.2D, Vb.2D, Va.2D
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to CMHS Vy.2D, Va.2D, Vb.2D
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to CMHS Vy.2D, Vb.2D, Va.2D
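
In NEON intrinsics (a sketch for AArch64 targets; the wrapper names are hypothetical):

```c
#include <arm_neon.h>

/* i64x2.gt_u and i64x2.le_u map directly onto CMHI and CMHS. */
static inline uint64x2_t i64x2_gt_u_arm64(uint64x2_t a, uint64x2_t b) {
  return vcgtq_u64(a, b);   /* CMHI Vy.2D, Va.2D, Vb.2D */
}
static inline uint64x2_t i64x2_le_u_arm64(uint64x2_t a, uint64x2_t b) {
  return vcgeq_u64(b, a);   /* CMHS Vy.2D, Vb.2D, Va.2D */
}
```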

ARMv7 processors with NEON instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VCGT.U32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VORR Qy, Qy, Qtmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qb, Qa
      • VCGT.U32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VORR Qy, Qy, Qtmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qb, Qa
      • VCEQ.I32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VAND Qy, Qy, Qtmp
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VCEQ.I32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VAND Qy, Qy, Qtmp
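
With NEON intrinsics the same idea reads as follows (a sketch; the wrapper name is hypothetical, and a real implementation may keep a zero register live instead of materializing one):

```c
#include <arm_neon.h>

/* i64x2.gt_u on ARMv7 NEON, which has no 64-bit compares: the saturating
   unsigned subtraction a - b is nonzero exactly when a > b, and the
   "lane is nonzero" test is widened from the 32-bit halves to the full lane. */
static inline uint64x2_t i64x2_gt_u_neon(uint64x2_t a, uint64x2_t b) {
  uint32x4_t nz = vreinterpretq_u32_u64(vqsubq_u64(a, b));  /* VQSUB.U64 */
  uint32x4_t m  = vcgtq_u32(nz, vdupq_n_u32(0));            /* each half: != 0 */
  uint32x4_t s  = vrev64q_u32(m);                           /* swap the halves */
  return vreinterpretq_u64_u32(vorrq_u32(m, s));            /* either half nonzero */
}

/* ge_u instead checks that the saturated difference b - a is zero in both halves
   (vceqq_u32 against zero, then vrev64q_u32 and vandq_u32). */
```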

abrown commented Jan 11, 2021

See #412 (comment).

dtig commented Jan 25, 2021

Adding a preliminary vote for the inclusion of i64x2 unsigned comparison operations in the SIMD proposal below. Please vote with:

👍 For including i64x2 unsigned comparison operations
👎 Against including i64x2 unsigned comparison operations

@ngzhian ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@dtig dtig added needs discussion Proposal with an unclear resolution and removed 2021-01-29 Agenda for sync meeting 1/29/21 labels Feb 2, 2021

dtig commented Mar 5, 2021

Closing as per #436.
