This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

i64x2.gt_u, i64x2.lt_u, i64x2.ge_u, and i64x2.le_u instructions #414

Closed

Conversation

Maratyszcza commented Dec 29, 2020

Introduction

This is a proposal to add 64-bit variants of the existing gt_u, lt_u, ge_u, and le_u instructions. ARM64 and x86-64 with the XOP extension support these comparisons natively, but on other instruction sets they need to be emulated. On SSE4.2 the emulation costs 5-6 instructions, while on older SSE extensions and on ARMv7 NEON the emulation cost is more significant.
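
For reference, the proposed instructions follow the semantics of the existing comparison instructions: each 64-bit lane of the result is all ones when the predicate holds for that lane and all zeros otherwise. A minimal scalar sketch in C (the type and helper name are illustrative, not part of the proposal):

```c
#include <stdint.h>

typedef struct { uint64_t lanes[2]; } v128_u64;   /* illustrative view of a v128 value */

/* i64x2.gt_u reference semantics: all-ones lane when a > b (unsigned), else all zeros. */
static v128_u64 i64x2_gt_u(v128_u64 a, v128_u64 b) {
  v128_u64 y;
  for (int i = 0; i < 2; i++) {
    y.lanes[i] = (a.lanes[i] > b.lanes[i]) ? UINT64_MAX : 0;
  }
  return y;
}

/* lt_u, ge_u, and le_u differ only in the comparison operator. */
```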

Applications

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations are not required to follow the same code generation.

x86/x86-64 processors with AVX512F, AVX512DQ, and AVX512VL instruction sets

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 6
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 1
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 5
      • VPMOVM2Q xmm_y, k_tmp
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to:
      • VPCMPUQ k_tmp, xmm_a, xmm_b, 2
      • VPMOVM2Q xmm_y, k_tmp
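
For illustration, the AVX-512 lowering above corresponds to the following intrinsics (a sketch assuming AVX512F, AVX512DQ, and AVX512VL are all available; the wrapper name is hypothetical):

```c
#include <immintrin.h>

/* i64x2.gt_u on AVX-512: unsigned 64-bit compare into a mask register (VPCMPUQ),
   then expand the mask into an all-ones/all-zeros vector (VPMOVM2Q). */
static inline __m128i i64x2_gt_u_avx512(__m128i a, __m128i b) {
  __mmask8 k = _mm_cmpgt_epu64_mask(a, b);
  return _mm_movm_epi64(k);
}

/* lt_u, ge_u, and le_u use _mm_cmplt_epu64_mask, _mm_cmpge_epu64_mask,
   and _mm_cmple_epu64_mask in the same pattern. */
```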

x86/x86-64 processors with XOP instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to VPCOMGTUQ xmm_y, xmm_a, xmm_b
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to VPCOMLTUQ xmm_y, xmm_a, xmm_b
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to VPCOMGEUQ xmm_y, xmm_a, xmm_b
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to VPCOMLEUQ xmm_y, xmm_a, xmm_b

x86/x86-64 processors with AVX instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_tmp, xmm_y
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_tmp, xmm_y
      • VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not b) is lowered to:
      • VMOVDQA xmm_tmp, [wasm_i64x2_splat(0x8000000000000000)]
      • VPXOR xmm_y, xmm_a, xmm_tmp
      • VPXOR xmm_tmp, xmm_b, xmm_tmp
      • VPCMPGTQ xmm_y, xmm_y, xmm_tmp
      • VPXOR xmm_y, xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
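
Expressed with intrinsics, the sign-bias trick above looks as follows (a sketch; the wrapper name is hypothetical, and the compiler picks the VEX-encoded forms shown above when AVX is enabled):

```c
#include <immintrin.h>

/* i64x2.gt_u without a native unsigned 64-bit compare: flip the sign bit of both
   operands, then reuse the signed compare PCMPGTQ/VPCMPGTQ. */
static inline __m128i i64x2_gt_u_avx(__m128i a, __m128i b) {
  const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
  return _mm_cmpgt_epi64(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* lt_u swaps the compare operands; ge_u and le_u additionally complement the
   resulting mask. */
```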

x86/x86-64 processors with SSE4.2 instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_a
      • PXOR xmm_tmp, xmm_b
      • PCMPGTQ xmm_y, xmm_tmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_b
      • PXOR xmm_tmp, xmm_a
      • PCMPGTQ xmm_y, xmm_tmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_b
      • PXOR xmm_tmp, xmm_a
      • PCMPGTQ xmm_y, xmm_tmp
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_y, [wasm_i64x2_splat(0x8000000000000000)]
      • MOVDQA xmm_tmp, xmm_y
      • PXOR xmm_y, xmm_a
      • PXOR xmm_tmp, xmm_b
      • PCMPGTQ xmm_y, xmm_tmp
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
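
The ge_u and le_u variants add one final XOR with an all-ones vector to complement the mask; in intrinsics (a sketch compiled with SSE4.2 enabled; the wrapper name is hypothetical):

```c
#include <immintrin.h>

/* i64x2.ge_u on SSE4.2: compute a < b via the sign-bias trick, then invert the mask. */
static inline __m128i i64x2_ge_u_sse42(__m128i a, __m128i b) {
  const __m128i bias = _mm_set1_epi64x((long long)0x8000000000000000ULL);
  __m128i lt = _mm_cmpgt_epi64(_mm_xor_si128(b, bias), _mm_xor_si128(a, bias));
  return _mm_xor_si128(lt, _mm_set1_epi32(-1));   /* NOT(a < b) == a >= b */
}
```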

x86/x86-64 processors with SSE2 instruction set

Based on this answer by user aqrit on Stack Overflow

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_a
      • MOVDQA xmm_y, xmm_a
      • PSUBQ xmm_tmp, xmm_b
      • PXOR xmm_y, xmm_b
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_a
      • PANDN xmm_tmp, xmm_b
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_a
      • MOVDQA xmm_y, xmm_a
      • PSUBQ xmm_tmp, xmm_b
      • PXOR xmm_y, xmm_b
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_a
      • PANDN xmm_tmp, xmm_b
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
  • i64x2.le_u
    • y = i64x2.le_u(a, b) (y is not a and y is not b) is lowered to:
      • MOVDQA xmm_tmp, xmm_b
      • MOVDQA xmm_y, xmm_b
      • PSUBQ xmm_tmp, xmm_a
      • PXOR xmm_y, xmm_a
      • PANDN xmm_y, xmm_tmp
      • MOVDQA xmm_tmp, xmm_b
      • PANDN xmm_tmp, xmm_a
      • POR xmm_y, xmm_tmp
      • PSRAD xmm_y, 31
      • PSHUFD xmm_y, xmm_y, 0xF5
      • PXOR xmm_y, [wasm_i32x4_splat(0xFFFFFFFF)]
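
The same SSE2 sequence in intrinsics (a sketch; the extra register copies in the listing above are left to the compiler, and the wrapper name is hypothetical):

```c
#include <emmintrin.h>

/* i64x2.gt_u with SSE2 only, following aqrit's construction: when the operands'
   high bits differ, (a & ~b) decides the result; when they agree, the sign bit
   of (b - a) decides it. The deciding sign bit is then broadcast across the lane. */
static inline __m128i i64x2_gt_u_sse2(__m128i a, __m128i b) {
  __m128i diff  = _mm_sub_epi64(b, a);                          /* b - a */
  __m128i mixed = _mm_andnot_si128(_mm_xor_si128(a, b), diff);  /* ~(a ^ b) & (b - a) */
  __m128i high  = _mm_andnot_si128(b, a);                       /* a & ~b */
  __m128i sign  = _mm_or_si128(mixed, high);
  sign = _mm_srai_epi32(sign, 31);                              /* sign of each 32-bit half */
  return _mm_shuffle_epi32(sign, _MM_SHUFFLE(3, 3, 1, 1));      /* replicate the high halves */
}

/* lt_u swaps a and b; ge_u and le_u complement the final mask. */
```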

ARM64 processors

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to CMHI Vy.2D, Va.2D, Vb.2D
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to CMHI Vy.2D, Vb.2D, Va.2D
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to CMHS Vy.2D, Va.2D, Vb.2D
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to CMHS Vy.2D, Vb.2D, Va.2D
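
In NEON intrinsics (a sketch for AArch64 targets; the wrapper names are hypothetical):

```c
#include <arm_neon.h>

/* i64x2.gt_u and i64x2.le_u map directly onto CMHI and CMHS. */
static inline uint64x2_t i64x2_gt_u_arm64(uint64x2_t a, uint64x2_t b) {
  return vcgtq_u64(a, b);   /* CMHI Vy.2D, Va.2D, Vb.2D */
}
static inline uint64x2_t i64x2_le_u_arm64(uint64x2_t a, uint64x2_t b) {
  return vcgeq_u64(b, a);   /* CMHS Vy.2D, Vb.2D, Va.2D */
}
```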

ARMv7 processors with NEON instruction set

  • i64x2.gt_u
    • y = i64x2.gt_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VCGT.U32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VORR Qy, Qy, Qtmp
  • i64x2.lt_u
    • y = i64x2.lt_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qb, Qa
      • VCGT.U32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VORR Qy, Qy, Qtmp
  • i64x2.ge_u
    • y = i64x2.ge_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qb, Qa
      • VCEQ.I32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VAND Qy, Qy, Qtmp
  • i64x2.le_u
    • y = i64x2.le_u(a, b) is lowered to:
      • VQSUB.U64 Qy, Qa, Qb
      • VCEQ.I32 Qy, Qy, 0
      • VREV64.32 Qtmp, Qy
      • VAND Qy, Qy, Qtmp
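
With NEON intrinsics the same idea reads as follows (a sketch; the wrapper name is hypothetical, and a real implementation may keep a zero register live instead of materializing one):

```c
#include <arm_neon.h>

/* i64x2.gt_u on ARMv7 NEON, which has no 64-bit compares: the saturating
   unsigned subtraction a - b is nonzero exactly when a > b, and the
   "lane is nonzero" test is widened from the 32-bit halves to the full lane. */
static inline uint64x2_t i64x2_gt_u_neon(uint64x2_t a, uint64x2_t b) {
  uint32x4_t nz = vreinterpretq_u32_u64(vqsubq_u64(a, b));  /* VQSUB.U64 */
  uint32x4_t m  = vcgtq_u32(nz, vdupq_n_u32(0));            /* each half: != 0 */
  uint32x4_t s  = vrev64q_u32(m);                           /* swap the halves */
  return vreinterpretq_u64_u32(vorrq_u32(m, s));            /* either half nonzero */
}

/* ge_u instead checks that the saturated difference b - a is zero in both halves
   (vceqq_u32 against zero, then vrev64q_u32 and vandq_u32). */
```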

abrown commented Jan 11, 2021

See #412 (comment).

dtig commented Jan 25, 2021

Adding a preliminary vote for the inclusion of i64x2 unsigned comparison operations in the SIMD proposal below. Please vote with:

👍 For including i64x2 unsigned comparison operations
👎 Against including i64x2 unsigned comparison operations

@ngzhian ngzhian added the 2021-01-29 Agenda for sync meeting 1/29/21 label Jan 26, 2021
@dtig dtig added needs discussion Proposal with an unclear resolution and removed 2021-01-29 Agenda for sync meeting 1/29/21 labels Feb 2, 2021

dtig commented Mar 5, 2021

Closing as per #436.
