This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Floating-point rounding instructions #232

Merged
merged 5 commits into from
Sep 11, 2020

Maratyszcza
Contributor

@Maratyszcza Maratyszcza commented May 19, 2020

Introduction

Floating-point round-to-integer is a widely used operation, available in many software and hardware specifications:

  • As f32.nearest/f32.trunc/f32.ceil/f32.floor/f64.nearest/f64.trunc/f64.ceil/f64.floor scalar instruction in WebAssembly
  • As rint/nearbyint/trunc/ceil/floor functions in C and C++
  • As ROUNDPS and ROUNDPD instructions in SSE4.1
  • As VRINTN/VRINTZ/VRINTP/VRINTM instructions in ARMv8 AArch32
  • As FRINTN/FRINTZ/FRINTP/FRINTM instructions in AArch64

This PR introduces the rounding instructions to WebAssembly SIMD.

New instructions

  • Round to nearest integer, ties to even: f32x4.nearest/f64x2.nearest
  • Round to integer towards zero (truncate to integer): f32x4.trunc/f64x2.trunc
  • Round to integer above (ceiling): f32x4.ceil/f64x2.ceil
  • Round to integer below (floor): f32x4.floor/f64x2.floor

The instructions match the scalar WebAssembly analogs both in names and in semantics.
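As a rough illustration of the lane-wise semantics listed above, here is a minimal reference sketch in Python. The helper names `round_lane` and `round_vector` are hypothetical, and NaN handling is simplified to pass-through (the actual WebAssembly NaN propagation rules are more detailed), so treat it as a reading aid rather than a specification.

```python
import math

def round_lane(x: float, mode: str) -> float:
    """Round one lane to an integral value, keeping it a float."""
    if math.isnan(x) or math.isinf(x) or x == 0.0:
        return x  # NaN, infinities and signed zeros pass through (simplified)
    if mode == "nearest":
        r = float(round(x))        # Python's round() rounds ties to even
    elif mode == "trunc":
        r = float(math.trunc(x))   # toward zero
    elif mode == "ceil":
        r = float(math.ceil(x))    # toward +infinity
    elif mode == "floor":
        r = float(math.floor(x))   # toward -infinity
    else:
        raise ValueError(mode)
    # Keep the sign of the input, so that e.g. ceil(-0.5) yields -0.0.
    return math.copysign(r, x)

def round_vector(lanes, mode):
    """Apply the same rounding mode to every lane independently."""
    return [round_lane(x, mode) for x in lanes]
```

For example, `round_vector([0.5, 1.5, 2.5, -2.5], "nearest")` gives `[0.0, 2.0, 2.0, -2.0]`, which shows the ties-to-even behaviour.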

Mapping to Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.

x86/x86-64 processors with AVX instruction set

  • f32x4.nearest
    • y = f32x4.nearest(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x08
  • f32x4.trunc
    • y = f32x4.trunc(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0B
  • f32x4.ceil
    • y = f32x4.ceil(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x0A
  • f32x4.floor
    • y = f32x4.floor(x) is lowered to VROUNDPS xmm_y, xmm_x, 0x09
  • f64x2.nearest
    • y = f64x2.nearest(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x08
  • f64x2.trunc
    • y = f64x2.trunc(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0B
  • f64x2.ceil
    • y = f64x2.ceil(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x0A
  • f64x2.floor
    • y = f64x2.floor(x) is lowered to VROUNDPD xmm_y, xmm_x, 0x09

x86/x86-64 processors with SSE4.1 instruction set

  • f32x4.nearest
    • y = f32x4.nearest(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x08
  • f32x4.trunc
    • y = f32x4.trunc(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0B
  • f32x4.ceil
    • y = f32x4.ceil(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x0A
  • f32x4.floor
    • y = f32x4.floor(x) is lowered to ROUNDPS xmm_y, xmm_x, 0x09
  • f64x2.nearest
    • y = f64x2.nearest(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x08
  • f64x2.trunc
    • y = f64x2.trunc(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0B
  • f64x2.ceil
    • y = f64x2.ceil(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x0A
  • f64x2.floor
    • y = f64x2.floor(x) is lowered to ROUNDPD xmm_y, xmm_x, 0x09
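The four immediates used with ROUNDPS/ROUNDPD (and their VEX forms) differ only in their low two bits. The sketch below decodes them, assuming the imm8 layout as I understand it from the Intel manual: bits 1:0 select the rounding mode, bit 2 selects the MXCSR rounding mode instead, and bit 3 suppresses the precision exception. The function and dictionary names are hypothetical.

```python
# Rounding-control field (imm8 bits 1:0), named after the wasm operations.
ROUNDING_CONTROL = {
    0b00: "nearest",  # round to nearest, ties to even
    0b01: "floor",    # round toward -infinity
    0b10: "ceil",     # round toward +infinity
    0b11: "trunc",    # round toward zero
}

def decode_round_imm8(imm8: int) -> dict:
    """Split a ROUNDPS/ROUNDPD immediate into its fields."""
    return {
        "mode": ROUNDING_CONTROL[imm8 & 0b11],
        "use_mxcsr": bool(imm8 & 0b100),
        "suppress_precision_exception": bool(imm8 & 0b1000),
    }
```

Under that reading, 0x08, 0x09, 0x0A and 0x0B decode to nearest, floor, ceil and trunc respectively, each with the precision exception suppressed.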

x86/x86-64 processors with SSE2 instruction set

  • f32x4.nearest
    • y = f32x4.nearest(x) (y is NOT x) is lowered to:
      • MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
      • CVTPS2DQ xmm_y, xmm_x
      • CVTDQ2PS xmm_tmp1, xmm_y
      • PCMPEQD xmm_y, xmm_tmp0
      • POR xmm_y, xmm_tmp0
      • ADDPS xmm_tmp0, xmm_x
      • ANDPS xmm_tmp0, xmm_y
      • ANDNPS xmm_y, xmm_tmp1
      • ORPS xmm_y, xmm_tmp0
  • f32x4.trunc
    • y = f32x4.trunc(x) (y is NOT x) is lowered to:
      • MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
      • CVTTPS2DQ xmm_y, xmm_x
      • CVTDQ2PS xmm_tmp1, xmm_y
      • PCMPEQD xmm_y, xmm_tmp0
      • POR xmm_y, xmm_tmp0
      • ADDPS xmm_tmp0, xmm_x
      • ANDPS xmm_tmp0, xmm_y
      • ANDNPS xmm_y, xmm_tmp1
      • ORPS xmm_y, xmm_tmp0
  • f32x4.ceil
    • x = f32x4.ceil(x) is lowered to:
      • CVTTPS2DQ xmm_tmp0, xmm_x
      • MOVDQA xmm_tmp1, wasm_splat_u32(0x80000000)
      • CVTDQ2PS xmm_tmp2, xmm_tmp0
      • PCMPEQD xmm_tmp0, xmm_tmp1
      • POR xmm_tmp0, xmm_tmp1
      • MOVDQA xmm_tmp3, xmm_tmp0
      • ANDPS xmm_tmp3, xmm_x
      • ANDNPS xmm_tmp0, xmm_tmp2
      • ORPS xmm_tmp0, xmm_tmp3
      • CMPLEPS xmm_x, xmm_tmp0
      • ORPS xmm_x, xmm_tmp1
      • MOVAPS xmm_tmp2, xmm_x
      • ANDPS xmm_tmp2, xmm_tmp0
      • ADDPS xmm_tmp0, wasm_splat_f32(1.0f)
      • ANDNPS xmm_x, xmm_tmp0
      • ORPS xmm_x, xmm_tmp2
  • f32x4.floor
    • y = f32x4.floor(x) (y is NOT x) is lowered to:
      • MOVDQA xmm_tmp0, wasm_splat_u32(0x80000000)
      • CVTTPS2DQ xmm_y, xmm_x
      • CVTDQ2PS xmm_tmp1, xmm_y
      • PCMPEQD xmm_y, xmm_tmp0
      • POR xmm_y, xmm_tmp0
      • MOVAPS xmm_tmp0, xmm_y
      • ANDPS xmm_tmp0, xmm_x
      • ANDNPS xmm_y, xmm_tmp1
      • MOVAPS xmm_tmp1, xmm_x
      • ORPS xmm_y, xmm_tmp0
      • CMPLTPS xmm_tmp1, xmm_y
      • ANDPS xmm_tmp1, wasm_splat_f32(1.0f)
      • SUBPS xmm_y, xmm_tmp1
  • f64x2.nearest
    • y = f64x2.nearest(x) (y is NOT x) is lowered to:
      • MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp1, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
      • MOVAPS xmm_tmp2, xmm_tmp0
      • ANDPS xmm_y, xmm_tmp1
      • CMPLEPD xmm_tmp2, xmm_y
      • ADDPD xmm_y, xmm_tmp0
      • SUBPD xmm_y, xmm_tmp0
      • ANDNPS xmm_tmp2, xmm_tmp1
      • MOVAPS xmm_tmp1, xmm_tmp2
      • ANDNPS xmm_tmp1, xmm_x
      • ANDPS xmm_y, xmm_tmp2
      • ORPS xmm_y, xmm_tmp1
  • f64x2.trunc
    • y = f64x2.trunc(x) (y is NOT x) is lowered to:
      • MOVAPS xmm_y, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
      • MOVAPS xmm_tmp0, wasm_splat_f64(0x1.0p+52)
      • MOVAPS xmm_tmp1, xmm_x
      • ANDPS xmm_tmp1, xmm_y
      • MOVAPS xmm_tmp2, xmm_tmp0
      • CMPNLEPD xmm_tmp2, xmm_tmp1
      • ANDPS xmm_y, xmm_tmp2
      • MOVAPS xmm_tmp2, xmm_tmp1
      • ADDPD xmm_tmp2, xmm_tmp0
      • SUBPD xmm_tmp2, xmm_tmp0
      • CMPLTPD xmm_tmp1, xmm_tmp2
      • ANDPS xmm_tmp1, wasm_splat_f64(1.0)
      • SUBPD xmm_tmp2, xmm_tmp1
      • ANDPS xmm_tmp2, xmm_y
      • ANDNPS xmm_y, xmm_x
      • ORPS xmm_y, xmm_tmp2
  • f64x2.ceil
    • y = f64x2.ceil(x) (y is NOT x) is lowered to:
      • MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
      • MOVAPS xmm_y, xmm_x
      • MOVAPS xmm_tmp1, wasm_splat_f64(0x1.0p+52)
      • ANDPS xmm_y, xmm_tmp0
      • MOVAPS xmm_tmp2, xmm_tmp1
      • CMPNLEPD xmm_tmp2, xmm_y
      • ADDPD xmm_y, xmm_tmp1
      • ANDPS xmm_tmp2, xmm_tmp0
      • SUBPD xmm_y, xmm_tmp1
      • ANDPS xmm_y, xmm_tmp2
      • ANDNPS xmm_tmp2, xmm_x
      • ORPS xmm_tmp2, xmm_y
      • MOVAPS xmm_y, xmm_tmp2
      • MOVAPS xmm_tmp1, xmm_tmp2
      • CMPLTPD xmm_y, xmm_x
      • ADDPD xmm_tmp1, wasm_splat_f64(1.0)
      • ANDPS xmm_y, xmm_tmp0
      • ANDPS xmm_tmp1, xmm_y
      • ANDNPS xmm_y, xmm_tmp2
      • ORPS xmm_y, xmm_tmp1
  • f64x2.floor
    • y = f64x2.floor(x) (y is NOT x) is lowered to:
      • MOVAPS xmm_tmp0, wasm_splat_u64(0x7FFFFFFFFFFFFFFF)
      • MOVAPS xmm_tmp1, xmm_x
      • MOVAPS xmm_tmp2, wasm_splat_f64(0x1.0p+52)
      • ANDPS xmm_tmp1, xmm_tmp0
      • MOVAPS xmm_y, xmm_tmp2
      • CMPNLEPD xmm_y, xmm_tmp1
      • ANDPS xmm_y, xmm_tmp0
      • ADDPD xmm_tmp1, xmm_tmp2
      • SUBPD xmm_tmp1, xmm_tmp2
      • ANDPS xmm_tmp1, xmm_y
      • ANDNPS xmm_y, xmm_x
      • MOVAPS xmm_tmp0, xmm_x
      • ORPS xmm_y, xmm_tmp1
      • CMPLTPD xmm_tmp0, xmm_y
      • ANDPS xmm_tmp0, wasm_splat_f64(1.0)
      • SUBPD xmm_y, xmm_tmp0
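The f64x2 sequences above appear to rely on one idea: for |x| < 2^52, computing (|x| + 2^52) - 2^52 rounds |x| to the nearest integer (ties to even), and floor/ceil/trunc then correct that result by 1.0 where needed. The scalar Python sketch below restates that idea with branches instead of masks. The function names are hypothetical, and NaN is simply passed through, which simplifies what the blended SSE2 code does.

```python
import math

TWO52 = 2.0 ** 52  # doubles at or above this magnitude are already integral

def nearest_f64(x: float) -> float:
    ax = abs(x)
    if not ax < TWO52:           # large values and NaN pass through unchanged
        return x
    r = (ax + TWO52) - TWO52     # the addition rounds to an integer, ties to even
    return math.copysign(r, x)   # restore the sign (keeps -0.0)

def trunc_f64(x: float) -> float:
    ax = abs(x)
    if not ax < TWO52:
        return x
    r = (ax + TWO52) - TWO52
    if ax < r:                   # rounded away from zero: step back by one
        r -= 1.0
    return math.copysign(r, x)

def floor_f64(x: float) -> float:
    n = nearest_f64(x)
    return n - 1.0 if x < n else n   # rounded up: subtract one

def ceil_f64(x: float) -> float:
    n = nearest_f64(x)
    # Rounded down: add one, keeping the sign of n so ceil(-0.7) is -0.0.
    return math.copysign(n + 1.0, n) if n < x else n
```

The mask-and-blend instructions in the listings play the role of the `if` tests and `copysign` calls here.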

ARM64 processors

  • f32x4.nearest
    • y = f32x4.nearest(x) is lowered to FRINTN Vy.4S, Vx.4S
  • f32x4.trunc
    • y = f32x4.trunc(x) is lowered to FRINTZ Vy.4S, Vx.4S
  • f32x4.ceil
    • y = f32x4.ceil(x) is lowered to FRINTP Vy.4S, Vx.4S
  • f32x4.floor
    • y = f32x4.floor(x) is lowered to FRINTM Vy.4S, Vx.4S
  • f64x2.nearest
    • y = f64x2.nearest(x) is lowered to FRINTN Vy.2D, Vx.2D
  • f64x2.trunc
    • y = f64x2.trunc(x) is lowered to FRINTZ Vy.2D, Vx.2D
  • f64x2.ceil
    • y = f64x2.ceil(x) is lowered to FRINTP Vy.2D, Vx.2D
  • f64x2.floor
    • y = f64x2.floor(x) is lowered to FRINTM Vy.2D, Vx.2D

ARM processors with ARMv8 (32-bit) instruction set

  • f32x4.nearest
    • y = f32x4.nearest(x) is lowered to VRINTN.F32 Qy, Qx
  • f32x4.trunc
    • y = f32x4.trunc(x) is lowered to VRINTZ.F32 Qy, Qx
  • f32x4.ceil
    • y = f32x4.ceil(x) is lowered to VRINTP.F32 Qy, Qx
  • f32x4.floor
    • y = f32x4.floor(x) is lowered to VRINTM.F32 Qy, Qx
  • f64x2.nearest
    • y = f64x2.nearest(x) is lowered to VRINTN.F64 Dy_lo, Dx_lo + VRINTN.F64 Dy_hi, Dx_hi
  • f64x2.trunc
    • y = f64x2.trunc(x) is lowered to VRINTZ.F64 Dy_lo, Dx_lo + VRINTZ.F64 Dy_hi, Dx_hi
  • f64x2.ceil
    • y = f64x2.ceil(x) is lowered to VRINTP.F64 Dy_lo, Dx_lo + VRINTP.F64 Dy_hi, Dx_hi
  • f64x2.floor
    • y = f64x2.floor(x) is lowered to VRINTM.F64 Dy_lo, Dx_lo + VRINTM.F64 Dy_hi, Dx_hi

ARM processors with ARMv7 (32-bit) instruction set

  • f32x4.nearest
    • y = f32x4.nearest(x) (y is NOT x) is lowered to:
      • VMOV.I32 Qtmp0, 0x4B000000
      • VABS.F32 Qtmp1, Qx
      • VACGT.F32 Qy, Qx, Qtmp0
      • VADD.F32 Qtmp1, Qtmp1, Qtmp0
      • VORR.I32 Qy, 0x80000000
      • VSUB.F32 Qtmp1, Qtmp1, Qtmp0
      • VBSL Qy, Qx, Qtmp1
  • f32x4.trunc
    • y = f32x4.trunc(x) (y is NOT x) is lowered to:
      • VCVT.S32.F32 Qtmp0, Qx
      • VMOV.I32 Qtmp1, 0x4B000000
      • VACGT.F32 Qy, Qtmp1, Qx
      • VCVT.F32.S32 Qtmp0, Qtmp0
      • VBIC.I32 Qy, 0x80000000
      • VBSL Qy, Qtmp0, Qx
  • f32x4.ceil
    • y = f32x4.ceil(x) (y is NOT x) is lowered to:
      • VCVT.S32.F32 Qtmp0, Qx
      • VMOV.I32 Qtmp1, 0x4B000000
      • VACGT.F32 Qtmp1, Qtmp1, Qx
      • VCVT.F32.S32 Qtmp0, Qtmp0
      • VBIC.I32 Qtmp1, 0x80000000
      • VBSL Qtmp1, Qtmp0, Qx
      • VMOV.F32 Qtmp0, 0x3F800000
      • VCGE.F32 Qy, Qtmp1, Qx
      • VADD.F32 Qtmp0, Qtmp1, Qtmp0
      • VORR.I32 Qy, 0x80000000
      • VBSL Qy, Qtmp1, Qtmp0
  • f32x4.floor
    • y = f32x4.floor(x) (y is NOT x) is lowered to:
      • VCVT.S32.F32 Qtmp0, Qx
      • VMOV.I32 Qtmp1, 0x4B000000
      • VACGT.F32 Qy, Qtmp1, Qx
      • VCVT.F32.S32 Qtmp0, Qtmp0
      • VBIC.I32 Qy, 0x80000000
      • VBSL Qy, Qtmp0, Qx
      • VMOV.F32 Qtmp1, 0x3F800000
      • VCGT.F32 Qtmp0, Qy, Qx
      • VAND Qtmp0, Qtmp0, Qtmp1
      • VSUB.F32 Qy, Qy, Qtmp0
  • f64x2.nearest
    • y = f64x2.nearest(x) (y is NOT x) is lowered to:
      • VABS.F64 Dy_lo, Dx_lo
      • VABS.F64 Dy_hi, Dx_hi
      • VLDR Dtmp0, 0x1.0p+52
      • VSUB.F64 Dtmp1_lo, Dtmp0, Dy_lo
      • VSUB.F64 Dtmp1_hi, Dtmp0, Dy_hi
      • VADD.F64 Dtmp2_lo, Dy_lo, Dtmp0
      • VADD.F64 Dtmp2_hi, Dy_hi, Dtmp0
      • VEOR Qy, Qx, Qy
      • VSHR.S64 Qtmp1, Qtmp1, 63
      • VSUB.F64 Dtmp2_lo, Dtmp2_lo, Dtmp0
      • VSUB.F64 Dtmp2_hi, Dtmp2_hi, Dtmp0
      • VORR Qy, Qy, Qtmp1
      • VBSL Qy, Qx, Qtmp2
  • f64x2.trunc
    • y = f64x2.trunc(x) (y is NOT x) is lowered to:
      • VLDR Dtmp0, 0x1.0p+52
      • VABS.F64 Dy_lo, Dx_lo
      • VABS.F64 Dy_hi, Dx_hi
      • VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0
      • VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0
      • VSUB.F64 Dtmp2_lo, Dtmp0, Dy_lo
      • VSUB.F64 Dtmp2_hi, Dtmp0, Dy_hi
      • VEOR Qtmp3, Qy, Qx
      • VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
      • VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
      • VLDR Dtmp0, 1.0
      • VSHR.S64 Qtmp2, Qtmp2, 63
      • VORR Qtmp3, Qtmp3, Qtmp2
      • VSUB.I64 Qy, Qy, Qtmp1
      • VSHR.S64 Qy, Qy, 63
      • VAND Dy_lo, Dy_lo, Dtmp0
      • VAND Dy_hi, Dy_hi, Dtmp0
      • VSUB.F64 Dy_lo, Dtmp1_lo, Dy_lo
      • VSUB.F64 Dy_hi, Dtmp1_hi, Dy_hi
      • VBIT Qy, Qx, Qtmp3
  • f64x2.ceil
    • y = f64x2.ceil(x) (y is NOT x) is lowered to:
      • VLDR Dtmp0, 0x1.0p+52
      • VABS.F64 Dtmp1_lo, Dx_lo
      • VABS.F64 Dtmp1_hi, Dx_hi
      • VSUB.F64 Dtmp2_lo, Dtmp0, Dtmp1_lo
      • VSUB.F64 Dtmp2_hi, Dtmp0, Dtmp1_hi
      • VADD.F64 Dtmp3_lo, Dtmp1_lo, Dtmp0
      • VADD.F64 Dtmp3_hi, Dtmp1_hi, Dtmp0
      • VEOR Qtmp1, Qtmp1, Qx
      • VSHR.S64 Qtmp2, Qtmp2, 63
      • VSUB.F64 Dtmp3_lo, Dtmp3_lo, Dtmp0
      • VSUB.F64 Dtmp3_hi, Dtmp3_hi, Dtmp0
      • VLDR Dtmp0, 1.0
      • VORR Qtmp2, Qtmp2, Qtmp1
      • VBSL Qtmp2, Qx, Qtmp3
      • VSUB.F64 Dy_lo, Dtmp2_lo, Dx_lo
      • VSUB.F64 Dy_hi, Dtmp2_hi, Dx_hi
      • VADD.F64 Dtmp3_lo, Dtmp2_lo, Dtmp0
      • VADD.F64 Dtmp3_hi, Dtmp2_hi, Dtmp0
      • VSHR.S64 Qy, Qy, 63
      • VBIC Qy, Qy, Qtmp1
      • VBSL Qy, Qtmp3, Qtmp2
  • f64x2.floor
    • y = f64x2.floor(x) (y is NOT x) is lowered to:
      • VLDR Dtmp0, 0x1.0p+52
      • VABS.F64 Dy_lo, Dx_lo
      • VABS.F64 Dy_hi, Dx_hi
      • VADD.F64 Dtmp1_lo, Dy_lo, Dtmp0
      • VADD.F64 Dtmp1_hi, Dy_hi, Dtmp0
      • VSUB.I64 Dtmp2_lo, Dtmp0, Dy_lo
      • VSUB.I64 Dtmp2_hi, Dtmp0, Dy_hi
      • VEOR Qy, Qy, Qx
      • VSUB.F64 Dtmp1_lo, Dtmp1_lo, Dtmp0
      • VSUB.F64 Dtmp1_hi, Dtmp1_hi, Dtmp0
      • VLDR Dtmp0, 1.0
      • VSHR.S64 Qtmp2, Qtmp2, 63
      • VORR Qy, Qy, Qtmp2
      • VBSL Qy, Qx, Qtmp1
      • VSUB.F64 Dx_lo, Dx_lo, Dy_lo
      • VSUB.F64 Dx_hi, Dx_hi, Dy_hi
      • VSHR.S64 Qtmp2, Qx, 63
      • VAND Dtmp2_lo, Dtmp2_lo, Dtmp0
      • VAND Dtmp2_hi, Dtmp2_hi, Dtmp0
      • VSUB.F64 Dy_lo, Dy_lo, Dtmp2_lo
      • VSUB.F64 Dy_hi, Dy_hi, Dtmp2_hi
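To make the ARMv7 f32x4.nearest sequence above easier to follow, here is a per-lane Python emulation of it. It is only a sketch: the helper names are hypothetical, single-precision arithmetic is emulated by rounding Python doubles through `struct`, and the input is assumed to already be a value representable in binary32.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE binary32 value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Value of the binary32 number with bit pattern b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32(x: float) -> float:
    """Round a Python float to the nearest binary32 value."""
    return bits_f32(f32_bits(x))

MAGIC_BITS = 0x4B000000  # bit pattern of 2**23 in binary32
SIGN_BIT = 0x80000000

def nearest_f32_lane(x: float) -> float:
    magic = bits_f32(MAGIC_BITS)                 # VMOV.I32 Qtmp0, 0x4B000000
    ax = abs(x)                                  # VABS.F32 Qtmp1, Qx
    mask = 0xFFFFFFFF if ax > magic else 0       # VACGT.F32 (false for NaN)
    r = f32(ax + magic)                          # VADD.F32: rounds to an integer
    mask |= SIGN_BIT                             # VORR.I32: always take x's sign
    r = f32(r - magic)                           # VSUB.F32
    # VBSL: take bits of x where mask is set, bits of r elsewhere.
    bits = (f32_bits(x) & mask) | (f32_bits(r) & ~mask & 0xFFFFFFFF)
    return bits_f32(bits)
```

Lanes with |x| > 2^23 are already integral and are copied from x unchanged; all other lanes take the rounded magnitude with the sign bit of x, which is how -0.25 becomes -0.0.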

@dtig dtig mentioned this pull request May 20, 2020
@dtig dtig linked an issue May 20, 2020 that may be closed by this pull request
@tlively
Member

tlively commented May 21, 2020

Yikes, the new numbering only has room for one rounding instruction. We'll have to figure out what to do about that in the long term. Meanwhile, @dtig and @ngzhian do you have a preference about which opcodes to use to prototype this?

@ngzhian
Member

ngzhian commented May 21, 2020

No preferences for prototyping, we can probably squeeze them into
0xdc-0xdf
0xe2, 0xee, 0xf8, 0xf9
for now.

@dtig
Member

dtig commented May 21, 2020

No strong preferences either, it's somewhat awkward, but we could also do something in the range of 0xc2- 0xca if contiguous opcodes make this simpler, because I don't see the 64x2 AnyTrue/AllTrue and the widen/narrowing instructions to be relevant for 64x2 operations going forward.

If we do have to spill over, it's not terrible but we can make that call when we decide to move past prototyping.

@richgel999

richgel999 commented May 28, 2020

These instructions aren't optional IMO. They're fundamental operations. Having to emulate them will be quite painful for many SIMD/SPMD kernels and vectorized math functions.

I have a Perlin noise kernel that computes 24 floors per output pixel:
https://t.co/u9w35T6oTq?amp=1

In another example, I have a vectorized approximate math library. It can compute vectorized tan, sin, cos, log, exp, etc. It uses floor and round for range reduction:
https://t.co/3JlYyZ2oMI?amp=1

Without efficient round/floor/trunc, WebAssembly SIMD will be in the same position SSE2 is relative to SSE4.1. When we execute kernels on SSE2, we commonly get a 15-20% reduction in performance due to having to emulate round/floor/trunc on some kernels, or if they call sin/cos/tan/etc. These are very important operations.

I am currently porting CppSPMD_Fast to WebAssembly, and the lack of efficient round/floor/trunc is going to hurt some kernels by quite a bit. I should have it up and running in 2-3 days.

@zeux
Contributor

zeux commented May 29, 2020

Worth noting is that the common way to emulate round/floor/trunc includes conversions back & forth to integers (obviously this is application-dependent as it assumes a specific range and is typically non-IEEE compliant for some operations); however, due to #173 this workaround is going to be slow.

If the inputs are known to be within a 23-bit integer range or thereabouts, floating point addition can be abused to round, and it's probably possible to implement floor etc. in a similar fashion but that route doesn't seem like one we would want to recommend.
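For readers unfamiliar with the two workarounds mentioned in this comment, a minimal Python sketch follows. The function names and the exact range checks are my own assumptions, chosen for illustration; single precision is emulated by rounding through `struct`.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def floor_via_int(x: float) -> float:
    """floor() through a float -> int -> float round trip."""
    if not -2.0 ** 31 < x < 2.0 ** 31:
        raise ValueError("outside the int32 range this shortcut assumes")
    t = float(int(x))                 # int() truncates toward zero
    return t - 1.0 if t > x else t    # fix up negative non-integers

K = 1.5 * 2.0 ** 23  # at this magnitude the binary32 spacing is exactly 1.0

def nearest_via_add(x: float) -> float:
    """Round to nearest (ties to even) by adding and subtracting K."""
    if not abs(x) < 2.0 ** 22:
        raise ValueError("outside the range where x + K has a spacing of 1.0")
    return to_f32(to_f32(x + K) - K)
```

Both shortcuts lose the sign of zero (for instance `floor_via_int(-0.0)` returns `0.0`), which is one example of the non-IEEE behaviour the comment refers to.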

@Marc-B-Reynolds

Marc-B-Reynolds commented May 29, 2020

If the inputs are known to be within a 23-bit integer range or thereabouts, floating point addition can be abused to round, and it's probably possible to implement floor etc. in a similar fashion but that route doesn't seem like one we would want to recommend.

Worth noting that this stops working if FP rules are relaxed: (x+K)-K can be transformed to x.

@ngzhian
Member

ngzhian commented May 29, 2020

@Maratyszcza any suggestions for ARM v7 instruction sequence? It will probably look a lot like the x86 SSE2 one?

SIMD equivalents of the nearest/trunc/ceil/floor instructions
@Maratyszcza
Contributor Author

Updated opcodes post-renumbering, put into 0xd8-0xdf range

@Maratyszcza
Contributor Author

Mapping to SSE2 is finished. @ngzhian ARMv7 NEON is quite different, because of its unique features:

  • Compare absolute values instruction
  • Single-instruction bitwise selection (VBSL/VBIT/VBIF)
  • Bitwise OR and bitwise AND instructions with immediate values

@Maratyszcza
Contributor Author

Added ARMv7 NEON mapping for f32 instructions

@ngzhian
Member

ngzhian commented Jun 1, 2020

There's some magic going on there. Thanks Marat!

@Maratyszcza Maratyszcza changed the title [WIP] Floating-point rounding instructions Floating-point rounding instructions Jun 2, 2020
@Maratyszcza
Contributor Author

All instruction mappings are finished, and the PR is ready for review.

Member

@tlively tlively left a comment

It would be good to change the order of instructions to be consistent with their corresponding MVP instructions.

proposals/simd/BinarySIMD.md (outdated, resolved)
proposals/simd/ImplementationStatus.md (outdated, resolved)
proposals/simd/ImplementationStatus.md (outdated, resolved)
proposals/simd/SIMD.md (outdated, resolved)
tlively added a commit to tlively/binaryen that referenced this pull request Jun 4, 2020
tlively added a commit to WebAssembly/binaryen that referenced this pull request Jun 5, 2020
@dtig
Member

dtig commented Jun 9, 2020

Thanks @Maratyszcza for filing the issues. Moving this to prototyping: on all platforms that we currently use as a baseline, these have a direct mapping to instructions. On ARMv7 there is a precedent for them being slow, as this is also the case for the scalar versions of these operations (some implementations call out to the runtime to implement them). Moving to pending prototype data as we are prototyping them in V8, and adding a retroactive label update.

tlively added a commit to llvm/llvm-project that referenced this pull request Jun 9, 2020
Summary:
As specified in WebAssembly/simd#232. These
instructions are implemented as LLVM intrinsics for now rather than
normal ISel patterns to make these instructions opt-in. Once the
instructions are merged to the spec proposal, the intrinsics will be
replaced with proper ISel patterns.

Reviewers: aheejin

Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits

Tags: #clang, #llvm

Differential Revision: https://reviews.llvm.org/D81222
@tlively
Member

tlively commented Jun 11, 2020

These will be available in the next version of Emscripten via __builtin_wasm_{ceil,floor,trunc,nearest}_{f32x4,f64x2}.

@ngzhian
Member

ngzhian commented Jun 16, 2020

Prototype in V8 is done for x64, ia32, ARM64. Still working on ARM.
Update: 2020-06-30, prototype on ARM is done as of https://crrev.com/8e54afbe2499cefbccda7ab8a9786451b57db961

gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Aug 16, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets, that's the best documentation right
now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for signatures of vroundss and vroundsd, these are
unary operations and should follow the conventions for these with
src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982

UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Aug 16, 2020
UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
UltraBlame original commit: 2e7ddb00c8f9240e148cf5843b50a7ba7b913351
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Aug 16, 2020
UltraBlame original commit: 2d73a015caaa3e70c175172158a6548625dc6da3
@ngzhian
Member

ngzhian commented Sep 11, 2020

This has been accepted into the proposal [0] during the sync on 2020-09-04. This LGTM, as it is.

Note, I would like https://github.com/WebAssembly/simd/blob/master/proposals/simd/NewOpcodes.md to be updated too, but it requires more tweaks (since there is a bit of a collision in opcodes for these instructions and the "reserved ones" under i64x2, and also ordering of instructions for presentation). But that's not a big problem, and can be worked on in the future.

[0] https://docs.google.com/document/d/138cF6aOUa9RZC2tOR7AhlIQWdmX5EtpzXRTVDAN3bfo/edit# see "4. Floating point rounding"

Co-authored-by: Thomas Lively <[email protected]>
@ngzhian ngzhian merged commit 8e87db7 into WebAssembly:master Sep 11, 2020
pull bot pushed a commit to Alan-love/v8 that referenced this pull request Sep 15, 2020
Implement f32x4 and f64x2 nearest, trunc, ceil, and floor.

These instructions were accepted into the proposal [0], this change
removes all the ifdefs and todo guarding the prototypes, and moves these
instructions out of the post-mvp flag.

[0] WebAssembly/simd#232

Bug: v8:10906
Change-Id: I44ec21dd09f3bf7cf3cae5d35f70f9d2c178c4e4
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2406547
Commit-Queue: Zhi An Ng <[email protected]>
Reviewed-by: Bill Budge <[email protected]>
Cr-Commit-Position: refs/heads/master@{#69923}
pull bot pushed a commit to p-g-krish/v8 that referenced this pull request Sep 15, 2020
Port 068cf20

Original Commit Message:

    Implement f32x4 and f64x2 nearest, trunc, ceil, and floor.

    These instructions were accepted into the proposal [0], this change
    removes all the ifdefs and todo guarding the prototypes, and moves these
    instructions out of the post-mvp flag.

    [0] WebAssembly/simd#232

[email protected], [email protected], [email protected], [email protected]
BUG=
LOG=N

Change-Id: I02086255f635f1d47586fc74dd754426f6beccb0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2411675
Reviewed-by: Milad Farazmand <[email protected]>
Reviewed-by: Junliang Yan <[email protected]>
Commit-Queue: Milad Farazmand <[email protected]>
Cr-Commit-Position: refs/heads/master@{#69925}
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this pull request Oct 14, 2020
…status. r=jseward

Background: WebAssembly/simd#232

For all the rounding SIMD instructions:

- remove the internal 'Experimental' opcode suffix in the C++ code
- remove the guard on experimental Wasm instructions in all the C++ decoders
- move the test cases from simd/experimental.js to simd/ad-hack.js

I have checked that current V8 and wasm-tools use the same opcode
mappings.  V8 in turn guarantees the correct mapping for LLVM and
binaryen.

Drive-by bug fix: the test predicate for f64 square root was wrong, it
would round its argument to float.  This did not matter for the test
inputs we had but started to matter when I added more difficult inputs
for testing rounding.

Differential Revision: https://phabricator.services.mozilla.com/D92926
jamienicol pushed a commit to jamienicol/gecko that referenced this pull request Oct 15, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 23, 2020
…structions

This patch implements, for aarch64, the following wasm SIMD extensions

  Floating-point rounding instructions
  WebAssembly/simd#232

  Pseudo-Minimum and Pseudo-Maximum instructions
  WebAssembly/simd#122

The changes are straightforward:

* `build.rs`: the relevant tests have been enabled

* `cranelift/codegen/meta/src/shared/instructions.rs`: new CLIF instructions
  `fmin_pseudo` and `fmax_pseudo`.  The wasm rounding instructions do not need
  any new CLIF instructions.

* `cranelift/wasm/src/code_translator.rs`: translation into CLIF; this is
  pretty much the same as any other unary or binary vector instruction (for
  the rounding and the pmin/max respectively)

* `cranelift/codegen/src/isa/aarch64/lower_inst.rs`:
  - `fmin_pseudo` and `fmax_pseudo` are converted into a two instruction
    sequence, `fcmpgt` followed by `bsl`
  - the CLIF rounding instructions are converted to a suitable vector
    `frint{n,z,p,m}` instruction.

* `cranelift/codegen/src/isa/aarch64/inst/mod.rs`: minor extension of `pub
  enum VecMisc2` to handle the rounding operations.  And corresponding `emit`
  cases.
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 24, 2020
julian-seward1 added a commit to julian-seward1/wasmtime that referenced this pull request Oct 26, 2020
julian-seward1 added a commit to bytecodealliance/wasmtime that referenced this pull request Oct 26, 2020
ambroff pushed a commit to ambroff/gecko that referenced this pull request Nov 4, 2020
Implement some of the experimental SIMD opcodes that are supported by
all of V8, LLVM, and Binaryen, for maximum compatibility with test
content we might be exposed to.  Most/all of these will probably make
it into the spec, as they lead to substantial speedups in some
programs, and they are deterministic.

For spec and cpu mapping details, see:

WebAssembly/simd#122 (pmax/pmin)
WebAssembly/simd#232 (rounding)
WebAssembly/simd#127 (dot product)
WebAssembly/simd#237 (load zero)

The wasm bytecode values used here come from the binaryen changes that
are linked from those tickets; that is the best documentation available
right now.  Current binaryen opcode mappings are here:
https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h

Also: Drive-by fix for the signatures of vroundss and vroundsd; these are
unary operations and should follow the convention for unary operations,
with src/dest arguments, not src0/src1/dest.

Also: Drive-by fix to add variants of vmovss and vmovsd on x64 that
take Operand source and FloatRegister destination.

Differential Revision: https://phabricator.services.mozilla.com/D85982
cfallin pushed a commit to bytecodealliance/wasmtime that referenced this pull request Nov 30, 2020
@ngzhian
Member

ngzhian commented Feb 6, 2021

@tlively this wasn't added to NewOpcodes.md, just fyi in case you are looking at that doc for opcode organization.

@tlively
Member

tlively commented Feb 6, 2021

Oh, thanks for pointing that out. I had indeed missed them.

tlively added a commit that referenced this pull request Feb 6, 2021
Successfully merging this pull request may close these issues.

f32x4 = roundXX(f32x4)?