f32x4 = roundXX(f32x4)? #177

jan-wassenberg · 2020-01-14T10:17:41Z

I'm surprised to see the various f32x4 = roundXX(f32x4) functions specified by IEEE not included yet.
Was there some prior discussion about them, or just no need so far?

round (roundToIntegralTiesToEven)
ceil (roundToIntegralTowardPositive)
floor (roundToIntegralTowardNegative)
trunc (roundToIntegralTowardZero)

JPEG XL uses round() for quantization - we find it useful to keep the values in floating-point rather than round while converting to integer. This allows a subsequent FP subtract without (fairly expensive) conversions FP -> int -> FP. I also see uses in Embree and GLM.

These operations are widely supported: on x86, we have _mm_round_ps; on ARM it's vrndn; on PPC it's xvrspi*. Are there any concerns about including them?

nfrechette · 2020-01-14T15:04:58Z

It is also used by the Animation Compression Library which I will be porting to wasm in the next few weeks. floor is used to calculate key indices from a sample time and the resulting interpolation alpha between them. Keeping everything in floating point ensures optimal performance.

Maratyszcza · 2020-01-20T20:01:29Z

I agree that WAsm SIMD should have such instructions: they are supported on recentish processors (on x86 since SSE4.1 and on ARM NEON since ARMv8), and expensive to emulate, especially with float<->int conversion inefficiencies discussed in #173. The polyfill for pre-SSE4.1/ARMv8 ISA would be non-trivial, though.

jan-wassenberg · 2020-01-21T22:04:57Z

Agreed, might even require scalar code - that's how we are currently emulating it.

dtig · 2020-02-20T08:10:36Z

A version of this was previously discussed in #7 (the v128 result in the signature in that issue is ambiguous), but as there were concerns about efficient implementations across ISAs, and there wasn't an explicit need for them at the time so they weren't included. If there is a real-world (or a micro benchmark that emulates a real world use case) that we can use, that would be helpful in getting some data to make a good case for including this set of operations.

jan-wassenberg · 2020-03-02T10:09:14Z

@dtig I've gathered some data. The JPEG XL encoder (Cheetah setting) goes from 19.3 MP/s to 16.1 when round() is replaced with scalar std::round. This 1.2x is a surprisingly high cost, given that we only round once per pixel for quantization.

tlively · 2020-03-28T01:45:03Z

What is the status of this issue? @dtig what do you think about the JPEG XL improvement @jan-wassenberg shared?

dtig · 2020-04-01T17:27:17Z

@jan-wassenberg I agree that that's a high cost to pay, was wondering if the 1.2x was observed just on one architecture, or if this was consistent across architectures? If you have some data to share there, and/or links to code samples for the rounding modes in use maybe we can distill this into a PR?

In general I am concerned about the inefficient polyfill aspect that @Maratyszcza mentioned, not so much for pre-SSE4.1, but more for pre-ARMv8 as IIRC for pre-SSE4.1 the real world usage numbers IIRC are very low.

Maratyszcza · 2020-05-06T20:07:39Z

WebAssembly 1.1 includes the following instructions:

f32.trunc
f32.nearest
f32.ceil
f32.floor
f64.trunc
f64.nearest
f64.ceil
f64.floor

dtig · 2020-05-20T23:14:38Z

Circling back to this, the only argument against this set of operations was that it's harder to polyfill for older chipsets. Speaking from the perspective of an engine implementer, pre-SSE4.1 code for us isn't really very interesting as we establish SSE4.1 as the baseline for SIMD support on Intel architectures. Looking more closely at ARMv7, it's a hard question to answer as to exactly how much performance matters on these, the other thing to note is also that the scalar versions are also quite inefficient, so including vector versions here seems less of a problem. That said, it would be good to verify that the perf benefits observed by @jan-wassenberg are observed across architectures. As we're okay with prototyping this in V8, marking as pending prototype data for now. Thanks @Maratyszcza for distilling this into #232.

tlively added the needs discussion Proposal with an unclear resolution label Mar 28, 2020

dtig linked a pull request May 20, 2020 that will close this issue

Floating-point rounding instructions #232

Merged

dtig added pending prototype data and removed needs discussion Proposal with an unclear resolution labels May 20, 2020

ngzhian closed this as completed in #232 Sep 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

f32x4 = roundXX(f32x4)? #177

f32x4 = roundXX(f32x4)? #177

jan-wassenberg commented Jan 14, 2020

nfrechette commented Jan 14, 2020

Maratyszcza commented Jan 20, 2020

jan-wassenberg commented Jan 21, 2020

dtig commented Feb 20, 2020

jan-wassenberg commented Mar 2, 2020

tlively commented Mar 28, 2020

dtig commented Apr 1, 2020

Maratyszcza commented May 6, 2020

dtig commented May 20, 2020

f32x4 = roundXX(f32x4)? #177

f32x4 = roundXX(f32x4)? #177

Comments

jan-wassenberg commented Jan 14, 2020

nfrechette commented Jan 14, 2020

Maratyszcza commented Jan 20, 2020

jan-wassenberg commented Jan 21, 2020

dtig commented Feb 20, 2020

jan-wassenberg commented Mar 2, 2020

tlively commented Mar 28, 2020

dtig commented Apr 1, 2020

Maratyszcza commented May 6, 2020

dtig commented May 20, 2020