Skip to content
This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

f32x4 = roundXX(f32x4)? #177

Closed
jan-wassenberg opened this issue Jan 14, 2020 · 9 comments · Fixed by #232
Closed

f32x4 = roundXX(f32x4)? #177

jan-wassenberg opened this issue Jan 14, 2020 · 9 comments · Fixed by #232

Comments

@jan-wassenberg
Copy link

I'm surprised to see the various f32x4 = roundXX(f32x4) functions specified by IEEE not included yet.
Was there some prior discussion about them, or just no need so far?

round (roundToIntegralTiesToEven)
ceil (roundToIntegralTowardPositive)
floor (roundToIntegralTowardNegative)
trunc (roundToIntegralTowardZero)

JPEG XL uses round() for quantization - we find it useful to keep the values in floating-point rather than round while converting to integer. This allows a subsequent FP subtract without (fairly expensive) conversions FP -> int -> FP. I also see uses in Embree and GLM.

These operations are widely supported: on x86, we have _mm_round_ps; on ARM it's vrndn; on PPC it's xvrspi*. Are there any concerns about including them?

@nfrechette
Copy link

It is also used by the Animation Compression Library which I will be porting to wasm in the next few weeks. floor is used to calculate key indices from a sample time and the resulting interpolation alpha between them. Keeping everything in floating point ensures optimal performance.

@Maratyszcza
Copy link
Contributor

I agree that WAsm SIMD should have such instructions: they are supported on recentish processors (on x86 since SSE4.1 and on ARM NEON since ARMv8), and expensive to emulate, especially with float<->int conversion inefficiencies discussed in #173. The polyfill for pre-SSE4.1/ARMv8 ISA would be non-trivial, though.

@jan-wassenberg
Copy link
Author

Agreed, might even require scalar code - that's how we are currently emulating it.

@dtig
Copy link
Member

dtig commented Feb 20, 2020

A version of this was previously discussed in #7 (the v128 result in the signature in that issue is ambiguous), but as there were concerns about efficient implementations across ISAs, and there wasn't an explicit need for them at the time so they weren't included. If there is a real-world (or a micro benchmark that emulates a real world use case) that we can use, that would be helpful in getting some data to make a good case for including this set of operations.

@jan-wassenberg
Copy link
Author

@dtig I've gathered some data. The JPEG XL encoder (Cheetah setting) goes from 19.3 MP/s to 16.1 when round() is replaced with scalar std::round. This 1.2x is a surprisingly high cost, given that we only round once per pixel for quantization.

@tlively
Copy link
Member

tlively commented Mar 28, 2020

What is the status of this issue? @dtig what do you think about the JPEG XL improvement @jan-wassenberg shared?

@tlively tlively added the needs discussion Proposal with an unclear resolution label Mar 28, 2020
@dtig
Copy link
Member

dtig commented Apr 1, 2020

@jan-wassenberg I agree that that's a high cost to pay, was wondering if the 1.2x was observed just on one architecture, or if this was consistent across architectures? If you have some data to share there, and/or links to code samples for the rounding modes in use maybe we can distill this into a PR?

In general I am concerned about the inefficient polyfill aspect that @Maratyszcza mentioned, not so much for pre-SSE4.1, but more for pre-ARMv8 as IIRC for pre-SSE4.1 the real world usage numbers IIRC are very low.

@Maratyszcza
Copy link
Contributor

WebAssembly 1.1 includes the following instructions:

  • f32.trunc
  • f32.nearest
  • f32.ceil
  • f32.floor
  • f64.trunc
  • f64.nearest
  • f64.ceil
  • f64.floor

@dtig
Copy link
Member

dtig commented May 20, 2020

Circling back to this, the only argument against this set of operations was that it's harder to polyfill for older chipsets. Speaking from the perspective of an engine implementer, pre-SSE4.1 code for us isn't really very interesting as we establish SSE4.1 as the baseline for SIMD support on Intel architectures. Looking more closely at ARMv7, it's a hard question to answer as to exactly how much performance matters on these, the other thing to note is also that the scalar versions are also quite inefficient, so including vector versions here seems less of a problem. That said, it would be good to verify that the perf benefits observed by @jan-wassenberg are observed across architectures. As we're okay with prototyping this in V8, marking as pending prototype data for now. Thanks @Maratyszcza for distilling this into #232.

@dtig dtig linked a pull request May 20, 2020 that will close this issue
@dtig dtig added pending prototype data and removed needs discussion Proposal with an unclear resolution labels May 20, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants