Consider introducing a "long SIMD" API #180
Comments
I strongly advise that we synchronize with the C++ standards committee's SG1 concurrency & parallelism sub-group on such SIMD primitives: one of the primary code generators for SIMD.js will be C++ compilers, so it would be unfortunate to invent features in a vacuum. We can agree on an ad-hoc SIMD API if it comes from the auto-vectorizer, but probably shouldn't if non-fixed-width SIMD is exposed as a language primitive in both languages. I am in favor of such a SIMD feature, and am happy to continue representing the concerns of virtual ISAs at the C++ standards committee (JavaScript being one virtual ISA I care about as a representative of Chrome). As background, the C++ standards committee has been discussing both fixed-width and non-fixed-width SIMD for a while, and my tealeaf-reading tells me that both approaches will likely make it to technical specifications, but neither will be in C++17. Relevant papers can be found in the 2014 and 2015 mailings (AOL keywords A word of caution: discussion of syntax, and of library versus language features, is intermixed with discussion of the technical capabilities of SIMD. I believe the TC39 audience won't care for the C++ syntax aspect. The non-fixed-width SIMD approach was recently discussed, and the main point of interest I found was its use of a wavefront model. Also of interest may be the parallelism TS (authored by NVIDIA's Jared Hoberock), whose vector execution policies would be visible to JavaScript when translating from C++. The executors work is also relevant, but still quite early (the impending post-Lenexa mailing should have more details).
My proposal above conveniently omitted mention of how the iterations of a SIMD.Long operation might be ordered :). I agree that it's desirable to coordinate with the C++ committee here to see if we can find something that works for both. I think the rest of my sketch can basically be made compatible with the conceptual models in the n4238 paper linked above. Representing some of these ideas within LLVM may even be the bigger challenge, language-wise.
Just FYI: C# introduced these 'use max available hardware size vector' operations last year. They only allow basic arithmetic and logic operations (+, -, and, or, etc.) and they only allow 32- and 64-bit floats and ints as element types. They do expose a .Length() function/property, so they do expect devs to manually write the loops that manipulate the vector data. If the set of operations is limited to the basic arithmetic and logical operators, I don't think the .Length should be exposed.
128-bit SIMD.js leverages a broad convergence across architectures. SSE through SSE4.2, NEON, Altivec, MSA, all largely lined up at 128-bit SIMD registers, mostly IEEE-754, a lot of commonality in the intersection of operations, and no predication. And it's not an accident; this kind of 128-bit SIMD really is a great fit for many domains.
512-bit SIMD doesn't have this kind of convergence, now or in the foreseeable future. On one hand, one can say that Intel has merely gone further than others. On the other hand, one might say that other architectures already do have 512-bit SIMD units, and they're called GPUs. With the latter view, extending SIMD.js's fixed-width approach to 512-bit isn't a portable abstraction, because it only handles one architecture.
And on the compiler side, longer SIMD instructions have greater needs for predication, and fixed-length predicate vectors introduce some API ambiguities. To lower an Int1x4 to a 128-bit unpredicated SIMD platform, the representation needed depends on how the value was defined and how it will be used; it might be an int32x4, or it might be two int64x2s. It's true that a clever compiler can often figure out the right thing to do by looking around at context, but since there's only one type, there's nothing ruling out implicit conversions between representations, or uses having reaching definitions of differing representations, so there will always be several weird corner cases to handle. And, there's often a need for calling conventions or in-memory representations with a single representation for a given type.
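To make the representation ambiguity concrete, here is a minimal sketch in plain JavaScript (no SIMD.js required). The helper names `maskAsInt32x4` and `maskAsInt64x2Pair` are hypothetical and exist only for illustration; they show the same four-lane boolean mask laid out as one 128-bit value with 32-bit lanes, or as two 128-bit values with 64-bit lanes, using the all-ones/all-zeros lane convention that unpredicated 128-bit platforms typically use for comparison results.

```javascript
// Hypothetical helpers, not part of SIMD.js or any proposal.
// Each lane is all-ones (-1) for true and all-zeros (0) for false.

// Layout 1: one 128-bit value holding four 32-bit lanes.
function maskAsInt32x4(mask) {
  return Int32Array.from(mask, (m) => (m ? -1 : 0));
}

// Layout 2: two 128-bit values, each holding two 64-bit lanes.
function maskAsInt64x2Pair(mask) {
  const wide = BigInt64Array.from(mask, (m) => (m ? -1n : 0n));
  return [wide.slice(0, 2), wide.slice(2, 4)];
}

const mask = [true, false, true, true];
const narrow = maskAsInt32x4(mask);    // 16 bytes in total
const pair = maskAsInt64x2Pair(mask);  // 2 x 16 bytes in total
```

Both layouts encode the same logical mask, but they occupy different numbers of registers, which is why a single boolean vector type does not by itself tell the compiler which one to pick.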
And, can you convert between a boolean vector and bits packed in a scalar integer? AVX-512 is pushing one way -- in <avx512fintrin.h> masks are even directly represented as bits packed into scalar integers -- while on the NEON side, we had to remove the signmask operation from SIMD.js because NEON can't do it efficiently.
We should take this opportunity to consider a "long SIMD" approach which could actually be portable and sane across all the 512-bit SIMD and other size SIMD units in use today for "long SIMD" types of use cases.
The primary characteristic of a "long SIMD" API is that it doesn't have a fixed width in the programming model. This is both a strength and a weakness, which is why it complements the "short SIMD" approach rather than being redundant with it.
There are several possible approaches to "long SIMD". I'll sketch out one possible approach here:
(warning, totally rough sketch)
Operations like SIMD.long.add and SIMD.long.mul would be combinators that form an expression tree, which SIMD.Long.do would then evaluate.
This would be SIMD-width-independent and would operate on regular TypedArrays (and Shared ones too, when that becomes available). It could be given conditional operators to support predication, it could have a fairly obvious mapping/polyfill to either scalar code or 128-bit SIMD.js, and it could be made to support optimized implementations using AVX-512 or other things. And for problem domains which fit the "long SIMD" model, this style of API would be much nicer to use, because it would take care of details like cleaning up when the number of elements in an array isn't a multiple of the SIMD lane count.
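As an illustration of the scalar polyfill idea only, the following sketch builds an expression tree with combinators and evaluates it element by element over TypedArrays. Every name here (`input`, `add`, `mul`, `run`) is a hypothetical stand-in chosen for this example; it is not the API proposed in this issue, and the real shape of the combinators and of the evaluator is still open.

```javascript
// Hypothetical scalar polyfill sketch; none of these names are a real API.

// Leaf node: refers to the index-th input array handed to run().
const input = (index) => ({ op: "input", index });

// Combinators only build expression-tree nodes; they do no arithmetic.
const add = (lhs, rhs) => ({ op: "add", lhs, rhs });
const mul = (lhs, rhs) => ({ op: "mul", lhs, rhs });

// Evaluate the tree for a single element index.
function evalAt(node, inputs, i) {
  switch (node.op) {
    case "input":
      return inputs[node.index][i];
    case "add":
      return evalAt(node.lhs, inputs, i) + evalAt(node.rhs, inputs, i);
    case "mul":
      return evalAt(node.lhs, inputs, i) * evalAt(node.rhs, inputs, i);
    default:
      throw new Error("unknown op: " + node.op);
  }
}

// Stand-in evaluator: the caller never sees a lane count.
function run(expr, out, inputs) {
  for (const arr of inputs) {
    if (arr.length !== out.length) {
      throw new RangeError("all arrays must have the same length");
    }
  }
  for (let i = 0; i < out.length; i++) {
    out[i] = evalAt(expr, inputs, i);
  }
  return out;
}

// Five elements: deliberately not a multiple of four lanes.
const a = Float32Array.of(1, 2, 3, 4, 5);
const b = Float32Array.of(10, 20, 30, 40, 50);
const c = Float32Array.of(2, 2, 2, 2, 2);
const expr = add(mul(input(0), input(1)), input(2)); // a * b + c
const result = run(expr, new Float32Array(a.length), [a, b, c]);
```

Because the caller only hands over whole arrays, an implementation backed by real SIMD hardware would be free to process them in chunks of whatever lane count it has and finish the leftover elements itself; the scalar loop above simply has no remainder to clean up.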
Obviously there are a ton of specifics to figure out here, but this style of approach has many promising aspects. And it has a chance at being simple enough to gain better traction in situations where the complexity of solutions like OpenCL is burdensome.