Make u8x16 and u8x32 have Vector call ABI #589

Closed

Conversation

gnzlbg (Contributor) commented Jul 1, 2019:

Before this commit, u8x16 and u8x32 were repr(Rust) unions. This introduced
unspecified behavior: the field offsets of repr(Rust) unions are not
guaranteed to be 0, so field accesses were potentially UB.
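
A minimal sketch of the kind of layout in question (assumed shape, not the exact code from this crate):

```rust
// A repr(Rust) union punning a SIMD vector with a byte array. Without an
// explicit repr, the compiler makes no guarantee that both fields live at
// offset 0, so (unsafe) reads of one field after writing the other relied
// on unspecified layout.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::__m128i;

#[cfg(target_arch = "x86_64")]
#[allow(dead_code)]
union U8x16 {
    vector: __m128i,
    bytes: [u8; 16],
}
```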

This commit fixes that, and closes #588.

The unions were also generating a lot of unnecessary memory operations. This
commit fixes that as well.

The issue is that unions have an Aggregate call ABI, which is the same as the
call ABI of arrays. That is, they are passed around via memory, not in Vector
registers.

This would be fine if one operated on them as arrays most of the time. That
was, however, not the case: most operations on these unions use SIMD
instructions. This means the union needs to be copied into a SIMD register,
operated on, and then spilled back to the stack on every single operation.
That's unnecessary, although LLVM was apparently able to optimize all the
redundant memory operations away and keep these values in registers.

This commit fixes this issue as well, by making u8x16 and u8x32
repr(transparent) newtypes over the architecture-specific vector types, giving
them the Vector ABI.
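
A minimal sketch of the new shape (the AVX2 variant; names assumed to mirror the crate's):

```rust
// repr(transparent) guarantees that u8x32 has exactly the layout and call
// ABI of its single field, so values are passed in SIMD registers rather
// than through memory.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::__m256i;

#[cfg(target_arch = "x86_64")]
#[allow(non_camel_case_types)]
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct u8x32(__m256i);
```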

The vectors are then copied to the stack only when necessary, and as little as
possible. This is done using mem::transmute, removing the need for unions
altogether (fixing #588 by not having to worry about union layout at all).

To make it clear when the vectors are spilled onto the stack, the
vector::replace(index, value) and vector::extract(index) APIs have been
removed; only vector::bytes(self) and vector::from_bytes(&mut self, [u8; N])
are provided instead. This prevents spilling the vectors back and forth onto
the stack every time an index needs to be modified: vector::bytes spills the
vector to the stack once, all the random-access modifications happen in
memory, and vector::from_bytes moves the memory back into a SIMD register
only once.
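
A sketch of this pair of APIs on the u8x32 newtype above (assumed signatures; in the diff below the writing half appears as replace_bytes):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// One explicit spill: copy the vector out of its SIMD register into a
    /// byte array on the stack. size_of::<__m256i>() == 32, so this is a
    /// plain 32-byte bit copy.
    #[inline]
    pub fn bytes(self) -> [u8; 32] {
        unsafe { std::mem::transmute(self.0) }
    }

    /// One explicit fill: move a byte array back into the SIMD register.
    #[inline]
    pub fn replace_bytes(&mut self, value: [u8; 32]) {
        self.0 = unsafe { std::mem::transmute(value) };
    }
}
```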


I haven't run any benchmarks, so please benchmark before merging to make sure this does not introduce any performance regressions.

@gnzlbg force-pushed the vector_abi branch 2 times, most recently from 3b47633 to 2e33415 on July 1, 2019 08:12
src/vector/ssse3.rs:

    }

    #[inline]
    pub fn replace_bytes(&mut self, value: [u8; 32]) {
Contributor commented:
Perhaps better to have bytes(&self) -> &[u8; 32] and bytes_mut(&mut self) -> &mut [u8; 32]? It would be more Rust-idiomatic than providing a getter+setter, and in some cases might give a faster way to change separate bytes without replacing the whole thing.
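
As a sketch, such reference-returning accessors could look like this (hypothetical code reusing the u8x32 newtype sketched earlier, with pointer casts instead of transmute, and renamed bytes_ref to avoid clashing with the by-value bytes above):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// View the vector's storage in place as a byte array.
    pub fn bytes_ref(&self) -> &[u8; 32] {
        // [u8; 32] has size 32 and alignment 1, so the cast is valid for
        // any 32-byte vector.
        unsafe { &*(self as *const u8x32 as *const [u8; 32]) }
    }

    /// Mutate individual bytes in place, with no explicit whole-vector
    /// spill in the source code.
    pub fn bytes_mut(&mut self) -> &mut [u8; 32] {
        unsafe { &mut *(self as *mut u8x32 as *mut [u8; 32]) }
    }
}
```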

gnzlbg (Author) commented Jul 1, 2019:

> Perhaps better to have bytes(&self) -> &[u8; 32]

Where does that [u8; 32] live? (Same for bytes_mut.) Or are you suggesting transmuting &__m256i into a &[u8; 32]?

Contributor commented:

The body of the function wouldn't change - you'd still use transmute, just between references and not values.

Contributor commented:

Here's an example with both implementations, and you can see how the code for setting a single byte differs between the two: https://rust.godbolt.org/z/XF7m95

gnzlbg (Author) commented Jul 1, 2019:

If the __m256i is in a SIMD register, it doesn't have a memory address, so creating a pointer to it requires spilling it to the stack to give it one. Once the modification on the stack is done, moving it back into a SIMD register requires copying the whole thing, not a single byte (at least when dynamic indices are involved; when the indices are compile-time constants, the compiler can sometimes do better).

One doesn't want this back-and-forth to happen accidentally every time one modifies part of the vector, and returning a &[u8; 32] would encourage that.

The proposed API forces its users to explicitly move the vector contents between the registers and memory. This revealed a couple of places where, e.g., instead of doing this back-and-forth on every iteration of a loop, one can spill the vector to memory once before the loop, operate on memory, and move the contents back into a SIMD register after the loop has completed.
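
For instance (a hypothetical example of that pattern, written against the sketched API above):

```rust
#[cfg(target_arch = "x86_64")]
fn fill_pattern(v: &mut u8x32) {
    let mut bytes = v.bytes(); // one spill to the stack, before the loop
    for (i, b) in bytes.iter_mut().enumerate() {
        *b = (i as u8).wrapping_mul(3); // random-access writes in memory
    }
    v.replace_bytes(bytes); // one move back into a SIMD register, after it
}
```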

Contributor commented:

That's a good point. I'd argue that having several .bytes_mut()[...] = ... patterns in a row would also be pretty obvious, and most people would store the reference when they could, but I see how your proposed API forces this a bit better.

However, if that's the goal, I wonder if it's worth going one step further and taking mutations as a callback like .with_bytes(|bytes| ...), where one can modify the contents in any way they want inside, but on the type-system level can't do anything else outside, and the transformation back to a register is guaranteed to happen as part of the call?
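
Roughly this shape (a hypothetical sketch; this API was never adopted):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// Spill once, hand the bytes to the closure, and unconditionally move
    /// the (possibly modified) bytes back into the register afterwards.
    pub fn with_bytes(&mut self, f: impl FnOnce(&mut [u8; 32])) {
        let mut bytes = self.bytes();
        f(&mut bytes);
        self.replace_bytes(bytes);
    }
}
```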

gnzlbg (Author) commented:

> However, if that's the goal, I wonder if it's worth going one step further and taking mutations as a callback like .with_bytes(|bytes| ...), where one can modify the contents in any way they want inside, but on the type-system level can't do anything else outside, and the transformation back to a register is guaranteed to happen as part of the call?

Some code only wants to read the bytes, so forcing a write would be bad for that code - I've pinged you in one example.

I also don't really see an advantage in restricting access to a scope. If you wanted to mutate two vectors, you would need to nest scopes. For an API that's only intended for internal consumption, this would feel even more like overengineering than the one I proposed - and I consider mine borderline overengineering.

Contributor commented:

Okay, fair enough. I'm not 100% comfortable with replace_bytes vs bytes_mut for idiomaticity reasons, but I can see how this is the least of the evils :)

gnzlbg (Author) commented Jul 1, 2019:

> Here's an example with both implementations, and you can see how the code for setting a single byte differs between the two: https://rust.godbolt.org/z/XF7m95

As mentioned, this should be properly benchmarked before being merged. Those examples do not really use the intrinsics, so the code is being generated in isolation without any objective in mind. That is not very representative of what this library actually does. If you actually try to use the APIs, you'll see that they generate the exact same code when all optimizations are turned on: https://rust.godbolt.org/z/Z8sdH9

But this is not about optimizing the implementation, this is about optimizing the amount of work that LLVM has to do to produce efficient code. If you look at the LLVM-IR produced in debug mode: https://rust.godbolt.org/z/sMfxHm the version using replace_bytes produces 134 lines of LLVM-IR, while the version using bytes_mut produces 758 lines of LLVM-IR (if you look at the assembly at opt-level=1, you also see a much better result). There are a couple of factors at play here, but producing 6x more LLVM-IR for this isn't really worth it. LLVM can optimize it without problems, at least in this case where everything is private.

gnzlbg (Author) commented Jul 3, 2019:

@BurntSushi my benchmark results are here: https://gist.github.com/gnzlbg/dba8836f80237e0affd63d13d6274a37

Feels like noise. Is there a benchmark that's particularly heavy on the SSE and AVX implementations?

Maybe someone else can also benchmark this?

@BurntSushi closed this in 9f701e3 on Jul 4, 2019.
BurntSushi (Member) commented:
Thanks so much for this! The vector/aggregate ABI issue is something I was completely unaware of, and in retrospect, makes a lot of sense. When I first ported the Teddy code to std::arch (which is when I introduced the errant union type punning), I noticed that small perturbations to the code would result in fairly different codegen, in particular, with lots of additional mov instructions. It now seems likely that the root cause of this was the tension you described between the aggregate/vector ABIs. It seems LLVM was really saving my ass here. :P

In any case, I've run the benchmarks myself and everything looks okay. (The benchmarks are unfortunately pretty noisy.) I also did some ad hoc benchmarks by hand with perf and inspected the codegen. The codegen appears to be unchanged, with the exception of one fewer mov instruction, which makes sense given your explanation.

BurntSushi (Member) commented:
This PR is now in regex 1.1.8 on crates.io.

Linked issue: Unions used for type-punning should be repr(C) (#588)