Make u8x16 and u8x32 have Vector call ABI #589

Closed

Conversation

gnzlbg (Contributor) commented Jul 1, 2019:

Before this commit, u8x16 and u8x32 were repr(Rust) unions. This introduced
unspecified behavior: the field offsets of repr(Rust) unions are not
guaranteed to be 0, so field accesses were potentially UB.
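
A minimal sketch of the kind of layout in question (assumed shape, not the exact code from this crate):

```rust
// A repr(Rust) union punning a SIMD vector with a byte array. Without an
// explicit repr, the compiler makes no guarantee that both fields live at
// offset 0, so (unsafe) reads of one field after writing the other relied
// on unspecified layout.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::__m128i;

#[cfg(target_arch = "x86_64")]
#[allow(dead_code)]
union U8x16 {
    vector: __m128i,
    bytes: [u8; 16],
}
```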

This commit fixes that, and closes #588.

The unions were also generating a lot of unnecessary memory operations. This
commit fixes that as well.

The issue is that unions have an Aggregate call ABI, which is the same as the
call ABI of arrays. That is, they are passed around via memory, not in Vector
registers.

This would be fine if one operated on them as arrays most of the time. That
was, however, not the case: most operations on these unions use SIMD
instructions. This means the union needs to be copied into a SIMD register,
operated on, and then spilled back to the stack on every single operation.
That's unnecessary, although LLVM was apparently able to optimize all the
redundant memory operations away and keep these values in registers.

This commit fixes this issue as well, by making u8x16 and u8x32
repr(transparent) newtypes over the architecture-specific vector types, giving
them the Vector ABI.
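
A minimal sketch of the new shape (the AVX2 variant; names assumed to mirror the crate's):

```rust
// repr(transparent) guarantees that u8x32 has exactly the layout and call
// ABI of its single field, so values are passed in SIMD registers rather
// than through memory.
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::__m256i;

#[cfg(target_arch = "x86_64")]
#[allow(non_camel_case_types)]
#[derive(Clone, Copy)]
#[repr(transparent)]
pub struct u8x32(__m256i);
```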

The vectors are then copied to the stack only when necessary, and as little as
possible. This is done using mem::transmute, removing the need for unions
altogether (fixing #588 by not having to worry about union layout at all).

To make it clear when the vectors are spilled onto the stack, the
vector::replace(index, value) and vector::extract(index) APIs have been
removed; only vector::bytes(self) and vector::from_bytes(&mut self, [u8; N])
are provided instead. This prevents spilling the vectors back and forth onto
the stack every time an index needs to be modified: vector::bytes spills the
vector to the stack once, all the random-access modifications happen in
memory, and vector::from_bytes moves the memory back into a SIMD register
only once.
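
A sketch of this pair of APIs on the u8x32 newtype above (assumed signatures; in the diff below the writing half appears as replace_bytes):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// One explicit spill: copy the vector out of its SIMD register into a
    /// byte array on the stack. size_of::<__m256i>() == 32, so this is a
    /// plain 32-byte bit copy.
    #[inline]
    pub fn bytes(self) -> [u8; 32] {
        unsafe { std::mem::transmute(self.0) }
    }

    /// One explicit fill: move a byte array back into the SIMD register.
    #[inline]
    pub fn replace_bytes(&mut self, value: [u8; 32]) {
        self.0 = unsafe { std::mem::transmute(value) };
    }
}
```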


I haven't run any benchmarks, so please benchmark before merging to make sure this does not introduce any performance regressions.

@gnzlbg force-pushed the vector_abi branch 2 times, most recently from 3b47633 to 2e33415 on July 1, 2019 08:12
src/vector/ssse3.rs:

    }

    #[inline]
    pub fn replace_bytes(&mut self, value: [u8; 32]) {
Contributor commented:
Perhaps better to have bytes(&self) -> &[u8; 32] and bytes_mut(&mut self) -> &mut [u8; 32]? It would be more Rust-idiomatic than providing a getter+setter, and in some cases might give a faster way to change separate bytes without replacing the whole thing.
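
As a sketch, such reference-returning accessors could look like this (hypothetical code reusing the u8x32 newtype sketched earlier, with pointer casts instead of transmute, and renamed bytes_ref to avoid clashing with the by-value bytes above):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// View the vector's storage in place as a byte array.
    pub fn bytes_ref(&self) -> &[u8; 32] {
        // [u8; 32] has size 32 and alignment 1, so the cast is valid for
        // any 32-byte vector.
        unsafe { &*(self as *const u8x32 as *const [u8; 32]) }
    }

    /// Mutate individual bytes in place, with no explicit whole-vector
    /// spill in the source code.
    pub fn bytes_mut(&mut self) -> &mut [u8; 32] {
        unsafe { &mut *(self as *mut u8x32 as *mut [u8; 32]) }
    }
}
```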

gnzlbg (Author) commented Jul 1, 2019:

> Perhaps better to have bytes(&self) -> &[u8; 32]

Where does that [u8; 32] live? (Same for bytes_mut.) Or are you suggesting transmuting &__m256i into a &[u8; 32]?

Contributor commented:

The body of the function wouldn't change - you'd still use transmute, just between references and not values.

Contributor commented:

Here's an example with both implementations, and you can see how the code for setting a single byte differs between the two: https://rust.godbolt.org/z/XF7m95

gnzlbg (Author) commented Jul 1, 2019:

If the __m256i is in a SIMD register, it doesn't have a memory address, so creating a pointer to it requires spilling it to the stack to give it one. Once the modification on the stack is done, moving it back into a SIMD register requires copying the whole thing, not a single byte (at least when dynamic indices are involved; when the indices are compile-time constants, the compiler can sometimes do better).

One doesn't want this back-and-forth to happen accidentally every time one modifies part of the vector, and returning a &[u8; 32] would encourage that.

The proposed API forces its users to explicitly move the vector contents between the registers and memory. This revealed a couple of places where, e.g., instead of doing this back-and-forth on every iteration of a loop, one can spill the vector to memory once before the loop, operate on memory, and move the contents back into a SIMD register after the loop has completed.
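
For instance (a hypothetical example of that pattern, written against the sketched API above):

```rust
#[cfg(target_arch = "x86_64")]
fn fill_pattern(v: &mut u8x32) {
    let mut bytes = v.bytes(); // one spill to the stack, before the loop
    for (i, b) in bytes.iter_mut().enumerate() {
        *b = (i as u8).wrapping_mul(3); // random-access writes in memory
    }
    v.replace_bytes(bytes); // one move back into a SIMD register, after it
}
```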

Contributor commented:

That's a good point. I'd argue that having several .bytes_mut()[...] = ... patterns in a row would also be pretty obvious, and most people would store the reference when they could, but I see how your proposed API forces this a bit better.

However, if that's the goal, I wonder if it's worth going one step further and taking mutations as a callback like .with_bytes(|bytes| ...), where one can modify the contents in any way they want inside, but on the type-system level can't do anything else outside, and the transformation back to a register is guaranteed to happen as part of the call?
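
Roughly this shape (a hypothetical sketch; this API was never adopted):

```rust
#[cfg(target_arch = "x86_64")]
impl u8x32 {
    /// Spill once, hand the bytes to the closure, and unconditionally move
    /// the (possibly modified) bytes back into the register afterwards.
    pub fn with_bytes(&mut self, f: impl FnOnce(&mut [u8; 32])) {
        let mut bytes = self.bytes();
        f(&mut bytes);
        self.replace_bytes(bytes);
    }
}
```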

gnzlbg (Author) commented:

> However, if that's the goal, I wonder if it's worth going one step further and taking mutations as a callback like .with_bytes(|bytes| ...), where one can modify the contents in any way they want inside, but on the type-system level can't do anything else outside, and the transformation back to a register is guaranteed to happen as part of the call?

Some code only wants to read the bytes, so forcing a write would be bad for that code - I've pinged you in one example.

I also don't really see an advantage in restricting access to a scope. If you wanted to mutate two vectors, you would need to nest scopes. For an API that's only intended for internal consumption, this would feel even more like overengineering than the one I proposed - and I consider mine borderline overengineering.

Contributor commented:

Okay, fair enough. I'm not 100% comfortable with replace_bytes vs bytes_mut for idiomaticity reasons, but I can see how this is the least of the evils :)

gnzlbg (Author) commented Jul 1, 2019:

> Here's an example with both implementations, and you can see how the code for setting a single byte differs between the two: https://rust.godbolt.org/z/XF7m95

As mentioned, this should be properly benchmarked before being merged. Those examples do not really use the intrinsics, so the code is being generated in isolation without any objective in mind. That is not very representative of what this library actually does. If you actually try to use the APIs, you'll see that they generate the exact same code when all optimizations are turned on: https://rust.godbolt.org/z/Z8sdH9

But this is not about optimizing the implementation, this is about optimizing the amount of work that LLVM has to do to produce efficient code. If you look at the LLVM-IR produced in debug mode: https://rust.godbolt.org/z/sMfxHm the version using replace_bytes produces 134 lines of LLVM-IR, while the version using bytes_mut produces 758 lines of LLVM-IR (if you look at the assembly at opt-level=1, you also see a much better result). There are a couple of factors at play here, but producing 6x more LLVM-IR for this isn't really worth it. LLVM can optimize it without problems, at least in this case where everything is private.

gnzlbg (Author) commented Jul 3, 2019:

@BurntSushi my benchmark results are here: https://gist.github.com/gnzlbg/dba8836f80237e0affd63d13d6274a37

Feels like noise. Is there a benchmark that's particularly heavy on the SSE and AVX implementations?

Maybe someone else can also benchmark this?

@BurntSushi closed this in 9f701e3 on Jul 4, 2019.
BurntSushi (Member) commented:
Thanks so much for this! The vector/aggregate ABI issue is something I was completely unaware of, and in retrospect, makes a lot of sense. When I first ported the Teddy code to std::arch (which is when I introduced the errant union type punning), I noticed that small perturbations to the code would result in fairly different codegen, in particular, with lots of additional mov instructions. It now seems likely that the root cause of this was the tension you described between the aggregate/vector ABIs. It seems LLVM was really saving my ass here. :P

In any case, I've run the benchmarks myself and everything looks okay. (The benchmarks are unfortunately pretty noisy.) I also did some ad hoc benchmarks by hand with perf and inspected the codegen. The codegen appears to be unchanged, with the exception of one fewer mov instruction, which makes sense given your explanation.

BurntSushi (Member) commented:
This PR is now in regex 1.1.8 on crates.io.

Linked issue: Unions used for type-punning should be repr(C) (#588)