Auto merge of #107634 - scottmcm:array-drain, r=thomcc

Improve the `array::map` codegen The `map` method on arrays [is documented as sometimes performing poorly](https://doc.rust-lang.org/std/primitive.array.html#note-on-performance-and-stack-usage), and after [a question on URLO](https://users.rust-lang.org/t/try-trait-residual-o-trait-and-try-collect-into-array/88510?u=scottmcm) prompted me to take another look at the core [`try_collect_into_array`](https://github.com/rust-lang/rust/blob/7c46fb2111936ad21a8e3aa41e9128752357f5d8/library/core/src/array/mod.rs#L865-L912) function, I had some ideas that ended up working better than I'd expected. There's three main ideas in here, split over three commits: 1. Don't use `array::IntoIter` when we can avoid it, since that seems to not get SRoA'd, meaning that every step writes things like loop counters into the stack unnecessarily 2. Don't return arrays in `Result`s unnecessarily, as that doesn't seem to optimize away even with `unwrap_unchecked` (perhaps because it needs to get moved into a new LLVM type to account for the discriminant) 3. Don't distract LLVM with all the `Option` dances when we know for sure we have enough items (like in `map` and `zip`). This one's a larger commit as to do it I ended up adding a new `pub(crate)` trait, but hopefully those changes are still straight-forward. (No libs-api changes; everything should be completely implementation-detail-internal.) It's still not completely fixed -- I think it needs pcwalton's `memcpy` optimizations still (#103830) to get further -- but this seems to go much better than before. And the remaining `memcpy`s are just `transmute`-equivalent (`[T; N] -> ManuallyDrop<[T; N]>` and `[MaybeUninit<T>; N] -> [T; N]`), so hopefully those will be easier to remove with LLVM16 than the previous subobject copies 🤞 r? `@thomcc` As a simple example, this test ```rust pub fn long_integer_map(x: [u32; 64]) -> [u32; 64] { x.map(|x| 13 * x + 7) } ``` On nightly <https://rust.godbolt.org/z/xK7548TGj> takes `sub rsp, 808` ```llvm start: %array.i.i.i.i = alloca [64 x i32], align 4 %_3.sroa.5.i.i.i = alloca [65 x i32], align 4 %_5.i = alloca %"core::iter::adapters::map::Map<core::array::iter::IntoIter<u32, 64>, [closure@/app/example.rs:2:11: 2:14]>", align 8 ``` (and yes, that's a 6**5**-element array `alloca` despite 6**4**-element input and output) But with this PR it's only `sub rsp, 520` ```llvm start: %array.i.i.i.i.i.i = alloca [64 x i32], align 4 %array1.i.i.i = alloca %"core::mem::manually_drop::ManuallyDrop<[u32; 64]>", align 4 ``` Similarly, the loop it emits on nightly is scalar-only and horrifying ```nasm .LBB0_1: mov esi, 64 mov edi, 0 cmp rdx, 64 je .LBB0_3 lea rsi, [rdx + 1] mov qword ptr [rsp + 784], rsi mov r8d, dword ptr [rsp + 4*rdx + 528] mov edi, 1 lea edx, [r8 + 2*r8] lea r8d, [r8 + 4*rdx] add r8d, 7 .LBB0_3: test edi, edi je .LBB0_11 mov dword ptr [rsp + 4*rcx + 272], r8d cmp rsi, 64 jne .LBB0_6 xor r8d, r8d mov edx, 64 test r8d, r8d jne .LBB0_8 jmp .LBB0_11 .LBB0_6: lea rdx, [rsi + 1] mov qword ptr [rsp + 784], rdx mov edi, dword ptr [rsp + 4*rsi + 528] mov r8d, 1 lea esi, [rdi + 2*rdi] lea edi, [rdi + 4*rsi] add edi, 7 test r8d, r8d je .LBB0_11 .LBB0_8: mov dword ptr [rsp + 4*rcx + 276], edi add rcx, 2 cmp rcx, 64 jne .LBB0_1 ``` whereas with this PR it's unrolled and vectorized ```nasm vpmulld ymm1, ymm0, ymmword ptr [rsp + 64] vpaddd ymm1, ymm1, ymm2 vmovdqu ymmword ptr [rsp + 328], ymm1 vpmulld ymm1, ymm0, ymmword ptr [rsp + 96] vpaddd ymm1, ymm1, ymm2 vmovdqu ymmword ptr [rsp + 360], ymm1 ``` (though sadly still stack-to-stack)
rust-lang · Feb 13, 2023 · 2d91939 · 2d91939
2 parents 2008188 + bb77860
commit 2d91939
Show file tree

Hide file tree

Showing 16 changed files with 395 additions and 142 deletions.
diff --git a/library/core/src/array/drain.rs b/library/core/src/array/drain.rs
@@ -0,0 +1,76 @@
+use crate::iter::{TrustedLen, UncheckedIterator};
+use crate::mem::ManuallyDrop;
+use crate::ptr::drop_in_place;
+use crate::slice;
+
+/// A situationally-optimized version of `array.into_iter().for_each(func)`.
+///
+/// [`crate::array::IntoIter`]s are great when you need an owned iterator, but
+/// storing the entire array *inside* the iterator like that can sometimes
+/// pessimize code.  Notable, it can be more bytes than you really want to move
+/// around, and because the array accesses index into it SRoA has a harder time
+/// optimizing away the type than it does iterators that just hold a couple pointers.
+///
+/// Thus this function exists, which gives a way to get *moved* access to the
+/// elements of an array using a small iterator -- no bigger than a slice iterator.
+///
+/// The function-taking-a-closure structure makes it safe, as it keeps callers
+/// from looking at already-dropped elements.
+pub(crate) fn drain_array_with<T, R, const N: usize>(
+    array: [T; N],
+    func: impl for<'a> FnOnce(Drain<'a, T>) -> R,
+) -> R {
+    let mut array = ManuallyDrop::new(array);
+    // SAFETY: Now that the local won't drop it, it's ok to construct the `Drain` which will.
+    let drain = Drain(array.iter_mut());
+    func(drain)
+}
+
+/// See [`drain_array_with`] -- this is `pub(crate)` only so it's allowed to be
+/// mentioned in the signature of that method.  (Otherwise it hits `E0446`.)
+// INVARIANT: It's ok to drop the remainder of the inner iterator.
+pub(crate) struct Drain<'a, T>(slice::IterMut<'a, T>);
+
+impl<T> Drop for Drain<'_, T> {
+    fn drop(&mut self) {
+        // SAFETY: By the type invariant, we're allowed to drop all these.
+        unsafe { drop_in_place(self.0.as_mut_slice()) }
+    }
+}
+
+impl<T> Iterator for Drain<'_, T> {
+    type Item = T;
+
+    #[inline]
+    fn next(&mut self) -> Option<T> {
+        let p: *const T = self.0.next()?;
+        // SAFETY: The iterator was already advanced, so we won't drop this later.
+        Some(unsafe { p.read() })
+    }
+
+    #[inline]
+    fn size_hint(&self) -> (usize, Option<usize>) {
+        let n = self.len();
+        (n, Some(n))
+    }
+}
+
+impl<T> ExactSizeIterator for Drain<'_, T> {
+    #[inline]
+    fn len(&self) -> usize {
+        self.0.len()
+    }
+}
+
+// SAFETY: This is a 1:1 wrapper for a slice iterator, which is also `TrustedLen`.
+unsafe impl<T> TrustedLen for Drain<'_, T> {}
+
+impl<T> UncheckedIterator for Drain<'_, T> {
+    unsafe fn next_unchecked(&mut self) -> T {
+        // SAFETY: `Drain` is 1:1 with the inner iterator, so if the caller promised
+        // that there's an element left, the inner iterator has one too.
+        let p: *const T = unsafe { self.0.next_unchecked() };
+        // SAFETY: The iterator was already advanced, so we won't drop this later.
+        unsafe { p.read() }
+    }
+}