Improve ptr_rotate performance, tests, and benches #61937

Merged · 1 commit · Aug 9, 2019

Changes from all commits
26 changes: 26 additions & 0 deletions src/libcore/benches/slice.rs
@@ -55,3 +55,29 @@ fn binary_search_l2_with_dups(b: &mut Bencher) {
fn binary_search_l3_with_dups(b: &mut Bencher) {
binary_search(b, Cache::L3, |i| i / 16 * 16);
}

macro_rules! rotate {
    ($fn:ident, $n:expr, $mapper:expr) => {
        #[bench]
        fn $fn(b: &mut Bencher) {
            let mut x = (0usize..$n).map(&$mapper).collect::<Vec<_>>();
            b.iter(|| {
                for s in 0..x.len() {
                    x[..].rotate_right(s);
                }
                black_box(x[0].clone())
            })
        }
    };
}

#[derive(Clone)]
struct Rgb(u8, u8, u8);

rotate!(rotate_u8, 32, |i| i as u8);
rotate!(rotate_rgb, 32, |i| Rgb(i as u8, (i as u8).wrapping_add(7), (i as u8).wrapping_add(42)));
rotate!(rotate_usize, 32, |i| i);
rotate!(rotate_16_usize_4, 16, |i| [i; 4]);
rotate!(rotate_16_usize_5, 16, |i| [i; 5]);
rotate!(rotate_64_usize_4, 64, |i| [i; 4]);
rotate!(rotate_64_usize_5, 64, |i| [i; 5]);
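For reference, here is roughly what one instantiation of the macro expands to — a sketch assuming the nightly `test` crate imports (`Bencher`, `black_box`) that libcore bench files already use, not part of the diff itself:

```rust
extern crate test; // requires #![feature(test)] on nightly
use test::{black_box, Bencher};

// Hypothetical expansion of `rotate!(rotate_u8, 32, |i| i as u8)`:
#[bench]
fn rotate_u8(b: &mut Bencher) {
    // 32 elements, mapped through the closure `|i| i as u8`.
    let mut x = (0usize..32).map(|i| i as u8).collect::<Vec<_>>();
    b.iter(|| {
        // Perform every shift size once, so a single iteration exercises
        // the whole range of `left`/`right` splits for this length.
        for s in 0..x.len() {
            x[..].rotate_right(s);
        }
        // Keep the result observable so the rotations aren't optimized away.
        black_box(x[0].clone())
    })
}
```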
221 changes: 152 additions & 69 deletions src/libcore/slice/rotate.rs
@@ -2,88 +2,171 @@ use crate::cmp;
use crate::mem::{self, MaybeUninit};
use crate::ptr;


/// Rotates the range `[mid-left, mid+right)` such that the element at `mid` becomes the first
/// element. Equivalently, rotates the range `left` elements to the left or `right` elements to the
/// right.
///
/// # Safety
///
/// The specified range must be valid for reading and writing.
///
/// # Algorithm
///
/// Algorithm 1 is used for small values of `left + right` or for large `T`. The elements are moved
/// into their final positions one at a time starting at `mid - left` and advancing by `right` steps
/// modulo `left + right`, such that only one temporary is needed. Eventually, we arrive back at
/// `mid - left`. However, if `gcd(left + right, right)` is not 1, the above steps skipped over
/// elements. For example:
/// ```text
/// left = 10, right = 6
/// the `^` indicates an element in its final place
/// 6 7 8 9 10 11 12 13 14 15 . 0 1 2 3 4 5
/// after using one step of the above algorithm (the X will be overwritten at the end of the round,
/// and 12 is stored in a temporary):
/// X 7 8 9 10 11 6 13 14 15 . 0 1 2 3 4 5
/// ^
/// after using another step (now 2 is in the temporary):
/// X 7 8 9 10 11 6 13 14 15 . 0 1 12 3 4 5
/// ^ ^
/// after the third step (the steps wrap around, and 8 is in the temporary):
/// X 7 2 9 10 11 6 13 14 15 . 0 1 12 3 4 5
/// ^ ^ ^
/// after 7 more steps, the round ends with the temporary 0 getting put in the X:
/// 0 7 2 9 4 11 6 13 8 15 . 10 1 12 3 14 5
/// ^ ^ ^ ^ ^ ^ ^ ^
/// ```
/// Fortunately, the number of skipped-over elements between finalized elements is always equal, so
/// we can just offset our starting position and do more rounds (the total number of rounds is the
/// `gcd(left + right, right)` value). The end result is that all elements are finalized once and
/// only once.
///
/// Algorithm 2 is used if `left + right` is large but `min(left, right)` is small enough to
/// fit onto a stack buffer. The `min(left, right)` elements are copied onto the buffer, `memmove`
/// is applied to the others, and the ones on the buffer are moved back into the hole on the
/// opposite side of where they originated.
///
/// Algorithms that can be vectorized outperform the above once `left + right` becomes large enough.
/// Algorithm 1 can be vectorized by chunking and performing many rounds at once, but there are too
/// few rounds on average until `left + right` is enormous, and the worst case of a single
/// round is always there. Instead, algorithm 3 utilizes repeated swapping of
/// `min(left, right)` elements until a smaller rotate problem is left.
///
/// ```text
/// left = 11, right = 4
/// [4 5 6 7 8 9 10 11 12 13 14 . 0 1 2 3]
/// ^ ^ ^ ^ ^ ^ ^ ^ swapping the right most elements with elements to the left
/// [4 5 6 7 8 9 10 . 0 1 2 3] 11 12 13 14
/// ^ ^ ^ ^ ^ ^ ^ ^ swapping these
/// [4 5 6 . 0 1 2 3] 7 8 9 10 11 12 13 14
/// we cannot swap any more, but a smaller rotation problem is left to solve
/// ```
/// When `left < right`, the swapping happens from the left instead.
pub unsafe fn ptr_rotate<T>(mut left: usize, mut mid: *mut T, mut right: usize) {
    type BufType = [usize; 32];
    if mem::size_of::<T>() == 0 {
        return;
    }
    loop {
        // N.B. the below algorithms can fail if these cases are not checked
        if (right == 0) || (left == 0) {
            return;
        }
        if (left + right < 24) || (mem::size_of::<T>() > mem::size_of::<[usize; 4]>()) {
            // Algorithm 1
            // Microbenchmarks indicate that the average performance for random shifts is better all
            // the way until about `left + right == 32`, but the worst case performance breaks even
            // around 16. 24 was chosen as middle ground. If the size of `T` is larger than 4
            // `usize`s, this algorithm also outperforms other algorithms.
            let x = mid.sub(left);
            // beginning of first round
            let mut tmp: T = x.read();
            let mut i = right;
            // `gcd` can be found beforehand by calculating `gcd(left + right, right)`,
            // but it is faster to do one loop which calculates the gcd as a side effect, then
            // doing the rest of the chunk
            let mut gcd = right;
            // benchmarks reveal that it is faster to swap temporaries all the way through instead
            // of reading one temporary once, copying backwards, and then writing that temporary at
            // the very end. This is possibly due to the fact that swapping or replacing temporaries
            // uses only one memory address in the loop instead of needing to manage two.
            loop {
                tmp = x.add(i).replace(tmp);
                // instead of incrementing `i` and then checking if it is outside the bounds, we
                // check if `i` will go outside the bounds on the next increment. This prevents
                // any wrapping of pointers or `usize`.
                if i >= left {
                    i -= left;
                    if i == 0 {
                        // end of first round
                        x.write(tmp);
                        break;
                    }
                    // this conditional must be here if `left + right >= 15`
                    if i < gcd {
                        gcd = i;
                    }
                } else {
                    i += right;
                }
            }
            // finish the chunk with more rounds
            for start in 1..gcd {
                tmp = x.add(start).read();
                i = start + right;
                loop {
                    tmp = x.add(i).replace(tmp);
                    if i >= left {
                        i -= left;
                        if i == start {
                            x.add(start).write(tmp);
                            break;
                        }
                    } else {
                        i += right;
                    }
                }
            }
            return;
        // `T` is not a zero-sized type, so it's okay to divide by its size.
        } else if cmp::min(left, right) <= mem::size_of::<BufType>() / mem::size_of::<T>() {
            // Algorithm 2
            // The `[T; 0]` here is to ensure this is appropriately aligned for T
            let mut rawarray = MaybeUninit::<(BufType, [T; 0])>::uninit();
            let buf = rawarray.as_mut_ptr() as *mut T;
            let dim = mid.sub(left).add(right);
            if left <= right {
                ptr::copy_nonoverlapping(mid.sub(left), buf, left);
                ptr::copy(mid, mid.sub(left), right);
                ptr::copy_nonoverlapping(buf, dim, left);
            } else {
                ptr::copy_nonoverlapping(mid, buf, right);
                ptr::copy(mid.sub(left), dim, left);
                ptr::copy_nonoverlapping(buf, mid.sub(left), right);
            }
            return;
        } else if left >= right {
            // Algorithm 3
            // There is an alternate way of swapping that involves finding where the last swap
            // of this algorithm would be, and swapping using that last chunk instead of swapping
            // adjacent chunks like this algorithm is doing, but this way is still faster.
            loop {
                ptr::swap_nonoverlapping(mid.sub(right), mid, right);
                mid = mid.sub(right);
                left -= right;
                if left < right {
                    break;
                }
            }
        } else {
            // Algorithm 3, `left < right`
            loop {
                ptr::swap_nonoverlapping(mid.sub(left), mid, left);
                mid = mid.add(left);
                right -= left;
                if right < left {
                    break;
                }
            }
        }
    }
}
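To make the three strategies easier to follow outside the unsafe pointer code, here is an editor's safe-Rust sketch of the cycle-walking idea behind Algorithm 1 — not the PR's code (the PR walks each cycle with `ptr::replace` and computes the gcd as a side effect of the first round, while this version reads ahead): rotate left by `k` using `gcd(len, k)` rounds and one temporary per round.

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Rotate `v` left by `k`: each of the `gcd(len, k)` rounds moves one
/// whole cycle of elements into place using a single temporary.
fn juggling_rotate_left<T: Copy>(v: &mut [T], k: usize) {
    let n = v.len();
    if n == 0 || k % n == 0 {
        return;
    }
    let k = k % n;
    for start in 0..gcd(n, k) {
        let tmp = v[start];
        let mut i = start;
        loop {
            let next = (i + k) % n;
            if next == start {
                break;
            }
            v[i] = v[next]; // pull the element `k` slots ahead into place
            i = next;
        }
        v[i] = tmp; // close the cycle
    }
}
```

Checking `juggling_rotate_left` against `slice::rotate_left` for all `(len, k)` up to some bound mirrors what `brute_force_rotate_test_0` below does for the real implementation. Similarly, a safe sketch of Algorithm 3's swap-reduction, again illustration only (the real code hands the small remainder off to Algorithm 1 or 2 instead of swapping all the way down):

```rust
/// Rotate `v` so the element at index `mid` becomes the first element,
/// by repeatedly swapping `min(left, right)` elements across `mid`.
fn swap_rotate<T>(v: &mut [T], mid: usize) {
    let (mut left, mut right) = (mid, v.len() - mid);
    let mut lo = 0; // active window is v[lo..lo + left + right]
    while left > 0 && right > 0 {
        if left >= right {
            // Swap the rightmost `right` elements of the left side with the
            // right side; this finalizes `right` elements at the window's end.
            for i in 0..right {
                v.swap(lo + left - right + i, lo + left + i);
            }
            left -= right;
        } else {
            // Mirror image: finalize `left` elements at the window's start.
            for i in 0..left {
                v.swap(lo + i, lo + left + i);
            }
            lo += left;
            right -= left;
        }
    }
}
```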
38 changes: 38 additions & 0 deletions src/libcore/tests/slice.rs
@@ -1130,6 +1130,44 @@ fn test_rotate_right() {
}
}

#[test]
#[cfg(not(miri))]
Member: These should come with a comment explaining why. I will add a "Miri is too slow" comment in one of the Miri-adjustment PRs I have in flight anyway.
fn brute_force_rotate_test_0() {
    // In case of edge cases involving multiple algorithms
    let n = 300;
    for len in 0..n {
        for s in 0..len {
            let mut v = Vec::with_capacity(len);
            for i in 0..len {
                v.push(i);
            }
            v[..].rotate_right(s);
            for i in 0..v.len() {
                assert_eq!(v[i], v.len().wrapping_add(i.wrapping_sub(s)) % v.len());
            }
        }
    }
}

#[test]
fn brute_force_rotate_test_1() {
    // `ptr_rotate` covers so many kinds of pointer usage that this is just a good test for
    // pointers in general. This uses a `[usize; 4]` to hit all algorithms without overwhelming Miri
Member: Thank you so much for keeping Miri in mind here. <3

    let n = 30;
    for len in 0..n {
        for s in 0..len {
            let mut v: Vec<[usize; 4]> = Vec::with_capacity(len);
            for i in 0..len {
                v.push([i, 0, 0, 0]);
            }
            v[..].rotate_right(s);
            for i in 0..v.len() {
                assert_eq!(v[i][0], v.len().wrapping_add(i.wrapping_sub(s)) % v.len());
            }
        }
    }
}
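As a quick sanity check of the formula these tests assert (the `wrapping_*` calls merely avoid underflow when `i < s`): after `rotate_right(s)` on the identity vector, index `i` holds `(len + i - s) % len`. A minimal standalone version:

```rust
fn main() {
    let (len, s) = (7, 3);
    let mut v: Vec<usize> = (0..len).collect();
    v.rotate_right(s);
    assert_eq!(v, vec![4, 5, 6, 0, 1, 2, 3]);
    for i in 0..len {
        // Same check as the brute-force tests, without wrapping arithmetic.
        assert_eq!(v[i], (len + i - s) % len);
    }
}
```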

#[test]
#[cfg(not(target_arch = "wasm32"))]
fn sort_unstable() {