Make ptr_rotate faster #61784

Closed · AaronKutch opened this issue Jun 12, 2019 · 10 comments
Labels
C-enhancement (Category: An issue proposing an enhancement or a PR with one.)
I-slow (Issue: Problems and improvements with respect to performance of generated code.)
T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)

Comments

@AaronKutch (Contributor) commented Jun 12, 2019

I was going to directly PR the performance improvement, but since this is my first time writing a bunch of unsafe code, and because x.py will not work for me (#61611), I decided to open an issue and point to this repository filled with all the benchmarks I wrote.

I tried over 20 different variations on algorithms before settling on this one.
The only thing I am concerned about is that algorithm 1 uses `T` values directly instead of indirect copies and swaps (the microbenchmarks did not like the indirect approach). I am pretty sure that it is panic safe, since there is no `T: Clone` bound anywhere.
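For context, the kind of thing I mean by using `T` values directly is something like this minimal sketch (my own illustration, not the actual implementation):

use core::ptr;

// A minimal sketch of the "direct T values" idea: a one-position rotation that
// holds a single T in a temporary. Only bitwise reads/writes happen while the
// temporary is live, so no user code such as Clone::clone can panic in the
// middle of a move.
unsafe fn rotate_left_by_one<T>(ptr: *mut T, len: usize) {
    if len < 2 {
        return;
    }
    // Move the first element out as a plain T value.
    let tmp = ptr::read(ptr);
    // Shift the remaining elements down one slot (an overlapping, memmove-like copy).
    ptr::copy(ptr.add(1), ptr, len - 1);
    // Put the saved value into the vacated last slot.
    ptr::write(ptr.add(len - 1), tmp);
}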

I benchmarked on both Intel and AMD CPUs. My algorithm performs significantly worse only on the x1031s149_new benchmark (probably because 1031 and 149 are both primes) and, on my AMD CPU, on the u8_new benchmark. I do not know what causes u8_new to be so much worse with the new algorithm on my AMD CPU, or whether the problem persists on newer AMD hardware. Every other benchmark indicates that my algorithm is as fast as or faster than the old one (note that because there are so many benchmarks, there may be outliers which you should benchmark again).

Of course, when I do the PR I am not going to include all of those benchmarks, but I am not sure what subset to include.

@Centril added the I-slow, T-libs-api, and C-enhancement labels on Jun 12, 2019
@scottmcm (Member)

One thing I suggest including that I didn't see in your benchmark list: struct RGB { r: u8, g: u8, b: u8 }.

Last time I was experimenting, anything that couldn't move bigger chunks ended up with bad performance rotating those, presumably because it needed six byte-level memory operations per move.
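Just to illustrate why (a hypothetical check, not part of the benchmarks):

// Hypothetical check: RGB is 3 bytes with alignment 1, so per-element moves
// can't be widened to 4- or 8-byte loads/stores without touching a neighboring
// element.
#[allow(dead_code)]
struct Rgb { r: u8, g: u8, b: u8 }

fn main() {
    assert_eq!(std::mem::size_of::<Rgb>(), 3);
    assert_eq!(std::mem::align_of::<Rgb>(), 1);
}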

@AaronKutch (Contributor, Author) commented Jun 13, 2019

I renamed some benchmarks and added the RGB test. The new algorithm performs better on Intel at least. I just remembered that the AMD CPU is extremely inconsistent with microbenchmarks, with the measurements swinging wildly. I think I have done all I can do, unless there is some SIMD trickery I can apply. My algorithm 1 is still better than using algorithm 2 for u8.

I copied the `[usize; 32]` buffer concept from the old algorithm, which claims it is a stack buffer. Is it actually a stack buffer? It looks to me like a heap buffer, with the `MaybeUninit` and pointers obscuring that. Edit: I was just confused by there being a pointer to something on the stack. Is there something I can do to speed it up?
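For anyone else tripped up by the same thing, the pattern boils down to something like this sketch (not the exact libcore code):

use core::mem::MaybeUninit;

// Sketch of the stack-buffer pattern: the array lives on the stack and is simply
// left uninitialized; taking a raw pointer to it does not move it to the heap.
fn stack_buffer_sketch() {
    let mut rawarray = MaybeUninit::<[usize; 32]>::uninit();
    // buf points into the uninitialized stack allocation above.
    let buf = rawarray.as_mut_ptr() as *mut usize;
    let _ = buf;
}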

@mati865 (Contributor) commented Jun 13, 2019

> I do not know what is causing the u8_new to be much worse on the new algorithm on my AMD CPU, and if the problem persists on newer AMD hardware.

I see you are using Windows, which doesn't handle high core counts well. It got better with Windows 10 1903, but there is still room for improvement.

I can run benchmarks on Linux with Ryzen 2nd gen if you want.

@AaronKutch (Contributor, Author) commented Jun 13, 2019

Yes, thank you. Edit: don't benchmark it yet; I need to fix some problems I found.

@AaronKutch (Contributor, Author)

Ok, I pushed a few benchmarking fixes to the repository. The benches no longer clone and the performance difference is much more apparent.

@AaronKutch (Contributor, Author)

What I plan on doing is replacing the old ptr_rotate implementation with the new one, adding the test commented in the repository to libcore/tests/slice.rs, and adding the following to libcore/benches/slice.rs:

// Each bench rotates a slice of $n elements right by every amount from 0 to $n - 1;
// $mapper controls the element type (and therefore its size).
macro_rules! rotate {
    ($fn:ident, $n:expr, $mapper:expr) => {
        #[bench]
        fn $fn(b: &mut Bencher) {
            let mut x = (0usize..$n).map(&$mapper).collect::<Vec<_>>();
            b.iter(|| {
                for s in 0..x.len() {
                    x[..].rotate_right(s);
                }
                black_box(x[0].clone())
            })
        }
    }
}

#[derive(Clone)]
struct Rgb(u8,u8,u8);

rotate!(rotate_u8, 32, |i| i as u8);
rotate!(rotate_rgb, 32, |i| Rgb(i as u8, (i as u8).wrapping_add(7), (i as u8).wrapping_add(42)));
rotate!(rotate_usize, 32, |i| i);
rotate!(rotate_usize_4, 32, |i| [i; 4]);
rotate!(rotate_usize_32, 32, |i| [i; 32]);
rotate!(rotate_usize_33, 32, |i| [i; 33]);

Does this sound good?

@mati865 (Contributor) commented Jun 14, 2019

Here are the results of nightly 2019-06-13 on a box with a 2700X running Ubuntu 19.04: https://gist.github.com/mati865/c8954a2c03ca7a10a71ebb54a229ae34
Glibc is well optimized, so the perf looks good but is within the margin of error; on the less optimized musl, the new functions are a significant win.

@AaronKutch (Contributor, Author) commented Jun 14, 2019

The size of the slice appears to have a significant effect on performance, even when the algorithm takes the same branches. For example, running

    rotate_bench!(x1031s149_old, x1031s149_new, 1031, 149); //rotate a slice of size 1031 by 149 to the right
    rotate_bench!(x1032s149_old, x1032s149_new, 1032, 149); //rotate a slice of size 1032 by 149 to the right

gives

test benches::x1031s149_new ... bench:         487 ns/iter (+/- 19)
test benches::x1031s149_old ... bench:         409 ns/iter (+/- 31)
test benches::x1032s149_new ... bench:         450 ns/iter (+/- 15)
test benches::x1032s149_old ... bench:         526 ns/iter (+/- 119)

These values change when I repeat the benchmarks, but the simple change of 1031 -> 1032 consistently improves my algorithm and consistently worsens the old one. I'm not sure what is causing this. Even with dbg! statements inserted, my algorithm takes the same branches in both cases (see the reduction sketch after the output below):

1031
[src\main.rs:152] "3.0" = "3.0" //algorithm 3
[src\main.rs:152] left = 882
[src\main.rs:152] right = 149
[src\main.rs:166] "3.1" = "3.1" //algorithm 3 but left < right
[src\main.rs:166] left = 137
[src\main.rs:166] right = 149
[src\main.rs:136] "2" = "2" //algorithm 2
[src\main.rs:136] left = 137
[src\main.rs:136] right = 12
1032
[src\main.rs:152] "3.0" = "3.0"
[src\main.rs:152] left = 883
[src\main.rs:152] right = 149
[src\main.rs:166] "3.1" = "3.1"
[src\main.rs:166] left = 138
[src\main.rs:166] right = 149
[src\main.rs:136] "2" = "2"
[src\main.rs:136] left = 138
[src\main.rs:136] right = 11
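As far as I can tell, the left/right values above just follow a Euclidean-style reduction of the two sides, which is why both lengths take the same path. A throwaway sketch of that reduction (not the real code, which does the actual memory moves and hands the small remainder to the buffer-based algorithm 2):

// Throwaway sketch of the left/right reduction seen in the dbg! output above:
// whichever side is larger is reduced modulo the other.
fn reduction_trace(mut left: usize, mut right: usize) {
    while left != 0 && right != 0 {
        println!("left = {}, right = {}", left, right);
        if left >= right {
            left %= right;
        } else {
            right %= left;
        }
    }
}

fn main() {
    reduction_trace(882, 149); // the 1031 case: 882 -> 137, then 149 -> 12, ...
    reduction_trace(883, 149); // the 1032 case: 883 -> 138, then 149 -> 11, ...
}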

@AaronKutch (Contributor, Author)

Would this have something to do with cache lines?

Centril added a commit to Centril/rust that referenced this issue Aug 9, 2019
Improve `ptr_rotate` performance, tests, and benches

The corresponding issue is rust-lang#61784. I am not actually sure if miri can handle the test, but I can change the commit if necessary.
bors added a commit that referenced this issue Aug 9, 2019
Improve `ptr_rotate` performance, tests, and benches

The corresponding issue is #61784. I am not actually sure if miri can handle the test, but I can change the commit if necessary.
@AaronKutch (Contributor, Author)

It was finally merged. I fixed some alignment problems and added a modification to the algorithm 1 branch.
