impls on arrays seem slower than equivalent tuples #80140
@rustbot label I-slow
Tagging another interested party: @aldanor
The only difference in the compiled code is the array impl's extra check for pointer equality: https://rust.godbolt.org/z/1ea4e5
Huh, on 1.48.0 the tuple impl ends up branching on each value, and on nightly the array impl spills to the stack (?!).
Correct me if I'm wrong, but since those arrays are being passed by value their addresses can never be the same right? So that check and branch are completely pointless? I could understand if it was over a slice or a ref to an array, but not on an array itself.
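To make the check under discussion concrete, here is a minimal sketch of a slice comparison with and without a pointer-identity shortcut. The helper names `eq_with_ptr_check` and `eq_plain` are hypothetical and this is not the standard library source; it only illustrates why the shortcut cannot fire when two distinct arrays are compared, and why dropping it does not change the result.

```rust
/// Hypothetical helper: slice equality with a pointer-identity shortcut.
fn eq_with_ptr_check(a: &[i32], b: &[i32]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    // Same start address and same length means the same memory,
    // so the contents are trivially equal.
    if std::ptr::eq(a.as_ptr(), b.as_ptr()) {
        return true;
    }
    a.iter().zip(b).all(|(x, y)| x == y)
}

/// Hypothetical helper: the same comparison without the shortcut.
fn eq_plain(a: &[i32], b: &[i32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x == y)
}

fn main() {
    let a = [1, 2, 3];
    let b = [1, 2, 3];
    let c = [1, 2, 4];

    // Both variants agree on every input; the shortcut only saves work
    // when the two slices alias, which two separate arrays never do.
    assert_eq!(eq_with_ptr_check(&a, &b), eq_plain(&a, &b));
    assert_eq!(eq_with_ptr_check(&a, &c), eq_plain(&a, &c));
    assert_eq!(eq_with_ptr_check(&a, &a), eq_plain(&a, &a));
    println!("both variants agree");
}
```

The shortcut is only sound for element types whose equality is reflexive, which is why the sketch is restricted to `i32`.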
Results on current nightly:
Release:
Looks basically the same to me, just a little bit of noise.
The IR variant of the above godbolt link is enlightening: https://rust.godbolt.org/z/Wb4dTY
Looks like this is fallout from the switch to pass values smaller than 2 registers by-value. For the array case this ends up requiring a stack copy, because the bcmp is expanded too late to avoid it. For the tuple case the value is passed as an i96 (!!!) and then parts are extracted from it using bit arithmetic. I guess in the latter case LLVM should probably fold these comparisons down to one comparison of i96.
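The folding idea can be illustrated at the source level. The sketch below is not what LLVM emits; it assumes a three-`i32` tuple (96 bits of payload) and uses a hypothetical `pack` helper to place the fields in one wide integer, so that a single integer comparison gives the same answer as three field comparisons.

```rust
/// Hypothetical helper: pack three 32-bit fields into the low 96 bits of a u128.
fn pack(t: (i32, i32, i32)) -> u128 {
    // Cast through u32 first so negative values are not sign-extended
    // into the neighbouring field.
    (t.0 as u32 as u128) | ((t.1 as u32 as u128) << 32) | ((t.2 as u32 as u128) << 64)
}

/// Field-by-field comparison: one branch per field.
fn eq_fieldwise(a: (i32, i32, i32), b: (i32, i32, i32)) -> bool {
    a.0 == b.0 && a.1 == b.1 && a.2 == b.2
}

/// Single wide comparison over the packed representation.
fn eq_packed(a: (i32, i32, i32), b: (i32, i32, i32)) -> bool {
    pack(a) == pack(b)
}

fn main() {
    let samples = [(1, -2, 3), (1, -2, 4), (0, 0, 0), (-1, -1, -1), (i32::MIN, i32::MAX, 7)];
    // The two strategies must agree on every pair of samples.
    for &a in &samples {
        for &b in &samples {
            assert_eq!(eq_fieldwise(a, b), eq_packed(a, b));
        }
    }
    println!("packed and field-wise comparisons agree");
}
```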
That is true, but passing that information to LLVM results in miscompilations IIRC so it's not done by default.
I thought we removed this comparison because it caused other problems, but the PR languished and died; resurrected in #80209.
Remove pointer comparison from slice equality This resurrects rust-lang#71735. Fixes rust-lang#71602, helps with rust-lang#80140. r? `@Mark-Simulacrum`
Despite that PR, the latest nightly sadly changes nothing for benchmark timings here on my system.
#80209 only affects arrays that are passed indirectly. Arrays (or any aggregate, roughly) are passed indirectly when they're larger than When passed by value, LLVM can tell that the arrays never have the same address, and eliminates the check. So it would otherwise generate optimal code...but it also unnecessarily spills to the stack (#80140 (comment)), which is a different problem.
I tried incrementing the size of my benchmark to 12 elements (the largest size of tuple that has a PartialEq impl), and changed the types to i64, to ensure they'd be large enough to be passed indirectly. Same result.

    use criterion::{black_box, Criterion};

    fn main() {
        let mut c = Criterion::default();
        c.bench_function("partialeq-array", |b| {
            b.iter(|| {
                let a: [i64; 12] = black_box([0; 12]);
                let b: [i64; 12] = black_box([0; 12]);
                for _ in 0..1_000_000 {
                    black_box(a.eq(&b));
                }
            })
        });
        c.bench_function("partialeq-tuple", |b| {
            b.iter(|| {
                let a: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                    black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
                let b: (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64) =
                    black_box((0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0));
                for _ in 0..1_000_000 {
                    black_box(a.eq(&b));
                }
            })
        });
    }
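As a quick sanity check on the sizes used in the two benchmark variants, the sketch below prints the size of each type and compares it against two pointer widths. The `fits_in_two_registers` helper and its "at most two `usize`s" rule are assumptions made for illustration, loosely based on the "2 registers" remark earlier in the thread; the real ABI decision is made by the compiler and depends on the target.

```rust
use std::mem::size_of;

type Tuple3 = (i32, i32, i32);
type Tuple12 = (i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64, i64);

/// Hypothetical approximation of "small enough to pass by value":
/// the type occupies at most two pointer-sized registers.
fn fits_in_two_registers<T>() -> bool {
    size_of::<T>() <= 2 * size_of::<usize>()
}

fn main() {
    // Small variant: 3 x i32 = 12 bytes for both array and tuple.
    println!("[i32; 3]  = {} bytes", size_of::<[i32; 3]>());
    println!("Tuple3    = {} bytes", size_of::<Tuple3>());
    // Large variant: 12 x i64 = 96 bytes for both array and tuple.
    println!("[i64; 12] = {} bytes", size_of::<[i64; 12]>());
    println!("Tuple12   = {} bytes", size_of::<Tuple12>());

    println!(
        "[i32; 3] fits in two registers (assumed rule): {}",
        fits_in_two_registers::<[i32; 3]>()
    );
    println!(
        "[i64; 12] fits in two registers (assumed rule): {}",
        fits_in_two_registers::<[i64; 12]>()
    );
}
```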
Strangely the tuple time didn't actually get any longer despite being much bigger. Is there something wrong in the benchmark or is it seeing through black_box somehow?
As of 1.51.0 these timings remain unchanged, both for small types (3 i32s) and large ones (12 i64s).
This seems to be somewhat resolved sometime between 1.51 and the current nightly. While the assembly generated is not the same, the runtime of both is now roughly equivalent on my system:
Is it fixed on beta?
This seems to be fixed in the current nightly. While debug builds still exhibit the issue it's to a much smaller degree, and release builds don't show it at all.
Sadly
When running the following program:
Cargo.toml:
main.rs:
I get these results in debug builds:
And these results in release builds:
I had expected these to be equivalent, but tuples are significantly faster than arrays. This seems to apply to any size of number (i8, u32, etc), and some other traits as well (notably Hash).
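For readers who want to try the comparison without a benchmarking crate, here is a rough, standard-library-only sketch. It is not the program from the report; the `time_it` helper is hypothetical, and timings from a plain `Instant` loop are noisy, so treat the printed numbers as indicative only.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `f` `iters` times, returning the elapsed time
/// and how many calls returned true (so the work cannot be discarded).
fn time_it<F: FnMut() -> bool>(iters: u32, mut f: F) -> (Duration, u32) {
    let start = Instant::now();
    let mut hits = 0;
    for _ in 0..iters {
        if black_box(f()) {
            hits += 1;
        }
    }
    (start.elapsed(), hits)
}

fn main() {
    let iters = 1_000_000;

    // black_box hides the values from the optimizer so the comparison
    // is not folded to a constant.
    let a: [i32; 3] = black_box([1, 2, 3]);
    let b: [i32; 3] = black_box([1, 2, 3]);
    let (array_time, array_hits) = time_it(iters, || black_box(&a) == black_box(&b));

    let t: (i32, i32, i32) = black_box((1, 2, 3));
    let u: (i32, i32, i32) = black_box((1, 2, 3));
    let (tuple_time, tuple_hits) = time_it(iters, || black_box(&t) == black_box(&u));

    // Both comparisons are semantically equivalent, so the hit counts match.
    assert_eq!(array_hits, tuple_hits);
    println!("array: {:?}, tuple: {:?}", array_time, tuple_time);
}
```

Build with and without `--release` to compare debug and release behaviour, as the report does.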