Missed loop optimization for &Vec argument in .zip() #36920
Note: a similar version with `#[no_mangle]`:

```rust
#[no_mangle]
pub fn dot_mut_s(xs: &mut [u32], ys: &mut [u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += (*x) * (*y);
    }
    s
}
```

Maybe the problem is […]
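For reference, a self-contained sketch of the contrast this issue is about (the function names `dot_slice` and `dot_vec` are mine, not from the thread): the two bodies are identical, but at the time of the report the slice version autovectorized while the `&Vec` version did not.

```rust
// Slice version: zips &[u32] against &[u32]; this one autovectorizes.
pub fn dot_slice(xs: &[u32], ys: &[u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

// &Vec version: zip(ys) iterates through a &Vec<u32>, adding an
// extra level of indirection; historically this blocked vectorization.
pub fn dot_vec(xs: &Vec<u32>, ys: &Vec<u32>) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

fn main() {
    let a = vec![1u32, 2, 3, 4];
    let b = vec![5u32, 6, 7, 8];
    // Both compute the same dot product: 1*5 + 2*6 + 3*7 + 4*8 = 70.
    assert_eq!(dot_slice(&a, &b), 70);
    assert_eq!(dot_vec(&a, &b), 70);
}
```

The only difference between the two is the argument type; the loop bodies and the `zip`-based iteration are identical.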
It may just be the double indirection in the `&Vec` case.

I've seen double-indirection losses like that in closures too (references captured by reference vs. by move).

Yes! Double indirection is bad. No autovectorization:

```rust
#[no_mangle]
pub fn dot_ref_s(xs: &&[u32], ys: &&[u32]) -> u32 {
    let mut s = 0;
    for (x, y) in xs.iter().zip(*ys) {
        s += x * y;
    }
    s
}
```
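As a hedged illustration of the obvious workaround (not code from the thread; `dot_deref` is a hypothetical name): manually reborrow the `&&[u32]` arguments down to plain slices before the loop, so the optimizer only sees a single level of indirection.

```rust
// Workaround sketch: strip one level of indirection up front.
pub fn dot_deref(xs: &&[u32], ys: &&[u32]) -> u32 {
    // Reborrow && down to & so the loop operates on plain slices.
    let (xs, ys): (&[u32], &[u32]) = (*xs, *ys);
    let mut s = 0;
    for (x, y) in xs.iter().zip(ys) {
        s += x * y;
    }
    s
}

fn main() {
    let a: &[u32] = &[1, 2, 3];
    let b: &[u32] = &[4, 5, 6];
    // 1*4 + 2*5 + 3*6 = 32
    assert_eq!(dot_deref(&a, &b), 32);
}
```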
Inside ndarray I carefully use […]
The problem indeed looks like a dubious null check surviving:

Fixing this is blocked on https://llvm.org/bugs/show_bug.cgi?id=30597.
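For context on why the null check is "dubious": Rust guarantees that slice and `Vec` data pointers are non-null even when empty (an empty `Vec` stores a dangling but non-null pointer), which is exactly the invariant that `!nonnull` metadata would let LLVM exploit to delete the check. A small runnable demonstration of the invariant:

```rust
fn main() {
    // Even an empty Vec has a non-null (dangling) data pointer,
    // so a null check on the loaded pointer can never fire.
    let v: Vec<u32> = Vec::new();
    assert!(!v.as_ptr().is_null());

    // The same holds for the slice view of it.
    let s: &[u32] = &v;
    assert!(!s.as_ptr().is_null());
}
```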
The LLVM PR fixes the […]
cc rust-lang#36920 (in addition to LLVM PR30597, should fix the &&[i32] case)
rustc 0b2c356 plus that LLVM backport revision fix all the test cases in this issue. Awesome. Now to try what more it fixes.
ndarray changes (a selection of benchmarks; the rest seemed unchanged). Benchmarking is a bit noisy, so I had to pick a few and try to verify them. Comparing:

I believe the improvements in mean_axis0, sum_axis0, and iter_sum_2d_transpose_by_row are legitimate, and that iter_sum_2d_transpose_regular genuinely regresses for some reason. I don't think the rest can be verified to change in either direction. It's nice overall, though.
Any code changes in […]
Contains a backport of my patch, "[InstCombine] Transform !range metadata to !nonnull when combining loads". Fixes rust-lang#36920.
I'm not really finding what the difference is. Do you think I should look at the unoptimized or the optimized LLVM IR? It's not a big concern, since it's already the slow case for the iterator.
The optimized IR. Could you post it? |
Optimized IR here: there are two files, "old" and "new", the same versions as in the previous comment (#36920 (comment)).
I haven't spotted the difference yet. There's a lot of noise from what I think is the autovectorization of the fast path, which is not taken for the transposed array. Instead it should run the general/slow case here, which calls […]
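For readers without the ndarray context: the transposed benchmarks sum along the non-contiguous axis of a row-major array, so consecutive elements of a traversal are a full row-stride apart in memory and the contiguous fast path cannot apply. A minimal sketch of such a strided sum (not ndarray's actual code; the name `sum_transposed` is hypothetical):

```rust
// Sum a row-major `rows x cols` buffer column-by-column, i.e. along
// the non-contiguous axis: consecutive accesses are `cols` elements
// apart, which defeats the contiguous fast path.
fn sum_transposed(buf: &[f64], rows: usize, cols: usize) -> f64 {
    assert_eq!(buf.len(), rows * cols);
    let mut s = 0.0;
    for c in 0..cols {
        for r in 0..rows {
            s += buf[r * cols + c]; // stride of `cols` between reads
        }
    }
    s
}

fn main() {
    let buf = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3, row-major
    assert_eq!(sum_transposed(&buf, 2, 3), 21.0);
}
```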
The IRs look identical up to α-renaming. |
LLVM: Add triple for Fuchsia. Update subproject commit. Fixes #36920.
This is now in the latest nightly, and all four test cases produce the same nice code. Thanks a lot for working on it.
This is a case where two almost identical functions have very different performance, because the slice version is autovectorized and the Vec version is not. Note that specialization is involved, but both cases use exactly the same code path for the slice iterator.

Using rustc 1.14.0-nightly (289f3a4 2016-09-29).

Note: the Vec vs. slice difference is sometimes visible in actual inline uses of .zip in code, depending on the exact context.