performance regression on high-dimensional array iteration using CartesianIndices (no simd) #38073

Open
johnnychen94 opened this issue Oct 17, 2020 · 7 comments
Labels: compiler:simd (instruction-level vectorization), performance (Must go faster)

Comments


johnnychen94 commented Oct 17, 2020

It turns out that #37829 increased iteration performance for 2d arrays, while slowing down iteration for higher-dimensional (>= 4d) arrays.

julia> using BenchmarkTools

julia> function arr_sum(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X)
  5.584 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  5.790 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.575 μs (0 allocations: 0 bytes) # 9405bf51a726a6383e6911eeb4235ba21ab3daee
  3.572 μs (0 allocations: 0 bytes) # 1.5.2
  5.959 μs (0 allocations: 0 bytes) # 1.0.5

julia> X = rand(64, 64);

julia> @btime arr_sum($X)
  3.627 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.734 μs (0 allocations: 0 bytes) # 1.5.2

SIMD and LinearIndices are not affected.

simd
julia> using BenchmarkTools

julia> function arr_sum_simd(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           @simd for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_simd (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_simd($X)
  3.593 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.827 μs (0 allocations: 0 bytes) # 1.5.2
  3.585 μs (0 allocations: 0 bytes) # 1.0.5
LinearIndices
julia> using BenchmarkTools

julia> function arr_sum_linear(X)
           val = zero(eltype(X))
           R = LinearIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_linear (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_linear($X)
  3.707 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.626 μs (0 allocations: 0 bytes) # 1.5.2
  3.796 μs (0 allocations: 0 bytes) # 1.0.5

johnnychen94 commented Oct 19, 2020

This is the benchmark result for arrays of 4096 elements at various dimensionalities.

OffsetArrays 1.3.1 is used as a comparison; it uses IdOffsetRange as its axes type. It is very interesting that OffsetArray on Julia 1.6.0-DEV gets the best relative performance.
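
For context, a minimal sketch of the comparison setup (assuming an OffsetArrays version that provides OffsetArrays.Origin; the exact axes printout may differ between versions). An OffsetArray constructed with Origin(1) covers exactly the same 1-based indices as its parent Array, so only the axes type (IdOffsetRange vs. Base.OneTo) seen by CartesianIndices differs:

using OffsetArrays

A  = rand(2, 2, 2)
AO = OffsetArray(A, OffsetArrays.Origin(1))   # same 1-based indices as A

axes(A)    # (Base.OneTo(2), Base.OneTo(2), Base.OneTo(2))
axes(AO)   # IdOffsetRange axes wrapping 1:2 in each dimension

# Both cover exactly the same Cartesian index range:
@assert collect(CartesianIndices(A)) == collect(CartesianIndices(AO))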

test (plot: benchmark time vs. number of dimensions, generated by the test script below)

dataframe (row = number of dimensions; times in seconds)
12×4 DataFrame
│ Row │ 1.5.2      │ 1.5.2-OffsetArrays │ 1.6.0-DEV.1262 │ 1.6.0-DEV.1262-OffsetArrays │
│     │ Any        │ Any                │ Any            │ Any                         │
├─────┼────────────┼────────────────────┼────────────────┼─────────────────────────────┤
│ 1   │ 3.70512e-6 │ 3.70813e-6         │ 3.706e-6       │ 3.73338e-6                  │
│ 2   │ 3.99612e-6 │ 3.71112e-6         │ 3.70863e-6     │ 3.73338e-6                  │
│ 3   │ 3.7115e-6  │ 3.71537e-6         │ 3.86275e-6     │ 4.44312e-6                  │
│ 4   │ 3.71663e-6 │ 4.54988e-6         │ 4.746e-6       │ 3.83937e-6                  │
│ 5   │ 3.72325e-6 │ 3.73e-6            │ 4.20529e-6     │ 3.84325e-6                  │
│ 6   │ 3.72862e-6 │ 4.00812e-6         │ 5.94833e-6     │ 4.15257e-6                  │
│ 7   │ 3.73425e-6 │ 3.73938e-6         │ 6.735e-6       │ 3.7555e-6                   │
│ 8   │ 8.007e-6   │ 9.16033e-6         │ 7.81225e-6     │ 3.84125e-6                  │
│ 9   │ 9.262e-6   │ 1.1286e-5          │ 8.5115e-6      │ 3.8475e-6                   │
│ 10  │ 1.0439e-5  │ 1.191e-5           │ 1.1283e-5      │ 4.20012e-6                  │
│ 11  │ 1.1511e-5  │ 1.3294e-5          │ 1.2812e-5      │ 4.76257e-6                  │
│ 12  │ 1.3063e-5  │ 1.6122e-5          │ 1.2355e-5      │ 5.18617e-6                  │
test script
using Test
using BenchmarkTools
using Plots
using Random
using OffsetArrays # 1.3.1

function arr_sum(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    for i in R
        @inbounds val += X[i]
    end
    val
end

sz_list = [
    (4096, ),
    (2048, 2),
    (1024, 2, 2),
    ( 512, 2, 2, 2),
    ( 256, 2, 2, 2, 2),
    ( 128, 2, 2, 2, 2, 2),
    (  64, 2, 2, 2, 2, 2, 2),
    (  32, 2, 2, 2, 2, 2, 2, 2),
    (  16, 2, 2, 2, 2, 2, 2, 2, 2),
    (   8, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
]
X = axes(sz_list, 1)

### run in Julia 1.5.2
sz = sz_list[2]
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T15O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T15O, t)
end
T15 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T15, t)
end

### run in Julia 1.6.0-DEV
sz = sz_list[2]
Random.seed!(0)
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T16O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T16O, t)
end
T16 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T16, t)
end

scatter(X, [T15, T16], labels="Array", markershape=:circle)
scatter!(X, [T15O, T16O], labels="OffsetArray", markershape=:rect)
plot!(X, [T15, T15O], labels="1.5.2", linestyle=:dash)
plot!(X, [T16, T16O], labels="1.6.0-DEV.1262", legend=:topleft)


timholy commented Oct 19, 2020

Interesting. So basically everything in 1.6 is no worse than 1.5, with the minor exception of Array in 6 and 7 dimensions?

@kimikage

I read #38086 (comment) and wondered if there was something going on with the bounds check (or inbounds propagation).

function arr_sum_both(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        @inbounds val += X[i]
    end
    val
end

function arr_sum_outeronly(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        val += X[i]
    end
    val
end
julia> VERSION
v"1.6.0-DEV.1322"

julia> @btime arr_sum($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.267 μs (0 allocations: 0 bytes)

This may be a separate issue, but it is weird that arr_sum_outeronly is slower.


johnnychen94 commented Dec 17, 2020

This is what I get now with two repeated benchmarks:

julia> VERSION
v"1.7.0-DEV.36"

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X);
  5.540 μs (0 allocations: 0 bytes)
  5.538 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.221 μs (0 allocations: 0 bytes)
  5.582 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.110 μs (0 allocations: 0 bytes)
  5.223 μs (0 allocations: 0 bytes)

This difference might just be noise.


kimikage commented Dec 19, 2020

I also checked again and found no difference in the native code depending on the position of @inbounds.
However, the noise does not seem to be random; the internal state of the CPU, such as the caches, may affect it in this case.
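
For reference, a minimal way to compare the generated code for the two @inbounds placements (a sketch, using arr_sum_both / arr_sum_outeronly as defined above; not necessarily how it was checked here):

using InteractiveUtils  # provides @code_llvm / @code_native

X = rand(4, 4, 4, 4, 4, 4)

# LLVM IR is usually easier to scan for leftover bounds checks
# (look for branches to a bounds-error block); the native code can
# be compared the same way with @code_native.
@code_llvm debuginfo=:none arr_sum_both(X)
@code_llvm debuginfo=:none arr_sum_outeronly(X)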

On the other hand, in the case of #38086 (comment), there is a clear difference.


KristofferC commented Mar 29, 2021

It seems like this caused a similar problem in https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085 to the one that was fixed in #39333.

https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085/12 has an MWE.

Adding @inbounds @simd outside the Cartesian loop seems to work around it.
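
For reference, a sketch of that workaround applied to the arr_sum example from this issue (the Discourse thread has the actual InterpolationKernels MWE):

function arr_sum_workaround(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    # @inbounds + @simd on the outer Cartesian loop takes the code path
    # that was not affected by the regression (see the @simd timings above)
    @inbounds @simd for i in R
        val += X[i]
    end
    val
end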

@kimikage

IIRC, this might have something to do with #39700 (comment).

However, I don't know what countermeasures could be taken on the compiler side.
