performance regression on high-dimensional array iteration using CartesianIndices (no simd) #38073

Open
johnnychen94 opened this issue Oct 17, 2020 · 7 comments
Labels: compiler:simd (instruction-level vectorization), performance (Must go faster)

Comments


johnnychen94 commented Oct 17, 2020

It turns out that #37829 increased iteration performance for 2d arrays, while slowing down iteration for higher-dimensional (>= 4d) arrays.

julia> using BenchmarkTools

julia> function arr_sum(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X)
  5.584 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  5.790 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.575 μs (0 allocations: 0 bytes) # 9405bf51a726a6383e6911eeb4235ba21ab3daee
  3.572 μs (0 allocations: 0 bytes) # 1.5.2
  5.959 μs (0 allocations: 0 bytes) # 1.0.5

julia> X = rand(64, 64);

julia> @btime arr_sum($X)
  3.627 μs (0 allocations: 0 bytes) # 17a3c7702e2cb20171d1211606343fc50533a588
  3.734 μs (0 allocations: 0 bytes) # 1.5.2

SIMD and LinearIndices are not affected.

simd
julia> using BenchmarkTools

julia> function arr_sum_simd(X)
           val = zero(eltype(X))
           R = CartesianIndices(X)
           @simd for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_simd (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_simd($X)
  3.593 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.827 μs (0 allocations: 0 bytes) # 1.5.2
  3.585 μs (0 allocations: 0 bytes) # 1.0.5
LinearIndices
julia> using BenchmarkTools

julia> function arr_sum_linear(X)
           val = zero(eltype(X))
           R = LinearIndices(X)
           for i in R
               @inbounds val += X[i]
           end
           val
       end
arr_sum_linear (generic function with 1 method)

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum_linear($X)
  3.707 μs (0 allocations: 0 bytes) # 1.6.0-DEV.1262
  3.626 μs (0 allocations: 0 bytes) # 1.5.2
  3.796 μs (0 allocations: 0 bytes) # 1.0.5

johnnychen94 commented Oct 19, 2020

This is the benchmark result for arrays of 4096 elements at various dimensionalities.

OffsetArrays 1.3.1 is used as a comparison; it uses IdOffsetRange as its axes type. It is very interesting that OffsetArray on Julia 1.6.0-DEV gets the best relative performance.
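
For context, a minimal sketch of the comparison setup (assuming an OffsetArrays version that provides OffsetArrays.Origin; the exact axes printout may differ between versions). An OffsetArray constructed with Origin(1) covers exactly the same 1-based indices as its parent Array, so only the axes type (IdOffsetRange vs. Base.OneTo) seen by CartesianIndices differs:

using OffsetArrays

A  = rand(2, 2, 2)
AO = OffsetArray(A, OffsetArrays.Origin(1))   # same 1-based indices as A

axes(A)    # (Base.OneTo(2), Base.OneTo(2), Base.OneTo(2))
axes(AO)   # IdOffsetRange axes wrapping 1:2 in each dimension

# Both cover exactly the same Cartesian index range:
@assert collect(CartesianIndices(A)) == collect(CartesianIndices(AO))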

test (plot: benchmark time vs. number of dimensions, generated by the test script below)

dataframe (row = number of dimensions; times in seconds)
12×4 DataFrame
│ Row │ 1.5.2      │ 1.5.2-OffsetArrays │ 1.6.0-DEV.1262 │ 1.6.0-DEV.1262-OffsetArrays │
│     │ Any        │ Any                │ Any            │ Any                         │
├─────┼────────────┼────────────────────┼────────────────┼─────────────────────────────┤
│ 1   │ 3.70512e-6 │ 3.70813e-6         │ 3.706e-6       │ 3.73338e-6                  │
│ 2   │ 3.99612e-6 │ 3.71112e-6         │ 3.70863e-6     │ 3.73338e-6                  │
│ 3   │ 3.7115e-6  │ 3.71537e-6         │ 3.86275e-6     │ 4.44312e-6                  │
│ 4   │ 3.71663e-6 │ 4.54988e-6         │ 4.746e-6       │ 3.83937e-6                  │
│ 5   │ 3.72325e-6 │ 3.73e-6            │ 4.20529e-6     │ 3.84325e-6                  │
│ 6   │ 3.72862e-6 │ 4.00812e-6         │ 5.94833e-6     │ 4.15257e-6                  │
│ 7   │ 3.73425e-6 │ 3.73938e-6         │ 6.735e-6       │ 3.7555e-6                   │
│ 8   │ 8.007e-6   │ 9.16033e-6         │ 7.81225e-6     │ 3.84125e-6                  │
│ 9   │ 9.262e-6   │ 1.1286e-5          │ 8.5115e-6      │ 3.8475e-6                   │
│ 10  │ 1.0439e-5  │ 1.191e-5           │ 1.1283e-5      │ 4.20012e-6                  │
│ 11  │ 1.1511e-5  │ 1.3294e-5          │ 1.2812e-5      │ 4.76257e-6                  │
│ 12  │ 1.3063e-5  │ 1.6122e-5          │ 1.2355e-5      │ 5.18617e-6                  │
test script
using Test
using BenchmarkTools
using Plots
using Random
using OffsetArrays # 1.3.1

function arr_sum(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    for i in R
        @inbounds val += X[i]
    end
    val
end

sz_list = [
    (4096, ),
    (2048, 2),
    (1024, 2, 2),
    ( 512, 2, 2, 2),
    ( 256, 2, 2, 2, 2),
    ( 128, 2, 2, 2, 2, 2),
    (  64, 2, 2, 2, 2, 2, 2),
    (  32, 2, 2, 2, 2, 2, 2, 2),
    (  16, 2, 2, 2, 2, 2, 2, 2, 2),
    (   8, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (   2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
]
X = axes(sz_list, 1)

### run in Julia 1.5.2
sz = sz_list[2]
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T15O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T15O, t)
end
T15 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T15, t)
end

### run in Julia 1.6.0-DEV
sz = sz_list[2]
Random.seed!(0)
A = rand(sz...);
AO = OffsetArray(A, OffsetArrays.Origin(1));
@assert arr_sum(A) == arr_sum(AO)

T16O = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=OffsetArray(rand($sz...), OffsetArrays.Origin(1)))
    @show sz t
    push!(T16O, t)
end
T16 = []
for sz in sz_list
    t = @belapsed arr_sum(A) setup=(A=rand($sz...))
    @show sz t
    push!(T16, t)
end

scatter(X, [T15, T16], labels="Array", markershape=:circle)
scatter!(X, [T15O, T16O], labels="OffsetArray", markershape=:rect)
plot!(X, [T15, T15O], labels="1.5.2", linestyle=:dash)
plot!(X, [T16, T16O], labels="1.6.0-DEV.1262", legend=:topleft)


timholy commented Oct 19, 2020

Interesting. So basically everything in 1.6 is no worse than 1.5, with the minor exception of Array in 6 and 7 dimensions?

@kimikage

I read #38086 (comment) and wondered if there was something going on with the bounds check (or inbounds propagation).

function arr_sum_both(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        @inbounds val += X[i]
    end
    val
end

function arr_sum_outeronly(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    @inbounds for i in R
        val += X[i]
    end
    val
end
julia> VERSION
v"1.6.0-DEV.1322"

julia> @btime arr_sum($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.033 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.267 μs (0 allocations: 0 bytes)

This may be a separate issue, but it is weird that arr_sum_outeronly is slower.


johnnychen94 commented Dec 17, 2020

This is what I get now with two repeated benchmarks:

julia> VERSION
v"1.7.0-DEV.36"

julia> X = rand(4, 4, 4, 4, 4, 4);

julia> @btime arr_sum($X);
  5.540 μs (0 allocations: 0 bytes)
  5.538 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_both($X);
  5.221 μs (0 allocations: 0 bytes)
  5.582 μs (0 allocations: 0 bytes)

julia> @btime arr_sum_outeronly($X);
  5.110 μs (0 allocations: 0 bytes)
  5.223 μs (0 allocations: 0 bytes)

This difference might just be noise.


kimikage commented Dec 19, 2020

I also checked again and found no difference in the native code depending on the position of @inbounds.
However, the noise does not seem to be random; the internal state of the CPU, such as the caches, may affect it in this case.
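
For reference, a minimal way to compare the generated code for the two @inbounds placements (a sketch, using arr_sum_both / arr_sum_outeronly as defined above; not necessarily how it was checked here):

using InteractiveUtils  # provides @code_llvm / @code_native

X = rand(4, 4, 4, 4, 4, 4)

# LLVM IR is usually easier to scan for leftover bounds checks
# (look for branches to a bounds-error block); the native code can
# be compared the same way with @code_native.
@code_llvm debuginfo=:none arr_sum_both(X)
@code_llvm debuginfo=:none arr_sum_outeronly(X)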

On the other hand, in the case of #38086 (comment), there is a clear difference.


KristofferC commented Mar 29, 2021

It seems like this caused a similar problem in https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085 to the one that was fixed in #39333.

https://discourse.julialang.org/t/drop-of-performances-with-julia-1-6-0-for-interpolationkernels/58085/12 has an MWE.

Adding @inbounds @simd outside the Cartesian loop seems to work around it.
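
For reference, a sketch of that workaround applied to the arr_sum example from this issue (the Discourse thread has the actual InterpolationKernels MWE):

function arr_sum_workaround(X)
    val = zero(eltype(X))
    R = CartesianIndices(X)
    # @inbounds + @simd on the outer Cartesian loop takes the code path
    # that was not affected by the regression (see the @simd timings above)
    @inbounds @simd for i in R
        val += X[i]
    end
    val
end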

@kimikage

IIRC, this might have something to do with #39700 (comment).

However, I don't know what countermeasures could be taken on the compiler side.
