
Significant performance gap in assigning to a subsection of an array with and without broadcasting #40962

Open
jishnub opened this issue May 26, 2021 · 6 comments
Labels
arrays · broadcast · compiler:simd · performance

Comments

@jishnub
Contributor

jishnub commented May 26, 2021

julia> a = rand(40,40); b = rand(40,40);

julia> @btime $a[1:end,1:end] .= $b;
  5.382 μs (0 allocations: 0 bytes)

julia> @btime $a[1:end,1:end] = $b;
  2.107 μs (0 allocations: 0 bytes)

Presumably, the destination is a slow (non-contiguous) SubArray for the broadcasted assignment.

Interestingly, the performance is reversed if the index ranges are replaced by colons, in which case the destination is a Base.FastContiguousSubArray:

julia> @btime $a[:,:] .= $b;
  753.160 ns (0 allocations: 0 bytes)

julia> @btime $a[:,:] = $b;
  2.125 μs (0 allocations: 0 bytes)
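
One way to see the distinction is to inspect the view types directly (Base.FastContiguousSubArray is an internal alias, so the exact check may vary across versions):

julia> (@view a[1:end, 1:end]) isa Base.FastContiguousSubArray
false

julia> (@view a[:, :]) isa Base.FastContiguousSubArray
true
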
julia> versioninfo()
Julia Version 1.11.0-DEV.1397
Commit 0588cd40786 (2024-01-29 02:21 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, tigerlake)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/usr/lib/x86_64-linux-gnu/gtk-3.0/modules
  JULIA_EDITOR = subl
@jishnub changed the title from "Broadcasted setindex! to a subsection of an array is much slower than setindex! without broadcasting" to "Significant performance gap in assigning to a subsection of an array with and without broadcasting" on May 26, 2021
@stevengj added the arrays and performance labels on Jun 10, 2021
@N5N3
Member

N5N3 commented Jul 1, 2021

Well, the improved performance from replacing a[1:end,1:end] with a[:,:] does not surprise me, as the copyto! that gets invoked is not well tuned at present (see #39345).

If we replace the separate iteration over source and destination indices with zip, the performance of copyto! improves. A @simd version is even faster.

Some benchmark:

a = rand(40,40); b = rand(40,40);
@btime $a[:,:] = $b          # 1.330 μs (0 allocations: 0 bytes)
@btime $a[1:end,1:end] = $b  # 1.290 μs (0 allocations: 0 bytes)
@btime $a[:,:] .= $b;        # 385.859 ns (0 allocations: 0 bytes)
@btime $a[1:end,1:end] .= $b # 3.056 μs (0 allocations: 0 bytes) falls into a slow branch
a′ = @view a[1:end,1:end]
@btime zip_copyto!($a′, $b); # 1.480 μs (0 allocations: 0 bytes) much faster, almost as fast as setindex!
@btime simd_copyto!($a′, $b); # 234.332 ns (0 allocations: 0 bytes) fastest; I believe the general broadcast machinery can reach a similar speed

I tried to implement a faster copyto_unaliased! based on @simd and zip. (Without @simd, the generated code won't be vectorized in the above case.)
But it seems we can't use these tools in that function?
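
For reference, zip_copyto! and simd_copyto! above are not Base functions; a minimal sketch of such helpers (assuming the arrays share axes) could look like:

function zip_copyto!(dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) || throw(DimensionMismatch("arrays must have matching axes"))
    # advance source and destination indices together instead of keeping a separate counter
    for (Idest, Isrc) in zip(eachindex(dest), eachindex(src))
        @inbounds dest[Idest] = src[Isrc]
    end
    return dest
end

function simd_copyto!(dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) || throw(DimensionMismatch("arrays must have matching axes"))
    # eachindex(dest, src) picks an index space valid for both arrays;
    # @simd lets LLVM vectorize the innermost loop
    @simd for I in eachindex(dest, src)
        @inbounds dest[I] = src[I]
    end
    return dest
end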

@LilithHafner
Member

Is this the same issue?

u = rand(UInt, 10_000); lo = 1; hi = 10_000; u_min = 1729;
fast(u, lo, hi, u_min) = (@inbounds for i in lo:hi u[i] -= u_min end; u)
slow(u, lo, hi, u_min) = (@inbounds u[lo:hi] .-= u_min; u)

@belapsed fast($u, $lo, $hi, $u_min) # 1.1795e-6
@belapsed slow($u, $lo, $hi, $u_min) # 7.82e-6

@N5N3
Member

N5N3 commented Mar 14, 2022

Is #44230 (comment) the same issue?

No, your example is #43153

@brenhinkeller
Sponsor Contributor

brenhinkeller commented Nov 19, 2022

Possibly related to #28126, #45487, etc.

@brenhinkeller added the broadcast and compiler:simd labels on Nov 19, 2022
@jishnub
Contributor Author

jishnub commented Feb 1, 2024

Interestingly, the performance is reversed for vectors, in which case it's faster to broadcast:

julia> A = zeros(1000); B = rand(1000);

julia> @btime $A[1:end] = @view $B[1:end];
  734.116 ns (0 allocations: 0 bytes)

julia> @btime $A[1:end] .= @view $B[1:end];
  122.402 ns (0 allocations: 0 bytes)

julia> @btime $A[:] = @view $B[:];
  1.014 μs (0 allocations: 0 bytes)

julia> @btime $A[:] .= @view $B[:];
  123.959 ns (0 allocations: 0 bytes)

setindex! using colons becomes significantly faster after #52809 (and is faster than the broadcast assignment for both vectors and matrices). On that PR:

julia> @btime $A[:] = @view $B[:];
  103.745 ns (0 allocations: 0 bytes)

julia> @btime $A[:] .= @view $B[:];
  113.488 ns (0 allocations: 0 bytes)

and for the matrices in the original post:

julia> @btime $a[:,:] .= $b;
  753.387 ns (0 allocations: 0 bytes)

julia> @btime $a[:,:] = $b;
  618.081 ns (0 allocations: 0 bytes)

@N5N3
Member

N5N3 commented Feb 1, 2024

Well, @btime $A[1:end] = @view $B[1:end]; is dispatched to

function setindex!(A::Array, X::AbstractArray, I::AbstractVector{Int})
    @_propagate_inbounds_meta
    @boundscheck setindex_shape_check(X, length(I))
    require_one_based_indexing(X)
    X′ = unalias(A, X)
    I′ = unalias(A, I)
    count = 1
    for i in I′
        @inbounds x = X′[count]
        A[i] = x
        count += 1
    end
    return A
end

Apparently the bounds check of A[i] = x blocks vectorization here.

KristofferC pushed a commit that referenced this issue Feb 2, 2024
As noted by @N5N3 in
#40962 (comment),
the bounds-check on the elementwise `setindex!` prevents vectorization.
Explicitly performing the bounds-check and marking the `setindex!` as
`@inbounds` speeds up the operation.
```julia
julia> A = zeros(1000); B = rand(1000);

julia> @btime $A[1:end] = @view $B[1:end];
  689.940 ns (0 allocations: 0 bytes) # master
  97.629 ns (0 allocations: 0 bytes) # PR
```
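
A minimal sketch of the kind of change described above (hoist the bounds check out of the loop so the element-wise stores can be marked `@inbounds`; not necessarily the exact code from the PR):

```julia
# Hypothetical variant of the Array setindex! method quoted earlier:
# check all indices once up front, then store without per-element bounds checks.
function setindex_hoisted!(A::Array, X::AbstractArray, I::AbstractVector{Int})
    Base.setindex_shape_check(X, length(I))
    Base.require_one_based_indexing(X)
    X′ = Base.unalias(A, X)
    I′ = Base.unalias(A, I)
    checkbounds(A, I′)               # single bounds check for the whole index vector
    count = 1
    for i in I′
        @inbounds A[i] = X′[count]   # store no longer blocks vectorization
        count += 1
    end
    return A
end
```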
vtjnash pushed a commit that referenced this issue Feb 20, 2024
…3383)

With this, the following (and equivalent calls) work:
```julia
julia> copyto!(view(zeros(BigInt, 2), 1:2), Vector{BigInt}(undef,2))
2-element view(::Vector{BigInt}, 1:2) with eltype BigInt:
 #undef
 #undef

julia> copyto!(view(zeros(BigInt, 2), 1:2), view(Vector{BigInt}(undef,2), 1:2))
2-element view(::Vector{BigInt}, 1:2) with eltype BigInt:
 #undef
 #undef
```

Close #53098. With this, all
the `_unsetindex!` branches in `copyto_unaliased!` work for
`Array`-views, and this makes certain indexing operations vectorize and
speed-up:
```julia
julia> using BenchmarkTools

julia> a = view(rand(100,100), 1:100, 1:100); b = view(similar(a), axes(a)...);

julia> @btime copyto!($b, $a);
  16.427 μs (0 allocations: 0 bytes) # master
  2.308 μs (0 allocations: 0 bytes) # PR
``` 

Improves (but doesn't resolve)
#40962 and
#53158

```julia
julia> a = rand(40,40); b = rand(40,40);

julia> @btime $a[1:end,1:end] .= $b;
  5.383 μs (0 allocations: 0 bytes) # v"1.12.0-DEV.16"
  3.194 μs (0 allocations: 0 bytes) # PR
```
Co-authored-by: Jameson Nash <[email protected]>
tecosaur pushed a commit to tecosaur/julia that referenced this issue Mar 4, 2024
mkitti pushed a commit to mkitti/julia that referenced this issue Mar 7, 2024
KristofferC pushed a commit that referenced this issue Mar 27, 2024

(cherry picked from commit 1a90409)