
A faster copyto_unaliased! #41434

Closed
wants to merge 47 commits into from

Conversation

N5N3
Member

@N5N3 N5N3 commented Jul 1, 2021

At present, a .= b falls back to copyto!(a, b) when axes(a) == axes(b), as a performance optimization.
However, as mentioned in #40962 and #39345, the performance of copyto!(a, b) is poor in many cases. Thus, I think we can implement a faster version here, based on @simd, to solve the problem to some extent.
(I tried to optimize copyto_unalias! directly, but much of the generated code is not vectorized without @simd.)

This PR tries to speed up copyto_unaliased! in many cases:

  1. Linear-to-linear copies use Base.OneTo as the iterator instead of LinearIndices, which is faster.
  2. A zip()-based iterator is used to speed up the most general case.
  3. The innermost loop for CartesianIndices is separated out manually, as @simd does (but without adding Expr(:loopinfo)), to speed things up when the first dimension's length is larger than 16.

A rough sketch of ideas 1 and 2 follows this list.
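
A minimal sketch of ideas 1 and 2, with hypothetical names (copyto_sketch! is my illustration, not the PR's actual function):

function copyto_sketch!(dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) || throw(DimensionMismatch("axes must match"))
    if IndexStyle(dest) isa IndexLinear && IndexStyle(src) isa IndexLinear
        # Idea 1: iterate Base.OneTo(length(src)) rather than LinearIndices
        # (assumes 1-based linear indices, the common case).
        for i in Base.OneTo(length(src))
            @inbounds dest[i] = src[i]
        end
    else
        # Idea 2: zip the native index iterators, so each array is walked
        # with its own efficient index type.
        for (I, J) in zip(eachindex(dest), eachindex(src))
            @inbounds dest[I] = src[J]
        end
    end
    return dest
end

Benchmarks (before vs. after this PR):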
size:(16, 2048) -> size:(16, 2048)
Car -> Car : before: 17.5μs -> after: 8.5μs
Car -> Lin : before: 52.2μs -> after: 8.5μs
Lin -> Car : before: 75.5μs -> after: 8.77μs
Lin -> Lin : before: 10.6μs -> after: 8.3μs
size:(16, 4, 512) -> size:(16, 4, 512)
Car -> Car : before: 18μs -> after: 9.5μs
Car -> Lin : before: 50.1μs -> after: 9.3μs
Lin -> Car : before: 82μs -> after: 11.6μs
Lin -> Lin : before: 10.7μs -> after: 8.37μs
size:(16, 4, 4, 128) -> size:(16, 4, 4, 128)
Car -> Car : before: 19.7μs -> after: 9.4μs
Car -> Lin : before: 79.3μs -> after: 9.2μs
Lin -> Car : before: 95.8μs -> after: 9.2μs
Lin -> Lin : before: 10.6μs -> after: 8.37μs
size:(16, 4, 4, 4, 32) -> size:(16, 4, 4, 4, 32)
Car -> Car : before: 20.4μs -> after: 9.8μs
Car -> Lin : before: 94.9μs -> after: 9.1μs
Lin -> Car : before: 119μs -> after: 10μs
Lin -> Lin : before: 10.6μs -> after: 8.37μs

The benchmarks above, using SubArray{Float64}, were run on 1.7.0-beta3, where the performance regression in CartesianIndices iteration (#38073) seems to be fixed.

julia> using BenchmarkTools

julia> a = randn(50,50); b = randn(50,50); # 2D case
# before
julia> @btime $a[1:end,1:end] .= $b;
  4.900 μs (0 allocations: 0 bytes)
julia> @btime $a[:,:] .= $b;
  721.642 ns (0 allocations: 0 bytes)
# after
julia> @btime $a[1:end,1:end] .= $b;
  355.238 ns (0 allocations: 0 bytes)
julia> @btime $a[:,:] .= $b;
  351.643 ns (0 allocations: 0 bytes)

julia> a = randn(50*50); b = randn(50*50); # 1D case
# before
julia> @btime $a .= $b;
  406.566 ns (0 allocations: 0 bytes)
# after
julia> @btime $a .= $b;
  270.607 ns (0 allocations: 0 bytes)
@N5N3
Member Author

N5N3 commented Jul 1, 2021

The failure in mpfr seems unrelated?

@oscardssmith
Member

Can you add a 3d benchmark? Other than that, this looks good.

@N5N3 N5N3 changed the title Exclude the slow branch in Broadcast's copyto! Remove the slow branch in Broadcast's copyto! Jul 2, 2021
@N5N3
Member Author

N5N3 commented Jul 2, 2021

It seems the speed of the general broadcast kernel is strongly affected by the array's shape:
50*50 is almost 5 times faster than 5*500.
Maybe the exclude step in preprocess should be omitted in this case for better performance?
I'll run more tests.
Without the exclusion things look much better, so a better approach may be to implement a faster copyto_unaliased!?

@N5N3 N5N3 marked this pull request as draft July 2, 2021 01:34
@N5N3 N5N3 closed this Jul 2, 2021
@oscardssmith
Member

Why did you close this?

@N5N3
Member Author

N5N3 commented Jul 2, 2021

Just to revert the branch and instead try to speed up copyto_unalias!.
The acceleration is considerable on my desktop, but the code might be unclear (it includes some @simd that should not appear).

@N5N3 N5N3 reopened this Jul 2, 2021
@N5N3 N5N3 changed the title Remove the slow branch in Broadcast's copyto! A faster copyto_unalias! Jul 2, 2021
@N5N3
Member Author

N5N3 commented Jul 2, 2021

Oops, it seems we can't use @simd in abstractarray.jl, even though it speeds things up a lot.
I looked into the native code: without @simd, much of the generated code is not vectorized (even when copying one linearly indexed array to another).
I can't find a solution, so the optimization is limited to a .= b.
If the regression in copyto!(b, a) is resolved, this separate implementation could be reverted.

@N5N3 N5N3 marked this pull request as ready for review July 2, 2021 08:48
@N5N3 N5N3 changed the title A faster copyto_unalias! Speed up a .= b when axes(a) == axes(b) Jul 2, 2021
@@ -991,11 +991,17 @@ preprocess_args(dest, args::Tuple{}) = ()
# Specialize this method if all you want to do is specialize on typeof(dest)
@inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
# Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
# Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match.
# However copyto!(dest, A) is very slow in many cases, implement a faster version here.
Member

Wouldn't it be better to speed up copyto!(dest, A)?

Member Author

@N5N3 N5N3 Jul 4, 2021

In fact, I tried, but the problem seems to be a bit more complicated, given the following facts:

  1. copyto!(::Array, ::Array) calls C's memmove, but that seems slower than a plain loop:
julia> a = randn(1000); b = similar(a);
julia> @btime copyto!($b,$a);                    # this calls C's memmove
  120.524 ns (0 allocations: 0 bytes)
julia> @btime copyto!(IndexLinear(),$b,IndexLinear(),$a);  # this calls Julia's copyto_unalias!
  74.512 ns (0 allocations: 0 bytes)

It seems we could use C's memcpy instead of memmove to solve this (see the sketch at the end of this comment).

  2. For a general AbstractArray, I can't find a way to accelerate copyto_unalias!(dest, A) without @simd:
using BenchmarkTools
a = randn(40,40); b = similar(a);
f1!(a,b) = @inbounds for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

f2!(a,b) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end
@btime f1!($a, $b)   # 247.132 ns (0 allocations: 0 bytes)
@btime f2!($a, $b)   # 220.548 ns (0 allocations: 0 bytes)  a little faster
@btime @views f1!($a[1:end,1:end], $b[1:end,1:end])  # 1.710 μs (0 allocations: 0 bytes)
@btime @views f2!($a[1:end,1:end], $b[1:end,1:end])  # 262.500 ns (0 allocations: 0 bytes)  much faster
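
On the memcpy point above, a hedged sketch of what a memcpy-based fast path could look like (copyto_memcpy! is a hypothetical name; it assumes dest and src are known not to alias, which is the guarantee copyto_unaliased! already has):

# memmove must tolerate overlapping memory; memcpy may assume there is none,
# which can skip the overlap check and take the fastest path.
function copyto_memcpy!(dest::Vector{Float64}, src::Vector{Float64})
    n = min(length(dest), length(src))
    GC.@preserve dest src ccall(:memcpy, Cvoid,
        (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
        pointer(dest), pointer(src), n * sizeof(Float64))
    return dest
end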

Member

Why can't you use @simd?

Member Author

@N5N3 N5N3 Jul 4, 2021

The buildbot throws an error that @simd is not defined. I'm not familiar with the build system, but it seems to be a world-age problem? (abstractarray.jl is included before simdloop.jl in Base.jl)

Member

Sounds just like a bootstrapping problem, i.e. @simd is not defined until later in the build process. That should be fixable, e.g. by moving the copyto! definition to a file loaded later.

N5N3 added 3 commits July 5, 2021 08:58
Put SimdLoop in advance to avoid bootstrapping (I hope this works)
Focus on copyto_unalias!, put SimdLoop in advance.
@N5N3 N5N3 marked this pull request as draft July 5, 2021 01:25
@KristofferC
Sponsor Member

I think this is #38073 cc @kimikage
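
For reference, foo! and simdfoo! come from the linked issue; presumably they follow the same pattern as f1!/f2! above, e.g. (my reconstruction, not necessarily the exact code from #38073):

using BenchmarkTools

# Plain Cartesian-indexed copy loop.
foo!(b, a) = @inbounds for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

# The same loop, with the @simd annotation added.
simdfoo!(b, a) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end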

On 1.5:

julia> @btime foo!($b, $a) #  6.480 μs (0 allocations: 0 bytes)
  1.658 μs (0 allocations: 0 bytes)

julia> @btime simdfoo!($b, $a) #  2.000 μs (0 allocations: 0 bytes)
  2.368 μs (0 allocations: 0 bytes)

On 1.6:

julia> @btime foo!($b, $a) #  6.480 μs (0 allocations: 0 bytes)
  6.112 μs (0 allocations: 0 bytes)

julia> @btime simdfoo!($b, $a) #  2.000 μs (0 allocations: 0 bytes)
  1.768 μs (0 allocations: 0 bytes)

@N5N3
Member Author

N5N3 commented Jul 6, 2021

Some more benchmarks. On 4-, 5-, and 6-dimensional CartesianIndices iteration, 1.6.1 is consistently slower than 1.5.4, even comparing 1.6.1 with @simd against 1.5.4 without it.
But at least @simd no longer makes performance worse.

julia> a = randn(4,4,4,4,4,4); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 9.200 μs      1.5.4: 3.112 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 3.763 μs      1.5.4: 6.040 μs

julia> a = randn(5,5,5,5,5); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 6.000 μs      1.5.4: 2.144 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 2.378 μs      1.5.4: 3.287 μs

julia> a = randn(8,8,8,8); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 6.420 μs      1.5.4: 1.840 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 1.990 μs      1.5.4: 2.678 μs

julia> a = randn(16,16,16); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 5.833 μs      1.5.4: 961.857 ns
julia> @btime simdfoo!($b,$a);        #  1.6.1: 747.619 ns    1.5.4: 999.900 ns

julia> a = randn(64,64); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 3.737 μs      1.5.4: 5.650 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 512.500 ns    1.5.4: 699.306 ns

@N5N3
Member Author

N5N3 commented Jul 6, 2021

@mcabbott

There were concerns raised there about arrays like BitArray which alias themselves; I don't know whether or not this PR avoids that. (E.g. I don't know whether messing with Expr(:loopinfo) is identical to @simd.)

I tried removing every Expr(:loopinfo) generated by @simd while keeping the expanded loop body.
The results show that the speed is slightly affected in some cases, but the acceleration is still considerable.
Since the expanded code is equivalent, I think this is safe enough?
Maybe we could add a safe option for @simd, like:

@simd safe for i in 1:10
end

which only expands the loop but does not mark it as SIMD.
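
To illustrate, for a CartesianIndices loop the "expand but don't mark" transform would be roughly equivalent to hand-writing the copy like this (a sketch assuming at least one dimension; the real lowering lives in base/simdloop.jl):

function expanded_copy!(dest, src)
    ax = axes(src)
    inner = ax[1]
    outer = CartesianIndices(Base.tail(ax))
    for Iout in outer
        # The innermost loop is separated out manually, as @simd's expansion
        # does, but no Expr(:loopinfo) is attached, so LLVM gets no SIMD hint.
        for i in inner
            @inbounds dest[i, Iout] = src[i, Iout]
        end
    end
    return dest
end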

Some benchmarks:

size:(8, 8, 8, 8) -> size:(16, 16, 16)
Car -> Car : before: 23.8μs -> safesimd: 8.27μs -> simd: 7.6μs
Car -> Lin : before: 11.8μs -> safesimd: 1.91μs -> simd: 1.9μs
Lin -> Car : before: 10.5μs -> safesimd: 0.917μs -> simd: 1.03μs
Lin -> Lin : before: 1.25μs -> safesimd: 0.535μs -> simd: 0.535μs
size:(8, 8, 8, 7) -> size:(16, 16, 16)
Car -> Car : before: 20.9μs -> safesimd: 7.38μs -> simd: 6.78μs
Car -> Lin : before: 10.5μs -> safesimd: 1.7μs -> simd: 1.7μs
Lin -> Car : before: 9.2μs -> safesimd: 0.926μs -> simd: 0.928μs
Lin -> Lin : before: 1.04μs -> safesimd: 0.489μs -> simd: 0.493μs
size:(15, 16, 16) -> size:(8, 8, 8, 8)
Car -> Car : before: 20.5μs -> safesimd: 8.33μs -> simd: 7.08μs
Car -> Lin : before: 9μs -> safesimd: 1.62μs -> simd: 1.63μs
Lin -> Car : before: 11.8μs -> safesimd: 2.29μs -> simd: 2.3μs
Lin -> Lin : before: 1.11μs -> safesimd: 0.511μs -> simd: 0.514μs
size:(4, 4, 4, 4, 4, 4) -> size:(4, 4, 4, 4, 4, 4)
Car -> Car : before: 9.6μs -> safesimd: 3.96μs -> simd: 3.98μs
Car -> Lin : before: 18.9μs -> safesimd: 3.38μs -> simd: 3.58μs
Lin -> Car : before: 19.2μs -> safesimd: 5.57μs -> simd: 5.68μs
Lin -> Lin : before: 1.24μs -> safesimd: 0.539μs -> simd: 0.541μs
size:(5, 5, 5, 5, 5) -> size:(5, 5, 5, 5, 5)
Car -> Car : before: 6.24μs -> safesimd: 2.42μs -> simd: 2.56μs
Car -> Lin : before: 13μs -> safesimd: 2.19μs -> simd: 2.33μs
Lin -> Car : before: 11.1μs -> safesimd: 2.74μs -> simd: 2.73μs
Lin -> Lin : before: 0.913μs -> safesimd: 0.414μs -> simd: 0.413μs
size:(8, 8, 8, 8) -> size:(8, 8, 8, 8)
Car -> Car : before: 7.45μs -> safesimd: 2.1μs -> simd: 2.1μs
Car -> Lin : before: 12.1μs -> safesimd: 2.03μs -> simd: 1.92μs
Lin -> Car : before: 12.9μs -> safesimd: 2.49μs -> simd: 2.61μs
Lin -> Lin : before: 1.93μs -> safesimd: 0.55μs -> simd: 0.546μs
size:(16, 16, 16) -> size:(16, 16, 16)
Car -> Car : before: 4.3μs -> safesimd: 0.83μs -> simd: 0.822μs
Car -> Lin : before: 9.6μs -> safesimd: 0.914μs -> simd: 0.737μs
Lin -> Car : before: 11.4μs -> safesimd: 0.983μs -> simd: 1μs
Lin -> Lin : before: 1.93μs -> safesimd: 0.541μs -> simd: 0.538μs
size:(64, 64) -> size:(64, 64)
Car -> Car : before: 2.19μs -> safesimd: 0.54μs -> simd: 0.537μs
Car -> Lin : before: 7.82μs -> safesimd: 0.534μs -> simd: 0.538μs
Lin -> Car : before: 9.7μs -> safesimd: 0.585μs -> simd: 0.586μs
Lin -> Lin : before: 1.24μs -> safesimd: 0.537μs -> simd: 0.53μs

@JeffBezanson JeffBezanson added the arrays and performance labels Jul 6, 2021
N5N3 added 15 commits July 7, 2021 21:10
fix for 0d Cartesian AbstractArray. 
This version should be fast enough if the size of Cartesian array's first dim is larger than 16 (eltype Float64).
Fix for other IndexStyle. Only use manually expanded version when the size of 1st dim >=16
add test for expanded version.
fix white space, typo error; add test;
@vtjnash vtjnash requested a review from mbauman August 24, 2021 23:32
@N5N3 N5N3 closed this Oct 29, 2021