
A faster copyto_unaliased! #41434

Closed
wants to merge 47 commits into from

Conversation

N5N3
Member

@N5N3 N5N3 commented Jul 1, 2021

At present, a .= b falls back to copyto!(a, b) when axes(a) == axes(b), as a performance optimization.
However, as mentioned in #40962 and #39345, the performance of copyto!(a, b) is poor in many cases. Thus, I think we can implement a faster version here, based on @simd, to solve the problem to some extent.
(I tried to optimize copyto_unalias! directly, but much of the generated code is not vectorized without @simd.)

This PR tries to speed up copyto_unaliased! in many cases:

  1. Linear-to-linear copies use Base.OneTo as the iterator instead of LinearIndices, which is faster.
  2. A zip()-based iterator is used to speed up the most general case.
  3. The innermost loop for CartesianIndices is separated out manually, as @simd does (but without adding Expr(:loopinfo)), to speed things up when the first dimension's length is larger than 16.

A rough sketch of ideas 1 and 2 follows this list.
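
A minimal sketch of ideas 1 and 2, with hypothetical names (copyto_sketch! is my illustration, not the PR's actual function):

function copyto_sketch!(dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) || throw(DimensionMismatch("axes must match"))
    if IndexStyle(dest) isa IndexLinear && IndexStyle(src) isa IndexLinear
        # Idea 1: iterate Base.OneTo(length(src)) rather than LinearIndices
        # (assumes 1-based linear indices, the common case).
        for i in Base.OneTo(length(src))
            @inbounds dest[i] = src[i]
        end
    else
        # Idea 2: zip the native index iterators, so each array is walked
        # with its own efficient index type.
        for (I, J) in zip(eachindex(dest), eachindex(src))
            @inbounds dest[I] = src[J]
        end
    end
    return dest
end

Benchmarks (before vs. after this PR):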
size:(16, 2048) -> size:(16, 2048)
Car -> Car : before: 17.5μs -> after: 8.5μs
Car -> Lin : before: 52.2μs -> after: 8.5μs
Lin -> Car : before: 75.5μs -> after: 8.77μs
Lin -> Lin : before: 10.6μs -> after: 8.3μs
size:(16, 4, 512) -> size:(16, 4, 512)
Car -> Car : before: 18μs -> after: 9.5μs
Car -> Lin : before: 50.1μs -> after: 9.3μs
Lin -> Car : before: 82μs -> after: 11.6μs
Lin -> Lin : before: 10.7μs -> after: 8.37μs
size:(16, 4, 4, 128) -> size:(16, 4, 4, 128)
Car -> Car : before: 19.7μs -> after: 9.4μs
Car -> Lin : before: 79.3μs -> after: 9.2μs
Lin -> Car : before: 95.8μs -> after: 9.2μs
Lin -> Lin : before: 10.6μs -> after: 8.37μs
size:(16, 4, 4, 4, 32) -> size:(16, 4, 4, 4, 32)
Car -> Car : before: 20.4μs -> after: 9.8μs
Car -> Lin : before: 94.9μs -> after: 9.1μs
Lin -> Car : before: 119μs -> after: 10μs
Lin -> Lin : before: 10.6μs -> after: 8.37μs

The benchmarks above, using SubArray{Float64}, were run on 1.7.0-beta3, where the performance regression in CartesianIndices iteration (#38073) seems to be fixed.

julia> using BenchmarkTools

julia> a = randn(50,50); b = randn(50,50); # 2D case
# before
julia> @btime $a[1:end,1:end] .= $b;
  4.900 μs (0 allocations: 0 bytes)
julia> @btime $a[:,:] .= $b;
  721.642 ns (0 allocations: 0 bytes)
# after
julia> @btime $a[1:end,1:end] .= $b;
  355.238 ns (0 allocations: 0 bytes)
julia> @btime $a[:,:] .= $b;
  351.643 ns (0 allocations: 0 bytes)

julia> a = randn(50*50); b = randn(50*50); # 1D case
# before
julia> @btime $a .= $b;
  406.566 ns (0 allocations: 0 bytes)
# after
julia> @btime $a .= $b;
  270.607 ns (0 allocations: 0 bytes)
@N5N3
Member Author

N5N3 commented Jul 1, 2021

The failure in mpfr seems unrelated?

@oscardssmith
Member

Can you add a 3d benchmark? Other than that, this looks good.

@N5N3 N5N3 changed the title Exclude the slow branch in Broadcast's copyto! Remove the slow branch in Broadcast's copyto! Jul 2, 2021
@N5N3
Member Author

N5N3 commented Jul 2, 2021

It seems the speed of the general broadcast kernel is strongly affected by the array's shape:
50*50 is almost 5 times faster than 5*500.
Maybe the exclude step in preprocess should be omitted in this case for better performance?
I'll run more tests.
Without the exclusion things look much better, so a better approach may be to implement a faster copyto_unaliased!?

@N5N3 N5N3 marked this pull request as draft July 2, 2021 01:34
@N5N3 N5N3 closed this Jul 2, 2021
@oscardssmith
Member

Why did you close this?

@N5N3
Member Author

N5N3 commented Jul 2, 2021

Just to revert the branch and instead try to speed up copyto_unalias!.
The acceleration is considerable on my desktop, but the code might be unclear (it includes some @simd that should not appear).

@N5N3 N5N3 reopened this Jul 2, 2021
@N5N3 N5N3 changed the title Remove the slow branch in Broadcast's copyto! A faster copyto_unalias! Jul 2, 2021
@N5N3
Member Author

N5N3 commented Jul 2, 2021

Oops, it seems we can't use @simd in abstractarray.jl, even though it speeds things up a lot.
I looked into the native code: without @simd, much of the generated code is not vectorized (even when copying one linearly indexed array to another).
I can't find a solution, so the optimization is limited to a .= b.
If the regression in copyto!(b, a) is resolved, this separate implementation could be reverted.

@N5N3 N5N3 marked this pull request as ready for review July 2, 2021 08:48
@N5N3 N5N3 changed the title A faster copyto_unalias! Speed up a .= b when axes(a) == axes(b) Jul 2, 2021
@@ -991,11 +991,17 @@ preprocess_args(dest, args::Tuple{}) = ()
# Specialize this method if all you want to do is specialize on typeof(dest)
@inline function copyto!(dest::AbstractArray, bc::Broadcasted{Nothing})
axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
# Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match
# Performance optimization: broadcast!(identity, dest, A) is equivalent to copyto!(dest, A) if indices match.
# However copyto!(dest, A) is very slow in many cases, implement a faster version here.
Member

Wouldn't it be better to speed up copyto!(dest, A)?

Member Author

@N5N3 N5N3 Jul 4, 2021

In fact, I tried, but the problem seems to be a bit more complicated, given the following facts:

  1. copyto!(::Array, ::Array) calls C's memmove, but that seems slower than a plain loop:
julia> a = randn(1000); b = similar(a);
julia> @btime copyto!($b,$a);                    # this calls C's memmove
  120.524 ns (0 allocations: 0 bytes)
julia> @btime copyto!(IndexLinear(),$b,IndexLinear(),$a);  # this calls Julia's copyto_unalias!
  74.512 ns (0 allocations: 0 bytes)

It seems we could use C's memcpy instead of memmove to solve this (see the sketch at the end of this comment).

  2. For a general AbstractArray, I can't find a way to accelerate copyto_unalias!(dest, A) without @simd:
using BenchmarkTools
a = randn(40,40); b = similar(a);
f1!(a,b) = @inbounds for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

f2!(a,b) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end
@btime f1!($a, $b)   # 247.132 ns (0 allocations: 0 bytes)
@btime f2!($a, $b)   # 220.548 ns (0 allocations: 0 bytes)  a little faster
@btime @views f1!($a[1:end,1:end], $b[1:end,1:end])  # 1.710 μs (0 allocations: 0 bytes)
@btime @views f2!($a[1:end,1:end], $b[1:end,1:end])  # 262.500 ns (0 allocations: 0 bytes)  much faster
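
On the memcpy point above, a hedged sketch of what a memcpy-based fast path could look like (copyto_memcpy! is a hypothetical name; it assumes dest and src are known not to alias, which is the guarantee copyto_unaliased! already has):

# memmove must tolerate overlapping memory; memcpy may assume there is none,
# which can skip the overlap check and take the fastest path.
function copyto_memcpy!(dest::Vector{Float64}, src::Vector{Float64})
    n = min(length(dest), length(src))
    GC.@preserve dest src ccall(:memcpy, Cvoid,
        (Ptr{Cvoid}, Ptr{Cvoid}, Csize_t),
        pointer(dest), pointer(src), n * sizeof(Float64))
    return dest
end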

Member

Why can't you use @simd?

Member Author

@N5N3 N5N3 Jul 4, 2021

The buildbot throws an error that @simd is not defined. I'm not familiar with the build system, but it seems to be a world-age problem? (abstractarray.jl is included before simdloop.jl in Base.jl)

Member

Sounds just like a bootstrapping problem, i.e. @simd is not defined until later in the build process. That should be fixable, e.g. by moving the copyto! definition to a file loaded later.

N5N3 added 3 commits July 5, 2021 08:58
Put SimdLoop in advance to avoid bootstrapping (I hope this works)
Focus on copyto_unalias!, put SimdLoop in advance.
@N5N3 N5N3 marked this pull request as draft July 5, 2021 01:25
@KristofferC
Sponsor Member

I think this is #38073 cc @kimikage
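
For reference, foo! and simdfoo! come from the linked issue; presumably they follow the same pattern as f1!/f2! above, e.g. (my reconstruction, not necessarily the exact code from #38073):

using BenchmarkTools

# Plain Cartesian-indexed copy loop.
foo!(b, a) = @inbounds for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end

# The same loop, with the @simd annotation added.
simdfoo!(b, a) = @inbounds @simd for i in eachindex(IndexCartesian(), a)
    b[i] = a[i]
end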

On 1.5:

julia> @btime foo!($b, $a) #  6.480 μs (0 allocations: 0 bytes)
  1.658 μs (0 allocations: 0 bytes)

julia> @btime simdfoo!($b, $a) #  2.000 μs (0 allocations: 0 bytes)
  2.368 μs (0 allocations: 0 bytes)

On 1.6:

julia> @btime foo!($b, $a) #  6.480 μs (0 allocations: 0 bytes)
  6.112 μs (0 allocations: 0 bytes)

julia> @btime simdfoo!($b, $a) #  2.000 μs (0 allocations: 0 bytes)
  1.768 μs (0 allocations: 0 bytes)

@N5N3
Member Author

N5N3 commented Jul 6, 2021

Some more benchmarks. On 4-, 5-, and 6-dimensional CartesianIndices iteration, 1.6.1 is consistently slower than 1.5.4, even comparing 1.6.1 with @simd against 1.5.4 without it.
But at least @simd no longer makes performance worse.

julia> a = randn(4,4,4,4,4,4); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 9.200 μs      1.5.4: 3.112 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 3.763 μs      1.5.4: 6.040 μs

julia> a = randn(5,5,5,5,5); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 6.000 μs      1.5.4: 2.144 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 2.378 μs      1.5.4: 3.287 μs

julia> a = randn(8,8,8,8); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 6.420 μs      1.5.4: 1.840 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 1.990 μs      1.5.4: 2.678 μs

julia> a = randn(16,16,16); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 5.833 μs      1.5.4: 961.857 ns
julia> @btime simdfoo!($b,$a);        #  1.6.1: 747.619 ns    1.5.4: 999.900 ns

julia> a = randn(64,64); b = similar(a); a = view(a,axes(a)...); b = view(b,axes(b)...);
julia> @btime foo!($b,$a);            #  1.6.1: 3.737 μs      1.5.4: 5.650 μs
julia> @btime simdfoo!($b,$a);        #  1.6.1: 512.500 ns    1.5.4: 699.306 ns

@N5N3
Member Author

N5N3 commented Jul 6, 2021

@mcabbott

There were concerns raised there about arrays like BitArray which alias themselves; I don't know whether or not this PR avoids that. (E.g. I don't know whether messing with Expr(:loopinfo) is identical to @simd.)

I tried removing every Expr(:loopinfo) generated by @simd while keeping the expanded loop body.
The results show that the speed is slightly affected in some cases, but the acceleration is still considerable.
Since the expanded code is equivalent, I think this is safe enough?
Maybe we could add a safe option for @simd, like:

@simd safe for i in 1:10
end

which only expands the loop but does not mark it as SIMD.
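
To illustrate, for a CartesianIndices loop the "expand but don't mark" transform would be roughly equivalent to hand-writing the copy like this (a sketch assuming at least one dimension; the real lowering lives in base/simdloop.jl):

function expanded_copy!(dest, src)
    ax = axes(src)
    inner = ax[1]
    outer = CartesianIndices(Base.tail(ax))
    for Iout in outer
        # The innermost loop is separated out manually, as @simd's expansion
        # does, but no Expr(:loopinfo) is attached, so LLVM gets no SIMD hint.
        for i in inner
            @inbounds dest[i, Iout] = src[i, Iout]
        end
    end
    return dest
end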

Some benchmarks:

size:(8, 8, 8, 8) -> size:(16, 16, 16)
Car -> Car : before: 23.8μs -> safesimd: 8.27μs -> simd: 7.6μs
Car -> Lin : before: 11.8μs -> safesimd: 1.91μs -> simd: 1.9μs
Lin -> Car : before: 10.5μs -> safesimd: 0.917μs -> simd: 1.03μs
Lin -> Lin : before: 1.25μs -> safesimd: 0.535μs -> simd: 0.535μs
size:(8, 8, 8, 7) -> size:(16, 16, 16)
Car -> Car : before: 20.9μs -> safesimd: 7.38μs -> simd: 6.78μs
Car -> Lin : before: 10.5μs -> safesimd: 1.7μs -> simd: 1.7μs
Lin -> Car : before: 9.2μs -> safesimd: 0.926μs -> simd: 0.928μs
Lin -> Lin : before: 1.04μs -> safesimd: 0.489μs -> simd: 0.493μs
size:(15, 16, 16) -> size:(8, 8, 8, 8)
Car -> Car : before: 20.5μs -> safesimd: 8.33μs -> simd: 7.08μs
Car -> Lin : before: 9μs -> safesimd: 1.62μs -> simd: 1.63μs
Lin -> Car : before: 11.8μs -> safesimd: 2.29μs -> simd: 2.3μs
Lin -> Lin : before: 1.11μs -> safesimd: 0.511μs -> simd: 0.514μs
size:(4, 4, 4, 4, 4, 4) -> size:(4, 4, 4, 4, 4, 4)
Car -> Car : before: 9.6μs -> safesimd: 3.96μs -> simd: 3.98μs
Car -> Lin : before: 18.9μs -> safesimd: 3.38μs -> simd: 3.58μs
Lin -> Car : before: 19.2μs -> safesimd: 5.57μs -> simd: 5.68μs
Lin -> Lin : before: 1.24μs -> safesimd: 0.539μs -> simd: 0.541μs
size:(5, 5, 5, 5, 5) -> size:(5, 5, 5, 5, 5)
Car -> Car : before: 6.24μs -> safesimd: 2.42μs -> simd: 2.56μs
Car -> Lin : before: 13μs -> safesimd: 2.19μs -> simd: 2.33μs
Lin -> Car : before: 11.1μs -> safesimd: 2.74μs -> simd: 2.73μs
Lin -> Lin : before: 0.913μs -> safesimd: 0.414μs -> simd: 0.413μs
size:(8, 8, 8, 8) -> size:(8, 8, 8, 8)
Car -> Car : before: 7.45μs -> safesimd: 2.1μs -> simd: 2.1μs
Car -> Lin : before: 12.1μs -> safesimd: 2.03μs -> simd: 1.92μs
Lin -> Car : before: 12.9μs -> safesimd: 2.49μs -> simd: 2.61μs
Lin -> Lin : before: 1.93μs -> safesimd: 0.55μs -> simd: 0.546μs
size:(16, 16, 16) -> size:(16, 16, 16)
Car -> Car : before: 4.3μs -> safesimd: 0.83μs -> simd: 0.822μs
Car -> Lin : before: 9.6μs -> safesimd: 0.914μs -> simd: 0.737μs
Lin -> Car : before: 11.4μs -> safesimd: 0.983μs -> simd: 1μs
Lin -> Lin : before: 1.93μs -> safesimd: 0.541μs -> simd: 0.538μs
size:(64, 64) -> size:(64, 64)
Car -> Car : before: 2.19μs -> safesimd: 0.54μs -> simd: 0.537μs
Car -> Lin : before: 7.82μs -> safesimd: 0.534μs -> simd: 0.538μs
Lin -> Car : before: 9.7μs -> safesimd: 0.585μs -> simd: 0.586μs
Lin -> Lin : before: 1.24μs -> safesimd: 0.537μs -> simd: 0.53μs

@JeffBezanson JeffBezanson added the arrays and performance labels Jul 6, 2021
N5N3 added 15 commits July 7, 2021 21:10
fix for 0d Cartesian AbstractArray. 
This version should be fast enough if the size of Cartesian array's first dim is larger than 16 (eltype Float64).
Fix for other IndexStyle. Only use manually expanded version when the size of 1st dim >=16
add test for expanded version.
fix white space, typo error; add test;
@vtjnash vtjnash requested a review from mbauman August 24, 2021 23:32
@N5N3 N5N3 closed this Oct 29, 2021