slowdown in threaded code from julia 1.2 to julia 1.4-DEV #121
Possibly related JuliaDSP/DSP.jl#339 |
In Julia 1.4, FFTW now uses partr threads (#105), which means that FFTW's internal threading composes with Julia's own task scheduler. Unfortunately, spawning a partr thread has a fairly large overhead (less than a physical hardware thread, but much more than e.g. a cilk thread, and far more than a subroutine call), so this leads to a slowdown for a threaded loop of small transforms. cc @vtjnash |
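A rough, non-FFTW illustration of that overhead (a minimal sketch; the exact numbers are machine-dependent):

using BenchmarkTools, Base.Threads

# A trivially cheap unit of work, standing in for a small transform.
f(x) = x + 1

@btime f(1)                          # direct call: on the order of nanoseconds
@btime fetch(Threads.@spawn f(1))    # spawn + schedule + fetch a partr task: on the order of microseconds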
Could PARTR explain why there is a sudden jump in both execution time and memory allocations between starting Julia with 2 threads and with 4?

# 2 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.84 MiB
allocs estimate: 8255
--------------
minimum time: 17.288 ms (0.00% GC)
median time: 17.457 ms (0.00% GC)
mean time: 17.661 ms (0.65% GC)
maximum time: 21.014 ms (0.00% GC)
--------------
samples: 283
evals/sample: 1
# 4 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 118.42 MiB
allocs estimate: 1308803
--------------
minimum time: 91.844 ms (0.00% GC)
median time: 125.869 ms (25.65% GC)
mean time: 128.205 ms (19.04% GC)
maximum time: 251.423 ms (14.35% GC)
--------------
samples: 39
evals/sample: 1
julia> FFTW.fftw_vendor
:fftw

MWE:

using Pkg; pkg"add DSP#master"
using DSP, BenchmarkTools
img = randn(1000,1000);
kernel = randn(35,35);
typeof(img)
typeof(kernel)
@benchmark DSP.conv($img, $kernel)

Julia version info
|
Note that if you are launching your own threads and want FFTW to execute its own plans serially, you can just do `FFTW.set_num_threads(1)`. |
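For instance (a minimal sketch, with arbitrary sizes; plan_fft and mul! are used the same way as in the examples below):

using FFTW, LinearAlgebra, Base.Threads

FFTW.set_num_threads(1)                 # make every FFTW plan execute serially

# One independent problem per Julia thread.
A = [rand(ComplexF64, 64, 64) for _ in 1:nthreads()]
Â = [similar(a) for a in A]
plans = [plan_fft(a) for a in A]

@threads for i in eachindex(A)
    mul!(Â[i], plans[i], A[i])          # serial FFT inside a user-managed thread
end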
Thanks, setting the number of threads manually worked out wonderfully :) |
I think I just ran into a case where this issue causes convolution code in DSP.jl to run 43x slower and use 85x more memory, although I'm not yet positive that the root cause is FFTW. Since this is such an issue for DSP convolution, which does not use multithreading itself, I would like to better understand what the current recommendation from FFTW.jl is, and why this is such a problem. The affected function in DSP.jl, |
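One quick way to test whether FFTW threading is the culprit (a hypothetical session, reusing the img/kernel from the MWE above; set_num_threads only affects plans created after the call):

using DSP, FFTW, BenchmarkTools

img = randn(1000, 1000);
kernel = randn(35, 35);

@btime DSP.conv($img, $kernel)   # default: threaded FFTW plans
FFTW.set_num_threads(1)          # force serial plans from here on
@btime DSP.conv($img, $kernel)   # compare against the threaded default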
A more minimal MWE that captures what's happening in DSP's convolution code:

using LinearAlgebra, FFTW, BenchmarkTools
function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
A = rand(8, 8);
@benchmark foo!(A)

With four threads I get:
with one thread I get:
|
Julia v1.4.1:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.559 ms (124 allocations: 8.59 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.904 s (124 allocations: 8.59 KiB)

Julia v1.0.5:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.586 ms (126 allocations: 8.75 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.882 s (126 allocations: 8.75 KiB) |
Oh sorry, my benchmarks were all on Julia 1.4.1, with |
Is there any way to get the current number of FFTW threads? |
Was FFTW's num_threads always set to the number of Julia threads, or has that changed recently? |
It seems like one workaround is making plans with the `FFTW.PATIENT` planning flag (sketched after this comment). #117 seems like it would be difficult to implement, since I can't find an accessor-like counterpart to `FFTW.set_num_threads`.

This seems to me like a very large performance regression when using FFTW.jl "out of the box." I understand that enabling threaded plans by default, without regard to the problem size, inherently involves a trade-off in performance between small and large problems. On the one hand, a 240x slowdown of small problems might not be that noticeable when the execution runtime was so short to begin with, while a ~4x (or however many cores a user has) speedup for large problems might translate into seconds saved. However, if people are making plans, they are probably reusing them many times, and a two-orders-of-magnitude slowdown for small problems can really add up.

Some mention of this performance regression in the docs might be helpful. I'll try to throw a PR together sometime if that would be helpful. |
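A minimal sketch of that workaround (the flags keyword is part of FFTW.jl's planning API; note that PATIENT planning itself takes noticeably longer than the default MEASURE):

using FFTW

A = rand(8, 8)
# PATIENT planning benchmarks more strategies, including different thread
# counts, and can settle on a serial plan for a problem this small.
p = plan_rfft(A; flags=FFTW.PATIENT)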
Convolutions in DSP currently rely on FFTW.jl, and a recent change in FFTW.jl (JuliaMath/FFTW.jl#105) has introduced a large performance regression in `conv` whenever Julia is started with more than one thread. Since v1 of FFTW.jl, multi-threaded FFTW transforms are used by default whenever Julia has more than one thread. This new default causes small FFT problems to run much more slowly and use much more memory. Since the overlap-save method of `conv` in DSP breaks a convolution into many small convolutions, and therefore performs a large number of small FFTW transforms, this change can make convolutions slower by two orders of magnitude, and similarly use two orders of magnitude more memory. While FFTW.jl does not provide an explicit way to set the number of threads used by an FFTW plan without changing a global variable, generating the plans with the planning flag set to `FFTW.PATIENT` (instead of the default `MEASURE`) allows the planner to consider changing the number of threads. Adding this flag to the plans generated by the overlap-save convolution method seems to fix the performance regression on multi-threaded instances of Julia.

Fixes JuliaDSP#399
Also see JuliaMath/FFTW.jl#121
Using the |
I have a package where I have different threads performing different FFTs, and I observe a significant slowdown in the latest julia-1.4-DEV.

Here is a MWE (a hypothetical reconstruction is sketched after this comment):

On my 2016 MacBook Pro I get these times, i.e. there is significant overhead with multiple threads at small array dimensions.

I am not sure whether this belongs to FFTW.jl or Julia Base, but replacing the LinearAlgebra.mul!(Â[i], plan, A[i]) line with anything else (e.g. A[i] .+= A[i]) does not incur the same penalty, so I am submitting the issue here.
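A hypothetical reconstruction of the MWE described above (one shared plan, each thread applying it via LinearAlgebra.mul!; the array size N and the iteration count are guesses):

using FFTW, LinearAlgebra, Base.Threads

N = 16                                  # small array dimension (a guess)
A = [rand(ComplexF64, N, N) for _ in 1:nthreads()]
Â = [similar(a) for a in A]
plan = plan_fft(A[1])                   # one plan shared across threads

@time @threads for i in eachindex(A)
    for _ in 1:10_000                   # iteration count is a guess
        LinearAlgebra.mul!(Â[i], plan, A[i])
    end
end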