slowdown in threaded code from julia 1.2 to julia 1.4-DEV #121
Possibly related JuliaDSP/DSP.jl#339 |
In Julia 1.4, FFTW now uses partr threads (#105), which means that FFTW's internal threading composes with Julia's own task scheduler. Unfortunately, spawning a partr thread has a fairly large overhead (less than a physical hardware thread, but much more than e.g. a cilk thread, and far more than a subroutine call), so this leads to a slowdown for a threaded loop of small transforms. cc @vtjnash |
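A rough, non-FFTW illustration of that overhead (a minimal sketch; the exact numbers are machine-dependent):

using BenchmarkTools, Base.Threads

# A trivially cheap unit of work, standing in for a small transform.
f(x) = x + 1

@btime f(1)                          # direct call: on the order of nanoseconds
@btime fetch(Threads.@spawn f(1))    # spawn + schedule + fetch a partr task: on the order of microseconds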
Could PARTR explain why there is a sudden jump in both execution time and memory allocations between starting Julia with 2 threads and with 4?

# 2 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 10.84 MiB
allocs estimate: 8255
--------------
minimum time: 17.288 ms (0.00% GC)
median time: 17.457 ms (0.00% GC)
mean time: 17.661 ms (0.65% GC)
maximum time: 21.014 ms (0.00% GC)
--------------
samples: 283
evals/sample: 1
# 4 threads
julia> @benchmark DSP.conv($img, $kernel)
BenchmarkTools.Trial:
memory estimate: 118.42 MiB
allocs estimate: 1308803
--------------
minimum time: 91.844 ms (0.00% GC)
median time: 125.869 ms (25.65% GC)
mean time: 128.205 ms (19.04% GC)
maximum time: 251.423 ms (14.35% GC)
--------------
samples: 39
evals/sample: 1
julia> FFTW.fftw_vendor
:fftw

MWE:

using Pkg; pkg"add DSP#master"
using DSP, BenchmarkTools
img = randn(1000,1000);
kernel = randn(35,35);
typeof(img)
typeof(kernel)
@benchmark DSP.conv($img, $kernel)

Julia version info
|
Note that if you are launching your own threads and want FFTW to execute its own plans serially, you can just do `FFTW.set_num_threads(1)`. |
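For instance (a minimal sketch, with arbitrary sizes; plan_fft and mul! are used the same way as in the examples below):

using FFTW, LinearAlgebra, Base.Threads

FFTW.set_num_threads(1)                 # make every FFTW plan execute serially

# One independent problem per Julia thread.
A = [rand(ComplexF64, 64, 64) for _ in 1:nthreads()]
Â = [similar(a) for a in A]
plans = [plan_fft(a) for a in A]

@threads for i in eachindex(A)
    mul!(Â[i], plans[i], A[i])          # serial FFT inside a user-managed thread
end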
Thanks, setting the number of threads manually worked out wonderfully :) |
I think I just ran into a case where this issue causes convolution code in DSP.jl to run 43x slower and use 85x more memory, although I'm not yet positive that the root cause is FFTW. Since this is such an issue for DSP convolution, which does not use multithreading itself, I would like to better understand what the current recommendation from FFTW.jl is, and why this is such a problem. The affected function in DSP.jl, |
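One quick way to test whether FFTW threading is the culprit (a hypothetical session, reusing the img/kernel from the MWE above; set_num_threads only affects plans created after the call):

using DSP, FFTW, BenchmarkTools

img = randn(1000, 1000);
kernel = randn(35, 35);

@btime DSP.conv($img, $kernel)   # default: threaded FFTW plans
FFTW.set_num_threads(1)          # force serial plans from here on
@btime DSP.conv($img, $kernel)   # compare against the threaded default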
A more minimal MWE that captures what's happening in DSP's convolution code:

using LinearAlgebra, FFTW, BenchmarkTools
function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
A = rand(8, 8);
@benchmark foo!(A)

With four threads I get:
with one thread I get:
|
Julia v1.4.1:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.559 ms (124 allocations: 8.59 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.904 s (124 allocations: 8.59 KiB)

Julia v1.0.5:

julia> using LinearAlgebra, FFTW, BenchmarkTools, Base.Threads
julia> nthreads()
1
julia> function foo!(input)
s = size(input)
fbuff = similar(input, Complex{eltype(input)}, (div(s[1], 2) + 1, s[2]))
p = plan_rfft(input)
ip = plan_brfft(fbuff, s[1])
for i in 1:53000
mul!(fbuff, p, input)
mul!(input, ip, fbuff)
end
return input
end
foo! (generic function with 1 method)
julia> A = rand(8, 8);
julia> FFTW.set_num_threads(1)
julia> @btime foo!($A);
12.586 ms (126 allocations: 8.75 KiB)
julia> FFTW.set_num_threads(4)
julia> @btime foo!($A);
2.882 s (126 allocations: 8.75 KiB) |
Oh sorry, my benchmarks were all on Julia 1.4.1, with |
Is there any way to get the current number of FFTW threads? |
Was FFTW's num_threads always set to the number of Julia threads, or has that changed recently? |
It seems like one workaround is making plans with the `FFTW.PATIENT` planning flag (sketched after this comment). #117 seems like it would be difficult to implement, since I can't find an accessor-like counterpart to `FFTW.set_num_threads`.

This seems to me like a very large performance regression when using FFTW.jl "out of the box." I understand that enabling threaded plans by default, without regard to the problem size, inherently involves a trade-off in performance between small and large problems. On the one hand, a 240x slowdown of small problems might not be that noticeable when the execution runtime was so short to begin with, while a ~4x (or however many cores a user has) speedup for large problems might translate into seconds saved. However, if people are making plans, they are probably reusing them many times, and a two-orders-of-magnitude slowdown for small problems can really add up.

Some mention of this performance regression in the docs might be helpful. I'll try to throw a PR together sometime if that would be helpful. |
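A minimal sketch of that workaround (the flags keyword is part of FFTW.jl's planning API; note that PATIENT planning itself takes noticeably longer than the default MEASURE):

using FFTW

A = rand(8, 8)
# PATIENT planning benchmarks more strategies, including different thread
# counts, and can settle on a serial plan for a problem this small.
p = plan_rfft(A; flags=FFTW.PATIENT)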
Convolutions in DSP currently rely on FFTW.jl, and a recent change in FFTW.jl (JuliaMath/FFTW.jl#105) has introduced a large performance regression in `conv` whenever Julia is started with more than one thread. Since v1 of FFTW.jl, multi-threaded FFTW transforms are used by default whenever Julia has more than one thread. This new default causes small FFT problems to run much more slowly and use much more memory. Since the overlap-save method of `conv` in DSP breaks a convolution into many small convolutions, and therefore performs a large number of small FFTW transforms, this change can make convolutions slower by two orders of magnitude, and similarly use two orders of magnitude more memory. While FFTW.jl does not provide an explicit way to set the number of threads used by an FFTW plan without changing a global variable, generating the plans with the planning flag set to `FFTW.PATIENT` (instead of the default `MEASURE`) allows the planner to consider changing the number of threads. Adding this flag to the plans generated by the overlap-save convolution method seems to fix the performance regression on multi-threaded instances of Julia.

Fixes JuliaDSP#399
Also see JuliaMath/FFTW.jl#121
Using the |
I have a package where I have different threads performing different FFTs, and I observe a significant slowdown in the latest julia-1.4-DEV.

Here is a MWE (a hypothetical reconstruction is sketched after this comment):

On my 2016 MacBook Pro I get these times, i.e. there is significant overhead with multiple threads at small array dimensions.

I am not sure whether this belongs to FFTW.jl or Julia Base, but replacing the LinearAlgebra.mul!(Â[i], plan, A[i]) line with anything else (e.g. A[i] .+= A[i]) does not incur the same penalty, so I am submitting the issue here.
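A hypothetical reconstruction of the MWE described above (one shared plan, each thread applying it via LinearAlgebra.mul!; the array size N and the iteration count are guesses):

using FFTW, LinearAlgebra, Base.Threads

N = 16                                  # small array dimension (a guess)
A = [rand(ComplexF64, N, N) for _ in 1:nthreads()]
Â = [similar(a) for a in A]
plan = plan_fft(A[1])                   # one plan shared across threads

@time @threads for i in eachindex(A)
    for _ in 1:10_000                   # iteration count is a guess
        LinearAlgebra.mul!(Â[i], plan, A[i])
    end
end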