Factorization twice as slow on 0.6 than 0.5 #21624

Closed
jebej opened this issue Apr 28, 2017 · 27 comments · Fixed by #21688
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)

Comments

@jebej
Contributor

jebej commented Apr 28, 2017

using BenchmarkTools
A = rand(40,40);
@benchmark sqrtm($A)

0.6: (55c97fb)

BenchmarkTools.Trial:
  memory estimate:  269.69 KiB
  allocs estimate:  26
  --------------
  minimum time:     3.855 ms (0.00% GC)
  median time:      3.999 ms (0.00% GC)
  mean time:        4.105 ms (0.53% GC)
  maximum time:     12.227 ms (15.49% GC)
  --------------
  samples:          1216
  evals/sample:     1

0.5.1:

BenchmarkTools.Trial:
  memory estimate:  269.69 KiB
  allocs estimate:  26
  --------------
  minimum time:     1.731 ms (0.00% GC)
  median time:      1.795 ms (0.00% GC)
  mean time:        1.868 ms (1.13% GC)
  maximum time:     9.239 ms (0.00% GC)
  --------------
  samples:          2653
  evals/sample:     1
@tkelman added the performance (Must go faster) and regression (Regression in behavior compared to a previous version) labels on Apr 28, 2017
@fredrikekre
Member

It's due to these regressions, as noted in #20993:

| ["linalg","factorization",("schur","Matrix",1024)] | 2.10 (45%) ❌ | 1.00 (1%) |
| ["linalg","factorization",("schur","Matrix",256)] | 2.36 (45%) ❌ | 1.00 (1%) |
| ["linalg","factorization",("schurfact","Matrix",1024)] | 2.09 (45%) ❌ | 1.00 (1%) |
| ["linalg","factorization",("schurfact","Matrix",256)] | 2.36 (45%) ❌ | 1.00 (1%) |

@jebej
Contributor Author

jebej commented Apr 29, 2017

Right. Do we have an idea what caused those? Isn't that all handled by LAPACK? A factor of two seems high.

@fredrikekre
Member

The code path has not changed, and neither has the LAPACK wrapper. But the factor of 2 is there:

julia> A = rand(256, 256); cA = complex(A);

julia> @btime LinAlg.LAPACK.gees!('V', copy($cA)); # 0.6
  200.070 ms (11 allocations: 2.14 MiB)

julia> @btime LinAlg.LAPACK.gees!('V', copy($cA)); # 0.5.1
  90.971 ms (11 allocations: 2.14 MiB)

@andreasnoack
Member

andreasnoack commented Apr 29, 2017 via email

@jebej
Contributor Author

jebej commented Apr 29, 2017

Nothing obvious shows up when profiling.
0.6:

194 .\REPL[2]:1; macro expansion
 3   .\array.jl:126; copy(::Array{Complex{Float64},2})
 191 .\linalg\lapack.jl:5586; gees!(::Char, ::Array{Complex...

0.5.1:

66 .\REPL[27]:1; macro expansion;
 65 .\linalg\lapack.jl:5584; gees!(::Char, ::Array{Complex{Float64},2})
 1  .\linalg\lapack.jl:5596; gees!(::Char, ::Array{Complex{Float64},2})
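
(For reference, a minimal sketch of how such a profile can be collected, reusing the same 256×256 complex matrix as in the earlier benchmarks:)

A = rand(256, 256); cA = complex(A)
LinAlg.LAPACK.gees!('V', copy(cA))    # run once so compilation is not profiled
Profile.clear()
@profile LinAlg.LAPACK.gees!('V', copy(cA))
Profile.print()                       # prints a call tree like the listings above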

@fredrikekre
Member

Same version of OpenBLAS?

Yes, I just compiled the defaults for master and 0.5.1, and the OpenBLAS version does not seem to have changed:

fredrik@fredrik-desktop:~$ cat julia-master/deps/openblas.version 
OPENBLAS_BRANCH=v0.2.19
OPENBLAS_SHA1=85636ff1a015d04d3a8f960bc644b85ee5157135

fredrik@fredrik-desktop:~$ cat julia-release/deps/openblas.version 
ifneq (,$(filter $(ARCH), powerpc64le ppc64le))
OPENBLAS_BRANCH=v0.2.19
OPENBLAS_SHA1=85636ff1a015d04d3a8f960bc644b85ee5157135
else
OPENBLAS_BRANCH=v0.2.18
OPENBLAS_SHA1=12ab1804b6ebcd38b26960d65d254314d8bc33d6
endif

@jebej changed the title from "sqrtm slower on 0.6 than 0.5" to "Factorization twice as slow on 0.6 than 0.5" on Apr 29, 2017
@jebej
Contributor Author

jebej commented Apr 29, 2017

Would it be an issue with the ccall then?

@KristofferC
Sponsor Member

These calls should be long enough for ccall overhead not to matter. Perhaps some difference in build flags used?

@jebej
Contributor Author

jebej commented May 1, 2017

I can't compile Julia from scratch (Windows), but is there an archive of old nightlies so I can identify when the problem started occurring? I think this is a pretty big issue that really should be fixed...

@tkelman
Contributor

tkelman commented May 1, 2017

julianightlies.s3.amazonaws.com goes back about a month

@stevengj
Member

stevengj commented May 1, 2017

I thought sqrtm was supposed to be faster in 0.6 from #20214.

@jebej
Contributor Author

jebej commented May 1, 2017

According to this report the regressions date from before March 8th.

@stevengj
Member

stevengj commented May 1, 2017

@jebej, #20214 was merged on Feb 3; can you check whether it is the source of the regression?

@jebej
Contributor Author

jebej commented May 1, 2017

I believe that PR was only for sqrtm of UpperTriangular matrices. The regression in the OP was for sqrtm of generic matrices, which uses schurfact, and thus gees.

@stevengj
Member

stevengj commented May 1, 2017

@jebej, schurfact is used to get an upper triangular matrix from an arbitrary matrix, at which point the sqrtm code calls sqrtm(UpperTriangular(SchurF[:T])). So, #20214 still seems relevant.
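
For concreteness, a rough sketch of that code path (0.6-era syntax; sqrtm_via_schur is my name for it, not the actual Base function):

# sqrtm for a general complex matrix via the Schur factorization:
# A = Z*T*Z' with T upper triangular, so sqrt(A) = Z*sqrt(T)*Z'.
function sqrtm_via_schur(A::Matrix{Complex128})
    S = schurfact(A)                   # calls LAPACK gees! under the hood
    R = sqrtm(UpperTriangular(S[:T]))  # triangular square root (the part #20214 sped up)
    return S[:Z] * R * S[:Z]'          # back-transform with the Schur vectors
end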

@stevengj
Member

stevengj commented May 1, 2017

Oh, but I see that you are specifically benchmarking gees above and still seeing a slowdown.

@jebej
Contributor Author

jebej commented May 1, 2017

You're right, my mistake. Yes, the slowdown seems to be with gees in that case, and there is a similar one with geevx.

@stevengj
Member

stevengj commented May 1, 2017

I can't reproduce the gees! slowdown on my Mac:

julia> A = rand(256, 256); cA = complex(A);

julia> @btime LinAlg.LAPACK.gees!('V', copy($cA));  # 0.5.1
  257.194 ms (11 allocations: 2.14 MiB)

julia> @btime LinAlg.LAPACK.gees!('V', copy($cA)); # 0.6
  251.435 ms (11 allocations: 2.14 MiB)

@stevengj
Member

stevengj commented May 1, 2017

Could it be a threading issue on your machine? Try BLAS.set_num_threads(1).
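
A minimal sketch of that check (0.6-era API), reusing the test matrix from the earlier benchmarks:

using BenchmarkTools
# Pin OpenBLAS to a single thread and re-run the benchmark to rule out
# a threading/affinity difference between the two builds.
LinAlg.BLAS.set_num_threads(1)
A = rand(256, 256); cA = complex(A)
@btime LinAlg.LAPACK.gees!('V', copy($cA));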

@jebej
Contributor Author

jebej commented May 1, 2017

Is the optimization level the same between macOS and Linux? I vaguely recall a change from -O3 to -O2 for some dependencies, but I can't find the issue.

@jebej
Contributor Author

jebej commented May 1, 2017

It's not just my machine; the slowdown is detected by Nanosoldier.

@fredrikekre
Member

I can't reproduce the gees! slowdown on my Mac

I just tried on my Mac as well and I still see a slowdown by a factor of 2...

@andreasnoack
Member

It might be useful if people reported versioninfo() as well. Maybe we can narrow down whether this only happens on some architectures.

@fredrikekre
Member

I can see a slowdown of 2x (or more) on the following three setups (comparing 0.5.1 and the release-0.6 branch):

julia> versioninfo()
Julia Version 0.6.0-pre.beta.460
Commit 63ae651* (2017-05-02 12:18 UTC)
Platform Info:
 OS: Linux (x86_64-linux-gnu)
 CPU: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
 WORD_SIZE: 64
 BLAS: libopenblas (USE64BITINT NO_AFFINITY HASWELL)
 LAPACK: libopenblas64_
 LIBM: libopenlibm
 LLVM: libLLVM-3.9.1 (ORCJIT, broadwell)

julia> versioninfo()
Julia Version 0.6.0-pre.beta.460
Commit 63ae651d91* (2017-05-02 12:18 UTC)
Platform Info:
 OS: macOS (x86_64-apple-darwin16.5.0)
 CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
 WORD_SIZE: 64
 BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
 LAPACK: libopenblas64_
 LIBM: libopenlibm
 LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

julia> versioninfo()
Julia Version 0.6.0-pre.beta.460
Commit 63ae651 (2017-05-02 12:18 UTC)
Platform Info:
 OS: Linux (x86_64-linux-gnu)
 CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
 WORD_SIZE: 64
 BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
 LAPACK: libopenblas64_
 LIBM: libopenlibm
 LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

@fredrikekre
Member

I am so confused. With this:

using BenchmarkTools
function benchmark()
    versioninfo()
    A = rand(256, 256)
    cA = complex(A)
    for i in 1:3
        println(minimum(@benchmark LinAlg.LAPACK.gees!('V', copy($cA))))
    end
end
benchmark()

I get the following on the same commit, the only difference being that one checkout is from the release-0.6 branch and the other is from the master branch.

609b3d1 checked out from master:

julia> benchmark()
Julia Version 0.6.0-pre.beta.459
Commit 609b3d1* (2017-05-02 04:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
TrialEstimate(252.737 ms)
TrialEstimate(247.091 ms)
TrialEstimate(218.330 ms)

609b3d1 checked out from release-0.6:

julia> benchmark()
Julia Version 0.6.0-pre.beta.459
Commit 609b3d1 (2017-05-02 04:05 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)
TrialEstimate(513.989 ms)
TrialEstimate(514.151 ms)
TrialEstimate(504.442 ms)

@fredrikekre
Member

So it turns out that my directory for the master branch has a fast version of libopenblas, while compiling from scratch (which I did for the release-0.6 checkout) produces a slow version. I just tried compiling everything from scratch and that also results in a slow version.

My conclusion is that some build flags must have changed, but did not trigger a recompile of the library in my master directory (which I have been updating continuously).

@ViralBShah
Member

Can you figure out what openblas incantations get the performance back? Is this something to do with DYNAMIC_ARCH detection?
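
One way to compare the two builds at runtime (a sketch relying on 0.6-era internals; openblas_get_config returns the string that versioninfo() shows on its BLAS: line):

# Print the OpenBLAS configuration the running Julia actually loaded, to see
# whether DYNAMIC_ARCH / the detected core differs between the fast and slow builds.
println(LinAlg.BLAS.vendor())               # e.g. :openblas64
println(LinAlg.BLAS.openblas_get_config())  # e.g. "USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell"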
