type-stable inner loop for sqrtm #20214

Merged: andreasnoack merged 11 commits into JuliaLang:master from felixrehren:fr-sqrt on Feb 3, 2017

Conversation

felixrehren
Contributor

As suggested by Ralph_Smith on Discourse.

On my machine: roughly a 15x speedup.

@vchuravy
Member

Do we have a benchmark for this in BaseBenchmarks?

@felixrehren
Contributor Author

felixrehren commented Jan 24, 2017

I can't find a benchmark. What would that look like?

function b_20214(n=100, m=127)
    t = 0.0
    for i in 1:n
        A = randn(m, m)
        A = UpperTriangular(schurfact(A'*A)[:Schur])   # upper-triangular Schur factor
        t += @elapsed sqrtm(A)                         # accumulate wall-clock time of sqrtm
    end
    t
end

@andreasnoack
Member

Take a look at https://github.com/JuliaCI/BaseBenchmarks.jl#contributing and https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/linalg/LinAlgBenchmarks.jl. The package will handle the timing. You should just provide the function.
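
For reference, a BaseBenchmarks entry boils down to a BenchmarkTools @benchmarkable stored in a benchmark group, which the harness then runs and times. A minimal sketch (the group key and setup here are illustrative, not the actual LinAlgBenchmarks layout):

using BenchmarkTools

const SUITE = BenchmarkGroup()

m = 127
X = randn(m, m)
A = UpperTriangular(schurfact(X'*X)[:Schur])   # upper-triangular test matrix, as in the snippet above
SUITE["sqrtm triangular"] = @benchmarkable sqrtm($A)

Calling run(SUITE) (or letting the BaseBenchmarks tooling do it) then handles the timing.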

@@ -1888,10 +1898,7 @@ function sqrtm{T}(A::UpperTriangular{T})
for j = 1:n
Member

@stevengj stevengj Jan 24, 2017

It seems like we should just have sqrtm(A) call a sqrtm{realmatrix}(A,::Val{realmatrix}) function with either Val{true} or Val{false} so that every operation on R is type-stable, and then you won't need floop.
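
A minimal sketch of that dispatch pattern (the realmatrix test and method bodies here are illustrative, not the PR's final code; the recurrence itself is sketched further down this thread):

function sqrtm{T}(A::UpperTriangular{T})
    # runtime decision, made once; each branch calls a type-stable method
    realmatrix = T <: Real && all(d -> d >= 0, diag(A))
    return realmatrix ? sqrtm(A, Val{true}) : sqrtm(A, Val{false})
end

function sqrtm{T,realmatrix}(A::UpperTriangular{T}, ::Type{Val{realmatrix}})
    # realmatrix is a compile-time constant here, so TT and R have concrete types
    TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))
    R = zeros(TT, size(A, 1), size(A, 1))
    # ... fill the upper triangle of R with the recurrence ...
    return UpperTriangular(R)
end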

Member

You can put @inbounds in front of this loop.

Contributor Author

Done both 👍

Contributor

You can just put the @inbounds in front of the for ... end block; no need for the extra begin ... end here. That is,

@inbounds for j = 1:n
    # loop body
end

@@ -1900,14 +1907,10 @@ end
function sqrtm{T}(A::UnitUpperTriangular{T})
Member

It seems like we could replace this with

sqrtm{T}(A::UnitUpperTriangular{T}) = sqrtm(A, Val{true})

in terms of the abovementioned method. Given a Val method that is type-stable, the only remaining reason to duplicate the code here seems to be to save the sqrt call for the diagonal elements, but I doubt this is worth the trouble since there are only O(n) such calls.

Contributor Author

Will look at this tomorrow

Contributor Author

Since UnitUpperTriangular{T} is not a subtype of UpperTriangular{T} (??), I think it would be

sqrtm(A::UnitUpperTriangular) = UnitUpperTriangular(sqrtm(UpperTriangular(A),Val{true}))

For some reason this is faster than the present method, which confuses me. This version involves two additional conversions and some unnecessary arithmetic operations in the loop (of order O(n) as you mention)! Are some ops on UnitUpperTriangular slower than the same op on UpperTriangular?

Will make this change later unless someone can enlighten me

Member

(Julia doesn't support inheritance of concrete types.)

Indexing expressions A[i,j] are slower for UnitUpperTriangular than for UpperTriangular, and both are slower than for Array, because it has to check if i <= j and (in the unit case) whether i == j.

Since by construction you only access the upper triangle, I would do:

B = A.data

at the beginning of the function (for both the UnitUpperTriangular and UpperTriangular methods), and then use B[i,j] rather than A[i,j].
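
For illustration, a minimal sketch of the real-case recurrence with that unwrapping applied (the names and details are illustrative, not the exact code in this PR; the r == 0 test and the division by R[i,i] + R[j,j] are as discussed elsewhere in this thread):

function sqrtm_sketch{T}(A::UpperTriangular{T})
    n = Base.LinAlg.checksquare(A)
    B = A.data                        # unwrap once; only the upper triangle is read below
    TT = typeof(sqrt(zero(T)))
    R = zeros(TT, n, n)
    @inbounds for j = 1:n
        R[j,j] = sqrt(B[j,j])
        for i = j-1:-1:1
            r = B[i,j] + zero(TT)     # promote so the accumulator has the element type of R
            @simd for k = i+1:j-1
                r -= R[i,k]*R[k,j]
            end
            r == 0 || (R[i,j] = r / (R[i,i] + R[j,j]))
        end
    end
    return UpperTriangular(R)
end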

Member

(There might be other methods operating on triangular matrix types that might benefit from a similar change.)

Member

The @inbounds means that it never checks where i and j are in 1:n. It still checks whether i ≤ j, since if i > j then it just returns zero rather than looking at A.data.

Timing such a small operation is tricky; I would normally recommend BenchmarkTools instead of @time:

julia> using BenchmarkTools

julia> A = rand(10,10); T = UpperTriangular(A);

julia> @btime $A[3,5]; @btime $T[3,5];
  1.377 ns (0 allocations: 0 bytes)
  1.650 ns (0 allocations: 0 bytes)

Member

@stevengj stevengj Jan 26, 2017

(It may be that the effect on the UpperTriangular case is too small to measure, however. Or maybe even LLVM is smart enough to figure out that i ≤ j is always true. But it wouldn't hurt to unwrap the type anyway.)

Contributor Author

After improving the type stability, the specialised implementation for UnitUpperTriangular is faster than converting.
Now, on 0.5.0 using @benchmark, I see a slowdown when unwrapping the type first in sqrtm. Relatedly, on my machine (juliabox.com) there appears to be a slowdown for the individual indexing operations:

@benchmark $A[1,2]
  median time:      2.650 ns
@benchmark $T[1,2]
  median time:      2.223 ns

(what does the $ do here?)
This persists when wrapping these calls in a function. All in all I am hesitant to make the unwrapping change for sqrtm (though I'm not against it).

Member

How did you unwrap the type? I hope you didn't do A = A.data, which hurts type stability.

I can't reproduce your benchmark results. Maybe the benchmarks aren't reliable for such a short operation, and one needs to put a bunch of indexing operations in a loop? (Probably you also want to time them with @inbounds). The $ splices the value of the variable directly into the expression, which gets rid of the penalty for benchmarking with a global variable (it means the compiler does not have to do runtime type inference to find the type of A).

Member

For example:

julia> function sumupper(A)
           s = zero(eltype(A))
           for j = 1:size(A,2), i = 1:j
               s += A[i,j]
           end
           return s
       end
sumupper (generic function with 1 method)

julia> A = rand(100,100); T = UpperTriangular(A);

julia> using BenchmarkTools

julia> sumupper(A) == sumupper(T)
true

julia> @btime sumupper($A); @btime sumupper($T);
  4.122 μs (0 allocations: 0 bytes)
  5.102 μs (0 allocations: 0 bytes)

@kshyatt kshyatt added the linear algebra and performance labels on Jan 24, 2017
@felixrehren
Contributor Author

felixrehren commented Jan 25, 2017

@stevengj great idea, implemented 👍 In my testing, I found that floop still offered a performance improvement when used in conjunction (which I couldn't replicate by "manually inlining" floop). Not sure why that would be, but for the time being I left the use of floop as is. Happy to modify further.

@andreasnoack I'm preparing a PR for BaseBenchmarks 👍

@@ -1889,14 +1888,18 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A::UpperTriangular,Val{realmatrix})
Contributor

the ::UpperTriangular type assertion isn't usually used on call sites

@@ -1879,19 +1888,20 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A,Val{realmatrix})
end
function sqrtm{T,realmatrix}(A::UpperTriangular{T},::Type{Val{realmatrix}})
if realmatrix
TT = typeof(sqrt(zero(T)))
else
TT = typeof(sqrt(complex(-one(T))))
Member

Just use sqrt(complex(zero(T))) here, since one(T) may at some point return a dimensionless value if T is dimensionful.

@@ -1867,7 +1867,16 @@ function logm{T<:Union{Float64,Complex{Float64}}}(A0::UpperTriangular{T})
end
logm(A::LowerTriangular) = logm(A.').'

function sqrtm{T}(A::UpperTriangular{T})
function floop(x,R,i::Int,j::Int)
r = x
Member

This is not type stable if A[i,j] and R[i,j] are not the same type. A simple fix is r = x + zero(eltype(R))
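
For illustration, a guess at the shape of floop with that fix applied (based on the diff context above and the inner loop shown later in this thread; not necessarily the PR's exact helper):

function floop(x, R, i::Int, j::Int)
    r = x + zero(eltype(R))   # promote once, so r has a concrete type even if x and eltype(R) differ
    @simd for k = i+1:j-1
        r -= R[i,k]*R[k,j]
    end
    return r
end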

@stevengj
Member

@felixrehren, have you tried @code_warntype to see if there are any type instabilities in sqrtm(A, Val{...})?
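
A quick way to check (a minimal sketch; the Val method signature is the one proposed in this PR):

julia> A = UpperTriangular(triu(rand(4,4)) + eye(4));   # positive diagonal, so a real square root exists

julia> @code_warntype sqrtm(A, Val{true})   # scan the output for Any or Union{...} annotations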

for j = 1:n
R[j,j] = realmatrix ? sqrt(A[j,j]) : sqrt(complex(A[j,j]))
for i = j-1:-1:1
r = A[i,j] + zero(TT)
Member

@stevengj stevengj Jan 25, 2017

Even simpler:

r::TT = A[i,j]

Though (like a lot of our generic linear-algebra code), this isn't right for dimensionful quantities, because r has the units of A whereas R has the units of sqrt(A). If we want to get that right, it should really be something like:

TA = realmatrix ? float(T) : Complex{float(T)}

and then use r::TA.

Contributor Author

@felixrehren felixrehren Jan 26, 2017

To be clear, do you suggest replacing

TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))

with

TT = realmatrix ? float(T) : Complex{float(T)}

or would TA be supplemental to TT? Because r should have the units of R = sqrtm(A)?

Member

TA would be supplemental to TT, because A and R have different units.

Member

Instead of introducing a new type variable, you can simply do

r = A[i,j] - zero(TT)*zero(TT)

which matches the operation in the loop and is therefore correct. We should avoid explicit float calls if possible, since that requires float to be implemented correctly for new types (which might not be the case).

Contributor Author

I hope this is addressed -- I guess we should write some tests? (for a different PR ...)

end
n = checksquare(A)
n = Base.LinAlg.checksquare(A)
R = zeros(TT, n, n)
Member

@stevengj stevengj Jan 25, 2017

~~My inclination would be to use R = Matrix{TT}(n, n) here. Then get rid of the r == 0 check below. sqrtm is unlikely to return a sparse result, so I don't see the point of pre-initializing the array to zero.~~

Member

Oh, nevermind, I forgot that this is for the upper-triangular case. Then I guess initializing it to zero makes sense.

@felixrehren
Contributor Author

felixrehren commented Jan 26, 2017

@stevengj Thanks for the comments, I am also learning a lot!
I used @code_warntype and, with the suggestions you made that are already implemented, there are no more warnings or Unions in the output of sqrtm(A, Val{realmatrix}).

I think the AppVeyor failure is unrelated?

Error in testset subarray:
Error During Test
  Test threw an exception of type ErrorException
  Expression: X[1,1:end - 2] == @views(X[1,1:end - 2])
  Partial linear indexing is deprecated. Use `reshape(A, Val{2})` to make the dimensionality of the array match the number of indices.

@stevengj
Member

stevengj commented Jan 26, 2017

The "Partial linear indexing" thing was just fixed by #20242

replace `TT` by `t` for the type of the sqrt of a variable of type `T`
introduce `tt` as the type of the square of a variable of type `t`
N.B. `tt` is not always the same as `T`, it could be `Complex{T}`
In the `UnitUpperTriangular` case, some of the complexity should fall away; TODO?
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r/one)
Contributor

one won't be equivalent to the former (R[i,i] + R[j,j]), will it?

Contributor Author

You're right, thanks!

Contributor

why is it equivalent to 2*R[1,1]?

Contributor Author

Because this is for the UnitUpperTriangular case only, R[i,i] == R[1,1] for all i.

Contributor

@tkelman tkelman Jan 26, 2017

Ah right. Missed the UnitUpperTriangular since that line was on the other side of a long conversation.

correct previous change, thx tkelman
@ViralBShah
Member

ViralBShah commented Jan 27, 2017

I know this is not a blocker for 0.6, but it would be nice to get it in as we plan for an alpha release. Marking it 0.6 mainly as a reminder, but feel free to take off the 0.6 tag if necessary.

@ViralBShah ViralBShah added this to the 0.6.0 milestone Jan 27, 2017
@tkelman tkelman removed this from the 0.6.0 milestone Jan 27, 2017
@tkelman
Contributor

tkelman commented Jan 27, 2017

Please don't put "nice to have" on the milestone.

@felixrehren
Contributor Author

Think this is it -- @stevengj, @andreasnoack ?

@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = half*r)
Contributor

the old version of this function might not have worked correctly in this case either, but would this do the right thing if the element type was not commutative under multiplication?

Member

@stevengj stevengj Jan 27, 2017

See my comment above for the general case. However, even for non-commutative algebras the case here is special because R[i,i] is the identity, which always commutes with everything. So, the problem reduces to solving R[i,j] + R[i,j] = r. If the elements form a field over the reals, then the solution R[i,j] = r/2 is correct.

The case here is also wrong if e.g. the elements are matrices, in which case 1/(2*R[1,1]) will throw an error, although if you do half = inv(2*R[1,1]) it will work. However, it would be even better to just do R[i,j] = r/2 in that case, since it would avoid the matrix multiplication.

However, it is still wrong if the elements are some other number field where you can't do /2. The correct, general solution seems like it should be

half = inv(R[1,1] + R[1,1]) # don't use 1/(2R[1,1]) or half=0.5, to handle general algebraic types

@stevengj
Member

(Note that this PR can go in after the feature freeze since it is just a performance optimization.)

@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r / (R[i,i] + R[j,j]))
Member

@stevengj stevengj Jan 27, 2017

Per @tkelman's comment below, for non-commutative number types (e.g. quaternions), this is not right.

I haven't gone through the algebra very carefully, but I think the final R[i,j] needs to solve R[i,i]*R[i,j] + R[i,j]*R[j,j] == r, i.e. a Sylvester equation. So, the right thing to do seems like:

R[i,j] = sylvester(R[i,i], R[j,j], -r)

and then add a method

sylvester(a::Union{Real,Complex},b::Union{Real,Complex},c::Union{Real,Complex}) = 
    (-c) / (a + b)

for the commutative Number cases. New Number types will then have to provide their own sylvester method if they want sqrtm to work. (We already have sylvester methods for matrices thanks to #7435.)

sqrtm(A,Val{realmatrix})
end
# solve the sylvester equation a*x + x*b + c for x when a,b,x are commutative numbers. PR#20214
sylvester(a::Union{Real,Complex},b::Union{Real,Complex},c::Union{Real,Complex}) = -c / (a + b)
Member

Probably this sylvester definition should go into complex.jl or similar? And then you'll need to make sure that the LinAlg module imports Base.sylvester so that it extends the base definition.

Member

I would parenthesize (-c). That way if you call sylvester(a, b, -r), the compiler will have a good shot at noticing that the two negations cancel.

Member

@stevengj stevengj Jan 27, 2017

Nevermind, it looks like LLVM is smart enough to eliminate the double negation either way:

julia> syl(a,b,c) = -c / (a + b)
syl (generic function with 1 method)

julia> foo(a,b,c) = syl(a,b,-c)
foo (generic function with 1 method)

julia> @code_llvm foo(1.0,2.0,3.0)

define double @julia_foo_65538(double, double, double) #0 !dbg !5 {
top:
  %3 = fadd double %0, %1
  %4 = fdiv double %2, %3
  ret double %4
}

@stevengj
Member

LGTM except that the definition of sylvester should go elsewhere. Maybe in linalg/dense.jl with the other sylvester methods for now.

@stevengj
Member

Also, you don't need to put PR #.... comments on lines... that's what git blame is for.

@felixrehren
Contributor Author

@stevengj Ok, will do. Should I keep any of the comments or clean it all up?

@stevengj
Member

You should have a comment on code that does something in a non-obvious way, to make sure no one changes it without realizing, but your comments don't have to mention the PR.

@felixrehren
Contributor Author

Think all comments so far are addressed, let me know if it could use further improvements

@stevengj
Member

stevengj commented Feb 3, 2017

LGTM. Is the overall speedup still 15x?

@felixrehren
Contributor Author

With all the nice tweaks, I see a 20x speedup in this benchmarking notebook on Julia 0.5.

@andreasnoack andreasnoack merged commit 9c067b6 into JuliaLang:master Feb 3, 2017
@felixrehren felixrehren deleted the fr-sqrt branch February 3, 2017 15:06
Labels: linear algebra, performance

8 participants