
Linear Algebra derivatives are slow #154

Closed
axsk opened this issue Oct 20, 2016 · 8 comments

@axsk

axsk commented Oct 20, 2016

Consider the following piece of code:

A=rand(100_000,300)
x=rand(300)

f(x) = sum(A*x)

@time sum(A,1) # manually compute the gradient

@time ForwardDiff.gradient(f, x, ForwardDiff.Chunk{1}())

The manual computation takes 0.09 seconds, but the ForwardDiff one takes 45s.
I admit I am cheating with the manual derivative, since ForwardDiff has to compute the Jacobian internally. But even @time sum(A*eye(300), 1), which computes the whole Jacobian and reduces it to the gradient, takes just 2.2 seconds, and I think that should be reachable by autodiff as well...

I tried the following naive implementation:

function Base.:*{T,N}(A::Matrix{T}, b::Vector{ForwardDiff.Dual{N,T}})
  v = map(ForwardDiff.value, b)              # extract the primal values
  p = map(d -> d.partials.values[1], b)      # extract the single partial (Chunk{1} only)
  map((val, der) -> ForwardDiff.Dual(val, ForwardDiff.Partials((der,))), A*v, A*p)
end

which should make use of fast matrix multiplication, but it takes nearly as long, mainly because most of the time is spent in collecting the values/partials via the maps.

Also note that for simplicity this works only for Chunk{1}.
Comparing @time A*rand(300) (0.07s) with @time A*rand(300,10) (0.16s), I would still expect about a 4x speedup for Chunk{10} in the actual computation.
This also makes me wonder why 10 is the maximal chunk size; I suspect that for this problem passing all partials at once would be most effective.
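As a rough sketch (with A, f, and x as defined above, and absolute timings that will of course vary), the chunk sizes can be compared directly with the same API:

for N in (1, 5, 10)
    println("chunk size ", N)
    @time ForwardDiff.gradient(f, x, ForwardDiff.Chunk{N}())   # same call as above, varying chunk size
end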

How would you go about speeding things up?

@KristofferC
Collaborator

Some breadcrumbs at #89 (comment) and subsequent comments.

@jrevels
Member

jrevels commented Oct 20, 2016

@KristofferC beat me to the link. Playing around with linear algebra operations could be a fruitful endeavor, but can be tricky. Note that not using globals (A and x are globals here) sped your example up to run in 14s for me.
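For illustration, one way to avoid the globals is to make the captured array const (a sketch; passing A as an argument or building the closure inside a function works as well):

using ForwardDiff

const A = rand(100_000, 300)   # const global: the compiler knows its type
f(x) = sum(A * x)              # f is now type-stable

x = rand(300)
@time ForwardDiff.gradient(f, x, ForwardDiff.Chunk{1}())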

This also makes me wonder why 10 is the maximal chunksize, I suspect for this problem passing all partials at once should be most effective.

You can play around with changing the chunk size by changing MAX_CHUNK_SIZE in ForwardDiff's source and reloading the package. I played around with it for this example just now, and only got slowdowns for chunk sizes greater than 10. Keep in mind that the partials are stack-allocated. To use your suggestion, a chunk size of 300 would involve allocating a 300-tuple of Float64 for almost every value involved in the computation. This would be pretty bad performance-wise (and would probably still be bad even if we heap-allocated it instead).
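For a rough sense of scale (a back-of-the-envelope illustration, assuming Float64 values and the Dual{N,T}-with-tuple-partials layout discussed in this thread):

using ForwardDiff

sizeof(ForwardDiff.Dual{1,Float64})     # 16 bytes: one value plus one partial
sizeof(ForwardDiff.Dual{10,Float64})    # 88 bytes: one value plus ten partials
sizeof(ForwardDiff.Dual{300,Float64})   # 2408 bytes carried around per number at chunk size 300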

@axsk
Author

axsk commented Oct 21, 2016

Keep in mind that the partials are stack-allocated.

I have a few questions here; I hope they don't come across as too much of a monologue...

What exactly is the advantage of using stack-allocated tuples?
AFAIK, in terms of pure runtime (ignoring allocation/GC, e.g. access times) they should be as fast as heap-allocated arrays (up to size-specific optimizations, for which we have FixedSizeArrays).
In what way does cache behaviour differ, though?

Concerning the allocations, I assume these happen mainly in elementary methods (+, sin, ...)?
Thus I see the advantage for e.g. f(x::Real) = sin(x) + x^2.

But how is a Vector{Dual} allocated?
I am not at all sure about this, but my guess would be that when a Vector is passed on to another function, its values, i.e. the Duals, end up on the heap anyway. Wouldn't the stack-allocation advantage then be gone for vector-valued functions (up to the point where they dispatch to one of the elementary Real functions above), since they operate on the heap anyway?

I must be wrong by now ... :D

Maybe you can help me out :)?

P.S.:
Does chunking achieve anything else than increasing the number of cache hits?

@KristofferC
Collaborator

The difference is whether each Dual number stores a Vector on the heap for its partials or whether a Dual number is just a contiguous chunk of memory. If each Dual number has a separate vector for its partials, then you need to allocate a new one for the resulting partials in every operation between Dual numbers.

Chunking can reduce things like register spills, or more generally any bad effects of having too much data on the stack.
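A quick way to see the "contiguous chunk of memory" point (a sketch, assuming the Dual{N,T} type used elsewhere in this thread):

using ForwardDiff

D = ForwardDiff.Dual{3,Float64}
isbits(D)            # true: the value and the partials tuple form one flat, pointer-free block
xs = Vector{D}(10)   # one contiguous buffer of 10 * sizeof(D) bytes
isbits(eltype(xs))   # true, so the Duals are stored inline, not as pointers to heap objects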

@axsk
Author

axsk commented Oct 21, 2016

In axsk@2106e92 I optimized the idea from above: convert the Vector{Dual} to a matrix, use matrix multiplication for the Jacobian calculation, and convert back to a Vector{Dual} (roughly as sketched below).
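Roughly, the idea generalized to an arbitrary chunk size N looks like this (a sketch for clarity, not the exact code in the commit, and not tuned for peak speed):

function mul_duals{T,N}(A::Matrix{T}, b::Vector{ForwardDiff.Dual{N,T}})
    n = length(b)
    v = [ForwardDiff.value(b[i]) for i in 1:n]              # primal values
    P = [b[i].partials.values[j] for i in 1:n, j in 1:N]    # n x N matrix of partials
    Av = A * v                                              # one BLAS matrix-vector product
    AP = A * P                                              # one BLAS matrix-matrix product
    [ForwardDiff.Dual(Av[i], ForwardDiff.Partials((AP[i, :]...))) for i in 1:size(A, 1)]
end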

For @time testbigmult(100_000) I got a speedup from 44s (without any globals) to 5.9s.
The difference only becomes this extreme for n > 50000.
Here are some timings for tsn = [@elapsed(testbigmult(round(Int,n), 300)) for n in .^(10, 0:.5:5)]:

        n      new [s]       old [s]
      1.0   0.00125614   0.000297965
      3.0   0.00115709   0.00032128
     10.0   0.0024049    0.000862406
     32.0   0.00224434   0.00257956
    100.0   0.00505746   0.00762326
    316.0   0.0142125    0.0286606
   1000.0   0.0425227    0.0907818
   3162.0   0.164102     0.317257
  10000.0   0.529665     1.16811
  31623.0   2.00376      3.85566
 100000.0   5.93073     45.2478

Increasing the chunk size beyond 10 reduces the runtime further for this specific use case.

I also tried adding a test case for matrix multiplication (axsk@7c4bc6b) to make use of the benchmarks, since I suspect this approach might be slower for small dimensions, but I could not get it to run yet...
Maybe you can see why?

@KristofferC
Collaborator

For this specific case the maximum chunk size is probably good, since you only extract the partials into a heap-allocated array and back. The problem is when you have a computation with many Dual numbers live at the same time.

@jrevels
Member

jrevels commented Oct 21, 2016

I also tried adding a test case for matrix multiplication (axsk/ForwardDiff.jl@7c4bc6b) to make use of the benchmarks, since I suspect this approach might be slower for small dimensions, but I could not get it to run yet...
Maybe you can see why?

What's the specific issue you're seeing? I only skimmed the code you provided (so I could've missed something), but it seems like such an optimization would only work in vector-mode (i.e. when the chunk size equals the input dimension). Maybe that's what you're running into?

Anyway, figuring out a way to optimize array functions would be great, and I appreciate you looking into it. I should warn you that it might not be worth the effort in the long run, though - I have a reverse-mode package in the works that should be far more efficient for these kinds of gradients, as it's highly amenable to linear algebra optimizations.

@jrevels
Member

jrevels commented Dec 5, 2016

I'm going to close this, since you should really use ReverseDiff over ForwardDiff for linear algebra functions with input dimensions this large. ReverseDiff should be released very soon, and is stable and well-documented enough for people to start using it (just keep in mind you'll need to use the latest version of ForwardDiff).

Here's what ReverseDiff looks like for the problem in the OP:

julia> using ReverseDiff

julia> const A = rand(100_000, 300);

julia> f(x) = sum(A * x);

julia> const ∇f! = ReverseDiff.compile_gradient(f, rand(300));

julia> out, x = zeros(300), rand(300);

julia> @benchmark ∇f!($out, $x)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.603 ms (0.00% GC)
  median time:      27.881 ms (0.00% GC)
  mean time:        30.427 ms (0.00% GC)
  maximum time:     38.874 ms (0.00% GC)
  --------------
  samples:          165
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark sum($A, 1)
BenchmarkTools.Trial:
  memory estimate:  2.50 kb
  allocs estimate:  1
  --------------
  minimum time:     14.655 ms (0.00% GC)
  median time:      14.723 ms (0.00% GC)
  mean time:        14.892 ms (0.00% GC)
  maximum time:     25.454 ms (0.00% GC)
  --------------
  samples:          336
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

@jrevels jrevels closed this as completed Dec 5, 2016