Linear Algebra derivatives are slow #154
Some breadcrumbs at #89 (comment) and subsequent comments.
@KristofferC beat me to the link. Playing around with linear algebra operations could be a fruitful endeavor, but can be tricky. Note the caveat there about not using globals (…).
You can play around with the chunk size by changing (…).
Here I have some questions; I hope they aren't too monologue-esque… What exactly is the advantage of using stack-allocated tuples? Concerning the allocations, I assume these happen mainly in elementary methods (…). But how is an (…)? I must be wrong somewhere… :D Maybe you can help me out? :)
P.S.: (…)
The difference is whether each Dual number stores a Vector on the heap for its partials, or whether a Dual number is just a contiguous chunk of memory. If each Dual number has a separate vector for its partials, then you need to allocate a new one for the resulting partials in every operation between Dual numbers. Chunking can reduce things like register spills, or more generally any bad effect of having too much data on the stack at once.
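A minimal sketch of the contrast (toy types of my own, not ForwardDiff's actual `Dual` implementation):

```julia
# Toy types (hypothetical, not ForwardDiff's code) contrasting the two
# storage strategies discussed above.

struct TupleDual{N,T}
    value::T
    partials::NTuple{N,T}   # stored inline, no separate heap object
end

struct VectorDual{T}
    value::T
    partials::Vector{T}     # a separate heap allocation per number
end

# Adding two tuple-based duals just builds a new tuple on the stack:
Base.:+(a::TupleDual{N,T}, b::TupleDual{N,T}) where {N,T} =
    TupleDual{N,T}(a.value + b.value, ntuple(i -> a.partials[i] + b.partials[i], N))

# Adding two vector-based duals must allocate a fresh Vector every time:
Base.:+(a::VectorDual, b::VectorDual) =
    VectorDual(a.value + b.value, a.partials .+ b.partials)

a = TupleDual{2,Float64}(1.0, (1.0, 0.0))
b = TupleDual{2,Float64}(2.0, (0.0, 1.0))
c = a + b   # value 3.0, partials (1.0, 1.0), no heap traffic
```

In a tight loop over many dual operations, the `VectorDual` version hits the allocator once per operation, which is exactly the overhead the tuple layout avoids.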
In axsk@2106e92 I optimized the idea from above: convert (…). Increasing the chunk size reduces the runtime for this specific use case beyond the 10-chunk border. I also tried adding a test case for matrix multiplication (axsk@7c4bc6b) to make use of the benchmarks, since I suspect this might be slower for small dimensions, but I could not get it to run yet…
For this specific case the maximum chunk size is probably good, since you only extract the partials into a heap-allocated array and back. The problem arises when a computation has many dual numbers live at the same time.
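The pack/multiply/unpack idea can be sketched like this (toy dual type and helper names are mine, not the actual code from the commit above): multiplying `A` by a vector of duals with `N` partials is equivalent to one matrix-matrix product `A * [values partials…]`, which runs as a single BLAS call instead of many scalar dual operations.

```julia
struct D{N}                  # toy dual: value plus N partials
    v::Float64
    p::NTuple{N,Float64}
end

# Pack duals into a length(x) × (N+1) matrix:
# column 1 holds the values, columns 2:N+1 the partials.
pack(x::Vector{D{N}}) where {N} =
    [j == 1 ? x[i].v : x[i].p[j-1] for i in eachindex(x), j in 1:N+1]

# Unpack result rows back into duals.
unpack(M::Matrix{Float64}, ::Val{N}) where {N} =
    [D{N}(M[i, 1], ntuple(j -> M[i, j+1], N)) for i in 1:size(M, 1)]

# One BLAS gemm instead of elementwise dual arithmetic.
mul_duals(A::Matrix{Float64}, x::Vector{D{N}}) where {N} =
    unpack(A * pack(x), Val(N))
```

This is also where the cost mentioned above shows up: `pack`/`unpack` themselves allocate and copy, so the win only materializes when the gemm dominates.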
What's the specific issue you're seeing? I only skimmed the code you provided (so I could've missed something), but it seems like such an optimization would only work in vector mode (i.e. when the chunk size equals the input dimension). Maybe that's what you're running into? Anyway, figuring out a way to optimize array functions would be great, and I appreciate you looking into it. I should warn you that it might not be worth the effort in the long run, though - I have a reverse-mode package in the works that should be far more efficient for these kinds of gradients, as it's highly amenable to linear algebra optimizations.
I'm going to close this, since you should really use ReverseDiff over ForwardDiff for linear algebra functions with input dimensions this large. ReverseDiff should be released very soon, and is stable and well-documented enough for people to start using it (just keep in mind you'll need to use the latest version of ForwardDiff). Here's what ReverseDiff looks like for the problem in the OP:

```julia
julia> using ReverseDiff, BenchmarkTools

julia> const A = rand(100_000, 300);

julia> f(x) = sum(A * x);

julia> const ∇f! = ReverseDiff.compile_gradient(f, rand(300));

julia> out, x = zeros(300), rand(300);

julia> @benchmark ∇f!($out, $x)
BenchmarkTools.Trial:
  memory estimate:  0.00 bytes
  allocs estimate:  0
  --------------
  minimum time:     27.603 ms (0.00% GC)
  median time:      27.881 ms (0.00% GC)
  mean time:        30.427 ms (0.00% GC)
  maximum time:     38.874 ms (0.00% GC)
  --------------
  samples:          165
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark sum($A, 1)
BenchmarkTools.Trial:
  memory estimate:  2.50 kb
  allocs estimate:  1
  --------------
  minimum time:     14.655 ms (0.00% GC)
  median time:      14.723 ms (0.00% GC)
  mean time:        14.892 ms (0.00% GC)
  maximum time:     25.454 ms (0.00% GC)
  --------------
  samples:          336
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
```
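For intuition on why reverse mode wins here (my gloss, not from the thread): `f` maps ℝ³⁰⁰ to ℝ, so reverse mode needs a single adjoint sweep, while forward mode needs one directional derivative per input dimension. For this particular `f` the adjoint sweep collapses to a single matrix-vector product:

```julia
# Hand-rolled reverse pass for f(x) = sum(A*x) — a sketch for intuition,
# not ReverseDiff's internals. With the seed ȳ = ∂f/∂y = 1 for y = A*x,
# the adjoint rule for matrix multiplication gives x̄ = A' * ȳ.
function grad_sum_Ax(A)
    ȳ = ones(size(A, 1))   # every output element contributes 1 to the sum
    return A' * ȳ          # one gemv yields the entire gradient
end
```

That is why the compiled ReverseDiff gradient above lands within a small factor of the `sum(A, 1)` baseline, while forward mode must effectively push 300 columns through `A * x`.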
Consider the following piece of code: (…)
The manual computation takes 0.09 seconds, but the ForwardDiff one takes 45s.
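The snippet itself did not survive above, but judging from the ReverseDiff reply elsewhere in this thread, the setup is presumably the one below (the slow call would be `ForwardDiff.gradient(f, x)`; timings are of course machine-dependent):

```julia
# Presumed setup of the OP, reconstructed from the ReverseDiff reply in
# this thread; the 45 s measurement would come from ForwardDiff.gradient(f, x).
const A = rand(100_000, 300)
f(x) = sum(A * x)

# "Manual" gradient: f is linear, f(x) = ones' * (A*x), so its gradient is
# the constant vector of column sums of A.  (`sum(A, 1)` in the syntax of
# the era; `dims=1` on current Julia.)
manual_grad(A) = vec(sum(A, dims=1))

x = rand(300)
g = manual_grad(A)   # the 0.09 s computation
```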
I agree that I am cheating with the manual derivative, since ForwardDiff has to take the Jacobian internally. But even `@time sum(A*eye(300), 1)` takes just 2.2 seconds, computing the whole Jacobian and reducing it to the gradient, which I think should be reachable by autodiff as well. I tried the following naive implementation (…), which should make use of fast matrix multiplication, but it takes nearly as long, mainly because most of the time is spent collecting the values/partials via the maps. Also note that for simplicity this works only for Chunks{1}.
Comparing `@time A*rand(300)` = 0.07 s vs. `@time A*rand(300, 10)` = 0.16 s, I still expect a ~4x speedup for Chunks{10} in the actual computation. This also makes me wonder why 10 is the maximal chunk size; I suspect that for this problem passing all partials at once should be most effective.
How would you go about speeding things up?