Giant rewrite for Julia v0.5 #102

Merged: 42 commits from giant-refactor into master, Jun 15, 2016

Conversation

@jrevels (Member) commented on Feb 8, 2016

There's been a bunch of discussion between ForwardDiff developers about where this package should go, and with v0.5 on the way, I think now is the right time to start implementing some of these changes. A lot of these changes are breaking and depend on each other, so it's unfortunately going to be one monolithic PR if we want to avoid intermittent breakage.

After this PR, it will be time to upgrade ForwardDiff's version from v0.1.x to v0.2.x. We can maintain a release-0.1 branch for backporting purposes.

This PR is only about 2/3 of the way complete, but aims to fix #36, #88, #92, #94, #96. It paves the way for us to address #83, #89, and #98.

Major Changes:

  • Drop Vector storage for epsilon components. This reduces indirection in Partials code that had to handle multiple container types. I have yet to see a case where Tuple storage wasn't faster due to GC overhead.
  • Consolidate all ForwardDiffNumber types into the new type Dual{N,T}, which is structured similarly to the former GradientNumber. HessianNumber and TensorNumber have been replaced by the ability to nest Duals. This allows for Tuple storage of higher-order partial components, cuts out a lot of code, and should allow for more cache-friendly higher-order API methods (since indexing patterns will be more straightforward).
  • All @generated functions have been removed, and the API has been simplified by introducing the Chunk immutable and allowing subtypes of ForwardDiffResults to be passed as mutable arguments to the API functions. The documentation will explain this in more detail, once I write it.
  • Remove the function-generating API functions. I think that it's now easier and more transparent for people to define their own closures, e.g. j(x) = ForwardDiff.jacobian(f, x).
  • The code is now generic enough that higher-order/higher-dimensional derivatives can be written using lower-order API functions. Thus, tensor/tensor! has been removed, but is still easily implementable by users (see the sketch following this list).
  • Experimental multithreading support for parallel chunk-mode can be enabled on some API functions by passing in multithread = true. I'm getting a 2x speed up vs. the single-threaded implementation when using 4 threads to take the gradient of rosenbrock with large (> 10,000 elements) input vectors. I haven't benchmarked enough to know how this scales, though I'd expect it to asymptote to N-times speed-up for N threads as the input length goes to infinity.
  • The caching layer is thread-safe for multithreaded API functions, and now features some optimizations proposed by @KristofferC to reduce the number of key lookups per API function call.
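
For illustration of the higher-order point above, here is a hedged sketch (assuming the post-rewrite API described in this list; tensor_like is a hypothetical user-defined name, not part of the package) of recovering tensor-style third-order derivatives by nesting lower-order calls:

using ForwardDiff

# Third-order derivatives of a scalar-valued f, obtained by
# differentiating the Hessian once more. The result is an N^2-by-N
# matrix holding the flattened third-order tensor.
tensor_like(f, x) = ForwardDiff.jacobian(y -> ForwardDiff.hessian(f, y), x)

f(x) = sum(x.^3)
tensor_like(f, rand(3))  # 9x3 matrix of third derivatives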

TODO:

  • jacobian/jacobian!
  • hessian/hessian!
  • API Tests
  • Do some benchmarking. From my naive benchmarks, this branch's gradient methods perform better than master w.r.t. both time and memory when using Julia v0.4, and way better when using Julia v0.5. I'd like to get a similar speed-up with the other API methods, which I've yet to re-implement.
  • Update documentation, and provide a guide for upgrading from v0.1.x to v0.2.x

Major Changes:

- Drop Vector storage for Partials. This reduces indirection in code that had to handle multiple container types. I have yet to see a case where Tuple storage wasn't faster due to GC overhead.

- Consolidate all ForwardDiffNumber types into the new type DiffNumber, which is structured like a GradientNumber. HessianNumber and TensorNumber have been replaced by the ability to nest DiffNumbers. This allows for Tuple storage of higher-order partial components, and cuts out a lot of code.

- API functions have been replaced by macros to allow type stable passage of constant kwargs. This includes some small breaking changes. For example, the target function is ALWAYS the first argument to the new API functions (to maintain consistency with other higher-order mutation functions in Base, like `map!`). Additionally, some kwarg names have changed.

- experimental multithreading support can be enabled on some API functions by passing in `multithread = true`

- the caching layer has been simplified, and is now thread-safe.
@@ -1,6 +1,5 @@
language: julia
julia:
- 0.4
Contributor:

Is this branch dropping support for 0.4? Why?

Member Author (@jrevels):

Threads doesn't exist in 0.4, so trying to load the package will fail. I guess I could wrap all the multithreading code in conditionals against VERSION.
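
A minimal sketch of such a gate (the fallback stub below is hypothetical, not this PR's actual code):

# Only pull in Base.Threads on Julia versions that ship it; on 0.4,
# provide a stub so downstream code can call nthreads() uniformly.
if VERSION >= v"0.5-"
    using Base.Threads
else
    nthreads() = 1
end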

Contributor:

Yes please, otherwise JuMP (and anyone who has JuMP installed) will be stuck on ForwardDiff 0.1 for a long time.

@KristofferC (Collaborator):

Cool stuff! I am impressed you managed to reduce the different Number types to one.

Some thoughts about the cache implementation. From previous experience, I have found that caching arrays in a Dict is in some cases just too slow. Especially now, with the removal of the vector-backed gradient numbers, the cached vectors will tend to be small, so the Dict overhead will be significant.

One possible idea to improve performance of the cache without any impact on the API:

Instead of having the cache hold arrays, we let it hold types that contain as many arrays as are needed for the problem (jacobian, gradient, etc.). The accesses to the dict (get_workvec!, get_partials!, get_zeros!) would collapse into one call to the dict, i.e. get_cache. This would look in the cache to see if a type for the combination of relevant parameters is stored. If not, a vector of new instances of the needed type is created, stored, and returned (a vector of the same length as the number of threads). This type would contain all the needed arrays (workvec, partials, zeros), which can simply be accessed as fields for free.

This requires a bit of extra code to implement the different caches but would significantly reduce the number of typeasserts and calls to the dict needed.
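
A hedged sketch of that layout (all names hypothetical; see the commit linked below for the real implementation):

# One bundle type per problem kind holds every work array a call
# needs, so each API call costs a single Dict lookup + typeassert.
immutable GradientCache{N,T}
    workvec::Vector{T}
    partials::Vector{NTuple{N,T}}
    zeros::Vector{T}
end

const CACHE = Dict{Any,Any}()

function get_cache{N,T}(::Type{Val{N}}, ::Type{T}, len::Int)
    key = (N, T, len)
    haskey(CACHE, key) && return CACHE[key]::Vector{GradientCache{N,T}}
    # one instance per thread, so threads never share work arrays
    caches = [GradientCache{N,T}(Array(T, len),
                                 Array(NTuple{N,T}, len),
                                 zeros(T, len)) for _ in 1:Base.Threads.nthreads()]
    CACHE[key] = caches
    return caches
end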


My second comment is that #99 still seems relevant, and with the macros it could now be implemented in a type-stable way. The hardest part would be to figure out a clean API for it.


Regarding SIMD, there have been some discussions about it starting here: eschnett/SIMD.jl#1 (comment). With JuliaLang/julia#14976, SIMD.jl is able to generate very nice code for Vecs, but there are still some problems when these Vecs are put inside a type. ArchRobinson said that he would look a bit at the generated code for this case.

I am willing to put in work to implement this, but it would be good to have some discussions with others to know if this is something more people support.

@KristofferC (Collaborator):

I implemented the first idea in: KristofferC@1a65099

Benchmarks with:

function rosenbrock(x)
    a = one(eltype(x))
    b = 100 * a
    result = zero(eltype(x))
    for i in 1:length(x)-1
        result += (a - x[i])^2 + b*(x[i+1] - x[i]^2)^2
    end
    return result
end
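
(The g in the timings below isn't defined in the comment; presumably it is a gradient closure over rosenbrock, along the lines of this guess:)

g = x -> ForwardDiff.gradient(rosenbrock, x)  # hypothetical; actual definition not shown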

That commit:

julia> @time for i = 1:10^5 g(rand(5)) end
  0.166368 seconds (900.00 k allocations: 39.673 MB, 5.06% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.292666 seconds (900.00 k allocations: 61.035 MB, 2.32% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.624599 seconds (900.00 k allocations: 94.604 MB, 2.74% gc time)

This branch:

julia> @time for i = 1:10^5 g(rand(5)) end
  0.352903 seconds (800.00 k allocations: 35.095 MB, 1.12% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.508113 seconds (800.00 k allocations: 56.458 MB, 1.23% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.986225 seconds (800.00 k allocations: 90.027 MB, 1.27% gc time)

Didn't test it with multithreading yet.

It currently wastes a bit of space (it allocates for all threads even if running single-threaded, and allocates the remainder Partials on all threads), but this was just a performance test.

@jrevels (Member Author) commented on Feb 9, 2016

I liked your idea in #99; I was actually going to eventually use that idea here. Your approach of separating individual threads' cache data also saves NTHREADS-1 lookups compared to what I was doing (just storing a separate cache for each thread).

If you want to implement your cache idea as follow-up PR to this one, that would be awesome (otherwise, I'll incorporate your ideas later).

@jrevels (Member Author) commented on Feb 9, 2016

but there is still some problems when these Vecs are put inside a type. ArchRobinson said that he would look a bit on the generated code for this case.

Makes sense. When I started this refactor, I tried rewriting Partials to hold a Vec instead of an NTuple (It looks like you actually did the same experiment), but didn't see any vectorization happening on Julia v0.5 on my machine. SIMD stuff is tricky.
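
For reference, a hedged sketch of that experiment (assuming SIMD.jl's Vec type; VecPartials is a hypothetical name):

using SIMD
import Base: +

# Partials backed by a SIMD.Vec rather than an NTuple, in the hope
# that arithmetic on the partial components lowers to vector instructions.
immutable VecPartials{N,T}
    values::Vec{N,T}
end

+(a::VecPartials, b::VecPartials) = VecPartials(a.values + b.values)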

if N == L
    body = VEC_MODE_EXPR
else
    nthreads = Threads.nthreads()
Collaborator:

These need to be qualified as Base.Threads, or a using Base.Threads is needed, I think.

@KristofferC (Collaborator):

Thank you for your comments. I have a few busy days coming up, but when I get time I will try to flesh out the cache stuff, unless you get to it before me. I'll open a PR against this branch when something is sort of ready.

I thought a bit about benchmarking as well. Since I am likely to do some benchmarking during the cache work, would it make sense to implement those benchmarks on top of BenchmarkTrackers, similar to how BaseBenchmarks is done? What do you think?

As a side note, I thought I would mention that to me the code is a bit harder to read and reason about now than before. There are a few places like

@generated function call_gradient!{S,N,L,M}(f, output::Vector{S}, x::Vector, ::Type{Val{N}}, ::Type{Val{L}}, ::Type{Val{M}})

where I am having trouble remembering which length or type L and M are, etc. I'm sure I will grok it in time though :)

@KristofferC (Collaborator):

Regarding a global const cache: other packages have run into problems with similar implementations, e.g. jump-dev/Convex.jl#83. I guess the problem is that you never know how the user will call the code.

We could either document the clearcache! function (should probably have a sizeof function for the cache then as well) or we just reset the cache after it reaches a certain threshold.
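
A hedged sketch of the threshold variant (all names hypothetical):

const CACHE = Dict{Any,Any}()
const CACHE_BYTE_LIMIT = 10^8

# Crude size estimate: sum the data bytes of the cached arrays, and
# wipe the whole cache once the budget is exceeded.
cachesize() = isempty(CACHE) ? 0 : sum(sizeof, values(CACHE))
maybe_clearcache!() = cachesize() > CACHE_BYTE_LIMIT && empty!(CACHE)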

@jrevels (Member Author) commented on Feb 11, 2016

As a side note, I thought I would mention that to me the code is a bit harder to read and reason about now than before.

Yeah, there are a lot of mysterious single-letter identifiers at the moment. I'll change the type parameter names so that they reflect their corresponding keywords, which should make it slightly less confusing.

We could either document the clearcache! function (should probably have a sizeof function for the cache then as well) or we just reset the cache after it reaches a certain threshold.

Transparent documentation and a sizeof function are both good ideas, if we end up keeping a global cache as presented here.

I'm just going to save any additional cache refactoring for when the rest of this PR is complete, which will probably take some time, given my current schedule. If #105 is ready by that point, I'll just defer to that PR. Either way, you're gathering some valuable information on where the pain points are, and what the right way to solve them might be. Thanks for looking into this!

@KristofferC (Collaborator):

#105 is shaping up ok. What is left is adding the functionality to explicitly pass a created cache as a keyword argument to the macros.

What do you think about removing the option to set the chunk size? It would make some stuff simpler because then the chunk size is simply a function of the input length.

@jrevels (Member Author) commented on Feb 12, 2016

What do you think about removing the option to set the chunk size?

I've thought about it, but we'd have to be really confident in our chunk-picking heuristic. The right chunk size for any given problem depends not only on the dimension of the input, but on the performance characteristics of the target function itself. If we forced the user to defer to our heuristic with no option to override it, I'd want to make sure that the dependence on the target function is near negligible in the general case.
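
For concreteness, a hedged sketch of the override using this PR's Chunk immutable (the exact call signature is assumed from the PR description, not verified against the final API):

using ForwardDiff

f(x) = sum(abs2, x)
x = rand(100)

ForwardDiff.gradient(f, x)                           # chunk size picked by the heuristic
ForwardDiff.gradient(f, x, ForwardDiff.Chunk{10}())  # user-chosen chunk size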

@mlubin (Contributor) commented on Feb 12, 2016

What do you think about removing the option to set the chunk size?

Given that we have no good sense of how to choose a good chunk size, I think we should leave this as an option.

@KristofferC (Collaborator):

Ok. Makes sense!

@jrevels force-pushed the giant-refactor branch 2 times, most recently from f61f6ec to c0038b9, on February 12, 2016 at 23:07
@KristofferC (Collaborator):

The segfault in the tests on 0.5 goes away for me with --inline=no. Hmm.

Edit: It goes away when doing many things

@jrevels (Member Author) commented on Feb 17, 2016

Edit: It goes away when doing many things

Yeah, it's a finicky one. When I added these tests, I played around for a little bit to try to come up with a minimal repro of the segfault, to no avail. Part of the reason I pushed was to see if anybody else could figure it out 😛 I would guess this bug is similar to JuliaLang/julia#15043, but I'm not totally sure how it actually behaves yet.

IIRC, I found that call_gradient wasn't correctly regenerating for different values of CHUNK. Instead, CHUNK was propagated as a TypeVar even when it should've been bound to an Int value. Sometimes this would result in MethodErrors (e.g. no method matching calc_gradient!(::f, ::Array{Float64, 1}, ::Array{Float64, 1}, ::Type{Val{CHUNK}}, ::Type{Val{10}})) and other times calc_gradient! actually got called, but segfaulted.

Substituting uses of CHUNK with CHUNK + 0 in call_gradient! fixed the problem in some cases by causing the resulting value to be correctly inferenced as an Int, but obviously that's a hack around the underlying problem.

@KristofferC (Collaborator):

Maybe it is worth making an issue upstream?

I played around a bit with trying to reduce it, and if I removed some tests I stopped getting segfaults and instead got a GC corruption crash. It would be interesting to go back before jb/functions and see if that was what made it fail. Since I have no life, maybe I'll do a bisect...

@jrevels force-pushed the giant-refactor branch 5 times, most recently from e69566d to d32c678, on June 15, 2016 at 18:41
@jrevels (Member Author) commented on Jun 15, 2016

Everything here is now done, besides getting tests to pass on Travis - I have no idea what's going on there.

If anybody is curious about benchmarks, here's the difference between some benchmarks run on master and on this branch with Julia v0.4 (the data is located in the benchmark folder so you can play with it yourself):

julia> using JLD, BenchmarkTools

julia> g41 = JLD.load("benchmark/results_julia_v04_forwarddiff_v01.jld", "results");

julia> g42 = JLD.load("benchmark/results_julia_v04_forwarddiff_v02.jld", "results");

julia> compare(a, b, k) = judge(minimum(a[k]), minimum(b[k]))
compare (generic function with 1 method)

julia> compare(g42, g41, "derivative")
6-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "deriv_test_6" => TrialJudgement(-55.51% => improvement)
  "deriv_test_4" => TrialJudgement(-95.39% => improvement)
  "deriv_test_5" => TrialJudgement(-96.40% => improvement)
  "deriv_test_2" => TrialJudgement(-94.97% => improvement)
  "deriv_test_1" => TrialJudgement(-88.74% => improvement)
  "deriv_test_3" => TrialJudgement(-89.70% => improvement)

julia> compare(g42, g41, "gradient")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-21.89% => improvement)
  "ackley" => TrialJudgement(+0.11% => invariant)
  "self_weighted_logit" => TrialJudgement(-36.30% => improvement)

julia> compare(g42, g41, "hessian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-81.98% => improvement)
  "ackley" => TrialJudgement(-80.44% => improvement)
  "self_weighted_logit" => TrialJudgement(-78.50% => improvement)

julia> compare(g42, g41, "jacobian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "trigonometric" => TrialJudgement(-10.58% => improvement)
  "chebyquad" => TrialJudgement(-10.07% => improvement)
  "brown_almost_linear" => TrialJudgement(-9.19% => improvement)

@KristofferC (Collaborator):

That's what you get for using case-insensitive file systems ;) The file is called Partials.jl but you are including partials.jl.

@jrevels (Member Author) commented on Jun 15, 2016

lol, thanks! I never would've realized that the local file's case didn't match the remote file's case:

[screenshot, 2016-06-15 3:46 PM]

@coveralls commented on Jun 15, 2016

Coverage decreased (-10.8%) to 80.484% when pulling 967a31a on giant-refactor into a65ac31 on master.

return ($F){N,R,switch_eltype(C, R)}
end
end
if v"0.4" <= VERSION < v"0.5-"
Contributor:

Please, please, please be more specific with VERSION-dependent code like this.

Successfully merging this pull request may close these issues: "Removing the divisibility restriction on chunk_size."

6 participants