Giant rewrite for Julia v0.5 #102

Merged: 42 commits from giant-refactor into master, Jun 15, 2016

Conversation

@jrevels (Member) commented on Feb 8, 2016

There's been a bunch of discussion between ForwardDiff developers about where this package should go, and with v0.5 on the way, I think now is the right time to start implementing some of these changes. A lot of these changes are breaking and depend on each other, so it's unfortunately going to be one monolithic PR if we want to avoid intermittent breakage.

After this PR, it will be time to upgrade ForwardDiff's version from v0.1.x to v0.2.x. We can maintain a release-0.1 branch for backporting purposes.

This PR is only about 2/3 of the way complete, but aims to fix #36, #88, #92, #94, #96. It paves the way for us to address #83, #89, and #98.

Major Changes:

  • Drop Vector storage for epsilon components. This reduces indirection in Partials code that had to handle multiple container types. I have yet to see a case where Tuple storage wasn't faster due to GC overhead.
  • Consolidate all ForwardDiffNumber types into the new type Dual{N,T}, which is structured similarly to the former GradientNumber. HessianNumber and TensorNumber have been replaced by the ability to nest Duals. This allows for Tuple storage of higher-order partial components, cuts out a lot of code, and should allow for more cache-friendly higher-order API methods (since indexing patterns will be more straightforward).
  • All @generated functions have been removed, and the API has been simplified by introducing the Chunk immutable and allowing subtypes of ForwardDiffResults to be passed as mutable arguments to the API functions. The documentation will explain this in more detail, once I write it.
  • Remove the function-generating API functions. I think that it's now easier and more transparent for people to define their own closures, e.g. j(x) = ForwardDiff.jacobian(f, x).
  • The code is now generic enough that higher-order/higher-dimensional derivatives can be written using lower-order API functions. Thus, tensor/tensor! has been removed, but is still easily implementable by users (see the sketch following this list).
  • Experimental multithreading support for parallel chunk-mode can be enabled on some API functions by passing in multithread = true. I'm getting a 2x speed up vs. the single-threaded implementation when using 4 threads to take the gradient of rosenbrock with large (> 10,000 elements) input vectors. I haven't benchmarked enough to know how this scales, though I'd expect it to asymptote to N-times speed-up for N threads as the input length goes to infinity.
  • The caching layer is thread-safe for multithreaded API functions, and now features some optimizations proposed by @KristofferC to reduce the number of key lookups per API function call.
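
For illustration of the higher-order point above, here is a hedged sketch (assuming the post-rewrite API described in this list; tensor_like is a hypothetical user-defined name, not part of the package) of recovering tensor-style third-order derivatives by nesting lower-order calls:

using ForwardDiff

# Third-order derivatives of a scalar-valued f, obtained by
# differentiating the Hessian once more. The result is an N^2-by-N
# matrix holding the flattened third-order tensor.
tensor_like(f, x) = ForwardDiff.jacobian(y -> ForwardDiff.hessian(f, y), x)

f(x) = sum(x.^3)
tensor_like(f, rand(3))  # 9x3 matrix of third derivatives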

TODO:

  • jacobian/jacobian!
  • hessian/hessian!
  • API Tests
  • Do some benchmarking. From my naive benchmarks, this branch's gradient methods perform better than master w.r.t. both time and memory when using Julia v0.4, and way better when using Julia v0.5. I'd like to get a similar speed-up with the other API methods, which I've yet to re-implement.
  • Update documentation, and provide a guide for upgrading from v0.1.x to v0.2.x

Major Changes:

- Drop Vector storage for Partials. This reduces indirection in code that had to handle multiple container types. I have yet to see a case where Tuple storage wasn't faster due to GC overhead.

- Consolidate all ForwardDiffNumber types into the new type DiffNumber, which is structured like a GradientNumber. HessianNumber and TensorNumber have been replaced by the ability to nest DiffNumbers. This allows for Tuple storage of higher-order partial components, and cuts out a lot of code.

- API functions have been replaced by macros to allow type stable passage of constant kwargs. This includes some small breaking changes. For example, the target function is ALWAYS the first argument to the new API functions (to maintain consistency with other higher-order mutation functions in Base, like `map!`). Additionally, some kwarg names have changed.

- experimental multithreading support can be enabled on some API functions by passing in `multithread = true`

- the caching layer has been simplified, and is now thread-safe.
@@ -1,6 +1,5 @@
language: julia
julia:
- 0.4
Contributor:

Is this branch dropping support for 0.4? Why?

Member Author (@jrevels):

Threads doesn't exist in 0.4, so trying to load the package will fail. I guess I could wrap all the multithreading code in conditionals against VERSION.
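
A minimal sketch of such a gate (the fallback stub below is hypothetical, not this PR's actual code):

# Only pull in Base.Threads on Julia versions that ship it; on 0.4,
# provide a stub so downstream code can call nthreads() uniformly.
if VERSION >= v"0.5-"
    using Base.Threads
else
    nthreads() = 1
end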

Contributor:

Yes please, otherwise JuMP (and anyone who has JuMP installed) will be stuck on ForwardDiff 0.1 for a long time.

@KristofferC (Collaborator):

Cool stuff! I am impressed you managed to reduce the different Number types to one.

Some thoughts about the cache implementation. From previous experience, I have found that caching arrays in a Dict is in some cases just too slow. Especially now, with the removal of the vector-backed gradient numbers, the cached vectors will tend to be small, so the Dict overhead will be significant.

One possible idea to improve performance of the cache without any impact on the API:

Instead of having the cache hold arrays, we let it hold types that contain as many arrays as are needed for the problem (jacobian, gradient, etc.). The accesses to the dict (get_workvec!, get_partials!, get_zeros!) would collapse into one call to the dict, i.e. get_cache. This would look in the cache to see if a type for the combination of relevant parameters is stored. If not, a vector of new instances of the needed type is created, stored, and returned (a vector of the same length as the number of threads). This type would contain all the needed arrays (workvec, partials, zeros), which can simply be accessed as fields for free.

This requires a bit of extra code to implement the different caches but would significantly reduce the number of typeasserts and calls to the dict needed.
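
A hedged sketch of that layout (all names hypothetical; see the commit linked below for the real implementation):

# One bundle type per problem kind holds every work array a call
# needs, so each API call costs a single Dict lookup + typeassert.
immutable GradientCache{N,T}
    workvec::Vector{T}
    partials::Vector{NTuple{N,T}}
    zeros::Vector{T}
end

const CACHE = Dict{Any,Any}()

function get_cache{N,T}(::Type{Val{N}}, ::Type{T}, len::Int)
    key = (N, T, len)
    haskey(CACHE, key) && return CACHE[key]::Vector{GradientCache{N,T}}
    # one instance per thread, so threads never share work arrays
    caches = [GradientCache{N,T}(Array(T, len),
                                 Array(NTuple{N,T}, len),
                                 zeros(T, len)) for _ in 1:Base.Threads.nthreads()]
    CACHE[key] = caches
    return caches
end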


My second comment is that #99 still seems relevant, and with the macros it could now be implemented in a type-stable way. The hardest part would be to figure out a clean API for it.


Regarding SIMD, there have been some discussions about it starting here: eschnett/SIMD.jl#1 (comment). With JuliaLang/julia#14976, SIMD.jl is able to generate very nice code for Vecs, but there are still some problems when these Vecs are put inside a type. ArchRobinson said that he would look a bit at the generated code for this case.

I am willing to put in work to implement this, but it would be good to have some discussions with others to know if this is something more people support.

@KristofferC (Collaborator):

I implemented the first idea in: KristofferC@1a65099

Benchmarks with:

function rosenbrock(x)
    a = one(eltype(x))
    b = 100 * a
    result = zero(eltype(x))
    for i in 1:length(x)-1
        result += (a - x[i])^2 + b*(x[i+1] - x[i]^2)^2
    end
    return result
end
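
(The g in the timings below isn't defined in the comment; presumably it is a gradient closure over rosenbrock, along the lines of this guess:)

g = x -> ForwardDiff.gradient(rosenbrock, x)  # hypothetical; actual definition not shown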

That commit:

julia> @time for i = 1:10^5 g(rand(5)) end
  0.166368 seconds (900.00 k allocations: 39.673 MB, 5.06% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.292666 seconds (900.00 k allocations: 61.035 MB, 2.32% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.624599 seconds (900.00 k allocations: 94.604 MB, 2.74% gc time)

This branch:

julia> @time for i = 1:10^5 g(rand(5)) end
  0.352903 seconds (800.00 k allocations: 35.095 MB, 1.12% gc time)

julia> @time for i = 1:10^5 g(rand(20)) end
  0.508113 seconds (800.00 k allocations: 56.458 MB, 1.23% gc time)

julia> @time for i = 1:10^5 g(rand(35)) end
  0.986225 seconds (800.00 k allocations: 90.027 MB, 1.27% gc time)

Didn't test it with multithreading yet.

It currently wastes a bit of space (it allocates for all threads even if running single-threaded, and allocates the remainder Partials on all threads), but this was just a performance test.

@jrevels (Member Author) commented on Feb 9, 2016

I liked your idea in #99; I was actually going to eventually use that idea here. Your approach of separating individual threads' cache data also saves NTHREADS-1 lookups compared to what I was doing (just storing a separate cache for each thread).

If you want to implement your cache idea as follow-up PR to this one, that would be awesome (otherwise, I'll incorporate your ideas later).

@jrevels (Member Author) commented on Feb 9, 2016

but there is still some problems when these Vecs are put inside a type. ArchRobinson said that he would look a bit on the generated code for this case.

Makes sense. When I started this refactor, I tried rewriting Partials to hold a Vec instead of an NTuple (It looks like you actually did the same experiment), but didn't see any vectorization happening on Julia v0.5 on my machine. SIMD stuff is tricky.
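
For reference, a hedged sketch of that experiment (assuming SIMD.jl's Vec type; VecPartials is a hypothetical name):

using SIMD
import Base: +

# Partials backed by a SIMD.Vec rather than an NTuple, in the hope
# that arithmetic on the partial components lowers to vector instructions.
immutable VecPartials{N,T}
    values::Vec{N,T}
end

+(a::VecPartials, b::VecPartials) = VecPartials(a.values + b.values)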

if N == L
    body = VEC_MODE_EXPR
else
    nthreads = Threads.nthreads()
Collaborator:

These need to be qualified as Base.Threads, or a using Base.Threads is needed, I think.

@KristofferC (Collaborator):

Thank you for your comments. I have a few busy days coming up, but when I get time I will try to flesh out the cache stuff, unless you get to it before me. I'll open a PR against this branch when something is sort of ready.

I thought a bit about benchmarking as well. Since I am likely to do some benchmarking during the cache work, would it make sense to implement those benchmarks on top of BenchmarkTrackers, similar to how BaseBenchmarks is done? What do you think?

As a side note, I thought I would mention that to me the code is a bit harder to read and reason about now than before. There are a few places like

@generated function call_gradient!{S,N,L,M}(f, output::Vector{S}, x::Vector, ::Type{Val{N}}, ::Type{Val{L}}, ::Type{Val{M}})

where I am having trouble remembering which length or type L and M are, etc. I'm sure I will grok it in time though :)

@KristofferC (Collaborator):

Regarding a global const cache: other packages have run into problems with similar implementations, e.g. jump-dev/Convex.jl#83. I guess the problem is that you never know how the user will call the code.

We could either document the clearcache! function (should probably have a sizeof function for the cache then as well) or we just reset the cache after it reaches a certain threshold.
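
A hedged sketch of the threshold variant (all names hypothetical):

const CACHE = Dict{Any,Any}()
const CACHE_BYTE_LIMIT = 10^8

# Crude size estimate: sum the data bytes of the cached arrays, and
# wipe the whole cache once the budget is exceeded.
cachesize() = isempty(CACHE) ? 0 : sum(sizeof, values(CACHE))
maybe_clearcache!() = cachesize() > CACHE_BYTE_LIMIT && empty!(CACHE)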

@jrevels (Member Author) commented on Feb 11, 2016

As a side note, I thought I would mention that to me the code is a bit harder to read and reason about now than before.

Yeah, there are a lot of mysterious single-letter identifiers at the moment. I'll change the type parameter names so that they reflect their corresponding keywords, which should make it slightly less confusing.

We could either document the clearcache! function (should probably have a sizeof function for the cache then as well) or we just reset the cache after it reaches a certain threshold.

Transparent documentation and a sizeof function are both good ideas, if we end up keeping a global cache as presented here.

I'm just going to save any additional cache refactoring for when the rest of this PR is complete, which will probably take some time, given my current schedule. If #105 is ready by that point, I'll just defer to that PR. Either way, you're gathering some valuable information on where the pain points are, and what the right way to solve them might be. Thanks for looking into this!

@KristofferC (Collaborator):

#105 is shaping up ok. What is left is adding the functionality to explicitly pass a created cache as a keyword argument to the macros.

What do you think about removing the option to set the chunk size? It would make some stuff simpler because then the chunk size is simply a function of the input length.

@jrevels (Member Author) commented on Feb 12, 2016

What do you think about removing the option to set the chunk size?

I've thought about it, but we'd have to be really confident in our chunk-picking heuristic. The right chunk size for any given problem depends not only on the dimension of the input, but on the performance characteristics of the target function itself. If we forced the user to defer to our heuristic with no option to override it, I'd want to make sure that the dependence on the target function is near negligible in the general case.
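
For concreteness, a hedged sketch of the override using this PR's Chunk immutable (the exact call signature is assumed from the PR description, not verified against the final API):

using ForwardDiff

f(x) = sum(abs2, x)
x = rand(100)

ForwardDiff.gradient(f, x)                           # chunk size picked by the heuristic
ForwardDiff.gradient(f, x, ForwardDiff.Chunk{10}())  # user-chosen chunk size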

@mlubin (Contributor) commented on Feb 12, 2016

What do you think about removing the option to set the chunk size?

Given that we have no good sense of how to choose a good chunk size, I think we should leave this as an option.

@KristofferC (Collaborator):

Ok. Makes sense!

@jrevels force-pushed the giant-refactor branch 2 times, most recently from f61f6ec to c0038b9, on February 12, 2016 at 23:07
@KristofferC (Collaborator):

The segfault in the tests on 0.5 goes away for me with --inline=no. Hmm.

Edit: It goes away when doing many things

@jrevels (Member Author) commented on Feb 17, 2016

Edit: It goes away when doing many things

Yeah, it's a finicky one. When I added these tests, I played around for a little bit to try to come up with a minimal repro of the segfault, to no avail. Part of the reason I pushed was to see if anybody else could figure it out 😛 I would guess this bug is similar to JuliaLang/julia#15043, but I'm not totally sure how it actually behaves yet.

IIRC, I found that call_gradient wasn't correctly regenerating for different values of CHUNK. Instead, CHUNK was propagated as a TypeVar even when it should've been bound to an Int value. Sometimes this would result in MethodErrors (e.g. no method matching calc_gradient!(::f, ::Array{Float64, 1}, ::Array{Float64, 1}, ::Type{Val{CHUNK}}, ::Type{Val{10}})) and other times calc_gradient! actually got called, but segfaulted.

Substituting uses of CHUNK with CHUNK + 0 in call_gradient! fixed the problem in some cases by causing the resulting value to be correctly inferenced as an Int, but obviously that's a hack around the underlying problem.

@KristofferC (Collaborator):

Maybe it is worth making an issue upstream?

I played around a bit with trying to reduce it, and if I removed some tests I stopped getting segfaults and instead got a GC corruption crash. It would be interesting to go back before jb/functions and see if that was what made it fail. Since I have no life, maybe I'll do a bisect...

@jrevels force-pushed the giant-refactor branch 5 times, most recently from e69566d to d32c678, on June 15, 2016 at 18:41
@jrevels (Member Author) commented on Jun 15, 2016

Everything here is now done, besides getting tests to pass on Travis - I have no idea what's going on there.

If anybody is curious about benchmarks, here's the difference between some benchmarks run on master and on this branch with Julia v0.4 (the data is located in the benchmark folder so you can play with it yourself):

julia> using JLD, BenchmarkTools

julia> g41 = JLD.load("benchmark/results_julia_v04_forwarddiff_v01.jld", "results");

julia> g42 = JLD.load("benchmark/results_julia_v04_forwarddiff_v02.jld", "results");

julia> compare(a, b, k) = judge(minimum(a[k]), minimum(b[k]))
compare (generic function with 1 method)

julia> compare(g42, g41, "derivative")
6-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "deriv_test_6" => TrialJudgement(-55.51% => improvement)
  "deriv_test_4" => TrialJudgement(-95.39% => improvement)
  "deriv_test_5" => TrialJudgement(-96.40% => improvement)
  "deriv_test_2" => TrialJudgement(-94.97% => improvement)
  "deriv_test_1" => TrialJudgement(-88.74% => improvement)
  "deriv_test_3" => TrialJudgement(-89.70% => improvement)

julia> compare(g42, g41, "gradient")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-21.89% => improvement)
  "ackley" => TrialJudgement(+0.11% => invariant)
  "self_weighted_logit" => TrialJudgement(-36.30% => improvement)

julia> compare(g42, g41, "hessian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "rosenbrock" => TrialJudgement(-81.98% => improvement)
  "ackley" => TrialJudgement(-80.44% => improvement)
  "self_weighted_logit" => TrialJudgement(-78.50% => improvement)

julia> compare(g42, g41, "jacobian")
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "trigonometric" => TrialJudgement(-10.58% => improvement)
  "chebyquad" => TrialJudgement(-10.07% => improvement)
  "brown_almost_linear" => TrialJudgement(-9.19% => improvement)

@KristofferC (Collaborator):

That's what you get for using case-insensitive file systems ;) The file is called Partials.jl but you are including partials.jl.

@jrevels (Member Author) commented on Jun 15, 2016

lol, thanks! I never would've realized that the local file's case didn't match the remote file's case:

[screenshot, 2016-06-15 3:46 PM]

@coveralls commented on Jun 15, 2016

Coverage decreased (-10.8%) to 80.484% when pulling 967a31a on giant-refactor into a65ac31 on master.

return ($F){N,R,switch_eltype(C, R)}
end
end
if v"0.4" <= VERSION < v"0.5-"
Contributor:

Please, please, please be more specific with VERSION-dependent code like this.

Successfully merging this pull request may close these issues: "Removing the divisibility restriction on chunk_size."

6 participants