Introduce sciml_train #125
Conversation
Needs to get tested and all of that. All of our tests should switch over to this function and style, and the README should make use of it as well.
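For orientation, here is a minimal sketch of what a test or README example calling `sciml_train` might look like. The toy two-parameter fitting problem, the callback signature, and the exact keyword names (`cb`, `maxiters`) are illustrative assumptions, not something this diff pins down:

```julia
using DiffEqFlux, OrdinaryDiffEq, Flux

# Hypothetical two-parameter fitting problem, purely for illustration.
function lotka!(du, u, p, t)
    du[1] =  p[1] * u[1] - u[1] * u[2]
    du[2] = -p[2] * u[2] + u[1] * u[2]
end
u0 = [1.0, 1.0]
prob = ODEProblem(lotka!, u0, (0.0, 10.0), [1.5, 3.0])
data = Array(solve(prob, Tsit5(), saveat = 0.1))

# The loss returns (loss, extras...) so a callback can inspect the prediction.
function loss(p)
    pred = Array(solve(prob, Tsit5(), p = p, saveat = 0.1))
    sum(abs2, data .- pred), pred
end

cb = (p, l, pred) -> (println(l); false)  # returning false keeps iterating

# Assumed call pattern: loss, initial parameters, optimizer, then keywords.
res = DiffEqFlux.sciml_train(loss, [1.2, 2.7], ADAM(0.05), cb = cb, maxiters = 100)
```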
Codecov Report

```
@@            Coverage Diff             @@
##           master     #125      +/-   ##
==========================================
+ Coverage   72.91%   78.87%   +5.95%
==========================================
  Files           2        3       +1
  Lines          48       71      +23
==========================================
+ Hits           35       56      +21
- Misses         13       15       +2
```

Continue to review the full report at Codecov.
@pkofod, is setting up the output with Optim's type a good idea? Also, is there a reason why the initial stepnorm is so sensitive?
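For reference, returning Optim's result type would mean users interact with the output through Optim's usual accessors. A rough sketch against a stand-in objective (the Rosenbrock function here is just an illustration, not part of this PR):

```julia
using Optim

# Stand-in objective, purely for illustration.
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
res = optimize(rosenbrock, zeros(2), BFGS())

# Accessors a sciml_train result would inherit if it reuses Optim's type:
Optim.minimizer(res)   # best parameters found
Optim.minimum(res)     # objective value at that point
Optim.converged(res)   # whether a convergence criterion was met
Optim.iterations(res)  # iteration count, as in the reports below
```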
From the tests:

TrackerAdjoint with ADAM:
* Status: failure (reached maximum number of iterations)
* Candidate solution
Minimizer: [1.90e+00, 1.89e+00, 8.78e-01, ...]
Minimum: 6.817536e-03
* Found with
Algorithm: ADAM
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = NaN ≰ 0.0e+00
|x - x'|/|x'| = NaN ≰ 0.0e+00
|f(x) - f(x')| = NaN ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
|g(x)| = NaN ≰ 0.0e+00
* Work counters
Seconds run: 3 (vs limit Inf)
Iterations: 100
f(x) calls: 100
∇f(x) calls: 100

TrackerAdjoint with BFGS:
* Status: failure (objective increased between iterations) (line search failed)
* Candidate solution
Minimizer: [1.96e+00, 1.96e+00, 1.70e+00, ...]
Minimum: 2.729345e-08
* Found with
Algorithm: BFGS
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = 2.28e-06 ≰ 0.0e+00
|x - x'|/|x'| = 1.16e-06 ≰ 0.0e+00
|f(x) - f(x')| = 2.73e-14 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 1.00e-06 ≰ 0.0e+00
|g(x)| = 1.81e-03 ≰ 1.0e-08
* Work counters
Seconds run: 2 (vs limit Inf)
Iterations: 11
f(x) calls: 79
∇f(x) calls: 79

ForwardDiffSensitivity with ADAM:
* Status: failure (reached maximum number of iterations)
* Candidate solution
Minimizer: [1.75e+00, 1.72e+00, 1.18e+00, ...]
Minimum: 3.778992e-03
* Found with
Algorithm: ADAM
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = NaN ≰ 0.0e+00
|x - x'|/|x'| = NaN ≰ 0.0e+00
|f(x) - f(x')| = NaN ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
|g(x)| = NaN ≰ 0.0e+00
* Work counters
Seconds run: 1 (vs limit Inf)
Iterations: 100
f(x) calls: 100
∇f(x) calls: 100

ForwardDiffSensitivity with BFGS:
* Status: success
* Candidate solution
Minimizer: [1.85e+00, 1.85e+00, 1.22e+00, ...]
Minimum: 5.315246e-22
* Found with
Algorithm: BFGS
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = 1.67e-09 ≰ 0.0e+00
|x - x'|/|x'| = 9.02e-10 ≰ 0.0e+00
|f(x) - f(x')| = 5.78e-17 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 1.09e+05 ≰ 0.0e+00
|g(x)| = 5.25e-11 ≤ 1.0e-08
* Work counters
Seconds run: 0 (vs limit Inf)
Iterations: 10
f(x) calls: 35
∇f(x) calls: 35

Adjoints with ADAM:
* Status: failure (reached maximum number of iterations)
* Candidate solution
Minimizer: [1.90e+00, 1.89e+00, 8.78e-01, ...]
Minimum: 6.817536e-03
* Found with
Algorithm: ADAM
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = NaN ≰ 0.0e+00
|x - x'|/|x'| = NaN ≰ 0.0e+00
|f(x) - f(x')| = NaN ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = NaN ≰ 0.0e+00
|g(x)| = NaN ≰ 0.0e+00
* Work counters
Seconds run: 3 (vs limit Inf)
Iterations: 100
f(x) calls: 100
∇f(x) calls: 100

Adjoints with BFGS:
* Status: failure (objective increased between iterations) (line search failed)
* Candidate solution
Minimizer: [1.96e+00, 1.96e+00, 1.70e+00, ...]
Minimum: 2.729345e-08
* Found with
Algorithm: BFGS
Initial Point: [2.20e+00, 1.00e+00, 2.00e+00, ...]
* Convergence measures
|x - x'| = 2.28e-06 ≰ 0.0e+00
|x - x'|/|x'| = 1.16e-06 ≰ 0.0e+00
|f(x) - f(x')| = 2.73e-14 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 1.00e-06 ≰ 0.0e+00
|g(x)| = 1.81e-03 ≰ 1.0e-08
* Work counters
Seconds run: 2 (vs limit Inf)
Iterations: 11
f(x) calls: 79
∇f(x) calls: 79

Conclusion: BFGS consistently reaches a minimum about five orders of magnitude lower than ADAM's, with less effort.
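As a rough sketch of how the two optimizers compared above might be driven through `sciml_train` (the warm-start pattern, the field access on the result, and the `initial_stepnorm` value are assumptions for illustration):

```julia
using DiffEqFlux, Flux, Optim

# `loss` and `p0` stand for any (loss, pred...)-returning objective and an
# initial parameter vector, as in the sketch near the top of this thread.
res_adam = DiffEqFlux.sciml_train(loss, p0, ADAM(0.05), maxiters = 100)

# Polish with BFGS starting from ADAM's result; initial_stepnorm reins in the
# first step (see the stepnorm discussion below).
res_bfgs = DiffEqFlux.sciml_train(loss, res_adam.minimizer, BFGS(initial_stepnorm = 0.01))
```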
```julia
using DiffEqFlux, OrdinaryDiffEq, Optim, Flux, Zygote, BenchmarkTools, Test

u0 = Float32[2.; 0.]
datasize = 30
tspan = (0.0f0, 1.5f0)

# Generate training data from the true ODE
function trueODEfunc(du, u, p, t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end
t = range(tspan[1], tspan[2], length = datasize)
prob = ODEProblem(trueODEfunc, u0, tspan)
ode_data = Array(solve(prob, Tsit5(), saveat = t))

# FastChain-based neural ODE
fastdudt2 = FastChain((x, p) -> x.^3, FastDense(2, 50, tanh), FastDense(50, 2))
p = initial_params(fastdudt2)
fast_n_ode = NeuralODE(fastdudt2, tspan, Tsit5(), saveat = t)
function fast_predict_n_ode(p)
    fast_n_ode(u0, p)
end
function fast_loss_n_ode(p)
    pred = fast_predict_n_ode(p)
    loss = sum(abs2, ode_data .- pred)
    loss, pred
end

# Flux Chain-based neural ODE
dudt2 = Chain((x) -> x.^3, Dense(2, 50, tanh), Dense(50, 2))
n_ode = NeuralODE(dudt2, tspan, Tsit5(), saveat = t)
function predict_n_ode(p)
    n_ode(u0, p)
end
function loss_n_ode(p)
    pred = predict_n_ode(p)
    loss = sum(abs2, ode_data .- pred)
    loss, pred
end

# Check that the fast layers match the Flux layers and their gradients
_p, re = Flux.destructure(dudt2)
@test fastdudt2(ones(2), _p) ≈ dudt2(ones(2))
@test fast_loss_n_ode(p)[1] ≈ loss_n_ode(p)[1]
@test Zygote.gradient((p) -> fast_loss_n_ode(p)[1], p)[1] ≈
      Zygote.gradient((p) -> loss_n_ode(p)[1], p)[1]

@btime Zygote.gradient((p) -> fast_loss_n_ode(p)[1], p)
@btime Zygote.gradient((p) -> fast_loss_n_ode(p)[1], p)
@btime Zygote.gradient((p) -> loss_n_ode(p)[1], p)
@btime Zygote.gradient((p) -> loss_n_ode(p)[1], p)
```

```
27.272 ms (181318 allocations: 16.54 MiB)
27.328 ms (181318 allocations: 16.54 MiB)
262.430 ms (677868 allocations: 32.83 MiB)
260.814 ms (677868 allocations: 32.83 MiB)
```

An order of magnitude performance improvement over using Flux for the neural networks.
implement fast versions of Flux
Wrt this: before the second iteration we have no second-order information (unless it is given), so the Hessian approximation is just the identity, and the first step is a full gradient step.
From what I can tell, the initial stepnorm almost always needs to be restricted. Would it be good to change the default to 0.01?
You could. There have been various suggestions, but most descriptions assume that the initial step is the full gradient. Alternatively, you can specify a preconditioner, supply curvature information through the initial inverse Hessian approximation, or, yeah, set an initial step norm. I never really benchmarked it; I only added the option because we looked at it in the Pumas context :)
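Concretely, the three options mentioned here map onto Optim.jl keyword arguments roughly as follows; the 0.01 value, the scaled-identity inverse Hessian, and the hard-coded diagonal preconditioner are illustrative assumptions:

```julia
using Optim, LinearAlgebra

# (1) Restrict the size of the very first step; the initial identity is
#     rescaled so the first step has roughly this norm.
BFGS(initial_stepnorm = 0.01)

# (2) Supply curvature information through the initial inverse Hessian
#     approximation (given as a function of the initial point).
BFGS(initial_invH = x -> Matrix(0.01I, length(x), length(x)))

# (3) Use a preconditioner instead (supported by e.g. LBFGS); the dimension
#     here is hard-coded for a 2-parameter problem.
LBFGS(P = Diagonal(fill(100.0, 2)))
```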
This is a starter PR for students interested in solving #120