Introduce sciml_train #125

Merged: 16 commits merged into master from the sciml branch on Feb 2, 2020
Conversation

ChrisRackauckas
Member

This is a starter PR for students interested in solving #120

@ChrisRackauckas
Member Author

This still needs tests and related cleanup. All of our tests should switch over to this function and style, and the README examples should use it as well.
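
As a rough illustration of what that migration would look like, here is a minimal usage sketch. It assumes the interface as it later appears in the README, `sciml_train(loss, p, opt; cb, maxiters)`, where `loss` returns the loss value followed by anything else the callback should see, and the return value is an Optim-style result with `minimizer` and `minimum` fields; treat the exact keywords as provisional while this PR is still settling.

```julia
# Hypothetical usage sketch, not code from this PR. `loss_n_ode` and `p` refer to
# the loss function and flat parameter vector from the test script further down.
using DiffEqFlux, Flux

callback = function (p, loss, pred)
    @show loss
    return false   # return true to stop training early
end

res = DiffEqFlux.sciml_train(loss_n_ode, p, ADAM(0.05),
                             cb = callback, maxiters = 100)

res.minimizer   # optimized parameters
res.minimum     # final loss
```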

@codecov

codecov bot commented Feb 2, 2020

Codecov Report

Merging #125 into master will increase coverage by 5.95%.
The diff coverage is 93.54%.


```
@@            Coverage Diff             @@
##           master     #125      +/-   ##
==========================================
+ Coverage   72.91%   78.87%   +5.95%     
==========================================
  Files           2        3       +1     
  Lines          48       71      +23     
==========================================
+ Hits           35       56      +21     
- Misses         13       15       +2
```

| Impacted Files | Coverage Δ |
| --- | --- |
| src/DiffEqFlux.jl | 23.07% <ø> (ø) ⬆️ |
| src/neural_de.jl | 91.42% <100%> (ø) ⬆️ |
| src/train.jl | 91.3% <91.3%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ChrisRackauckas
Member Author

ChrisRackauckas commented Feb 2, 2020

@pkofod, is returning the output as Optim's result type a good idea? Also, is there a reason why the initial stepnorm is so sensitive?

@ChrisRackauckas
Member Author

ChrisRackauckas commented Feb 2, 2020

From the tests:

TrackerAdjoint with ADAM:

 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Minimizer: [1.90e+00, 1.89e+00, 8.78e-01,  ...]
    Minimum:   6.817536e-03

 * Found with
    Algorithm:     ADAM
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = NaN  0.0e+00
    |x - x'|/|x'|          = NaN  0.0e+00
    |f(x) - f(x')|         = NaN  0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN  0.0e+00
    |g(x)|                 = NaN  0.0e+00

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    100
    f(x) calls:    100
    ∇f(x) calls:   100

TrackerAdjoint with BFGS (Optim):

 * Status: failure (objective increased between iterations) (line search failed)

 * Candidate solution
    Minimizer: [1.96e+00, 1.96e+00, 1.70e+00,  ...]
    Minimum:   2.729345e-08

 * Found with
    Algorithm:     BFGS
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = 2.28e-06  0.0e+00
    |x - x'|/|x'|          = 1.16e-06  0.0e+00
    |f(x) - f(x')|         = 2.73e-14  0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.00e-06  0.0e+00
    |g(x)|                 = 1.81e-03  1.0e-08

 * Work counters
    Seconds run:   2  (vs limit Inf)
    Iterations:    11
    f(x) calls:    79
    ∇f(x) calls:   79

ForwardDiffSensitivity with ADAM

 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Minimizer: [1.75e+00, 1.72e+00, 1.18e+00,  ...]
    Minimum:   3.778992e-03

 * Found with
    Algorithm:     ADAM
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = NaN  0.0e+00
    |x - x'|/|x'|          = NaN  0.0e+00
    |f(x) - f(x')|         = NaN  0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN  0.0e+00
    |g(x)|                 = NaN  0.0e+00

 * Work counters
    Seconds run:   1  (vs limit Inf)
    Iterations:    100
    f(x) calls:    100
    ∇f(x) calls:   100

ForwardDiffSensitivity with BFGS:

 * Status: success

 * Candidate solution
    Minimizer: [1.85e+00, 1.85e+00, 1.22e+00,  ...]
    Minimum:   5.315246e-22

 * Found with
    Algorithm:     BFGS
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = 1.67e-09  0.0e+00
    |x - x'|/|x'|          = 9.02e-10  0.0e+00
    |f(x) - f(x')|         = 5.78e-17  0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.09e+05  0.0e+00
    |g(x)|                 = 5.25e-11  1.0e-08

 * Work counters
    Seconds run:   0  (vs limit Inf)
    Iterations:    10
    f(x) calls:    35
    ∇f(x) calls:   35

Adjoints with ADAM

 * Status: failure (reached maximum number of iterations)

 * Candidate solution
    Minimizer: [1.90e+00, 1.89e+00, 8.78e-01,  ...]
    Minimum:   6.817536e-03

 * Found with
    Algorithm:     ADAM
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = NaN  0.0e+00
    |x - x'|/|x'|          = NaN  0.0e+00
    |f(x) - f(x')|         = NaN  0.0e+00
    |f(x) - f(x')|/|f(x')| = NaN  0.0e+00
    |g(x)|                 = NaN  0.0e+00

 * Work counters
    Seconds run:   3  (vs limit Inf)
    Iterations:    100
    f(x) calls:    100
    ∇f(x) calls:   100

Adjoints with BFGS:

 * Status: failure (objective increased between iterations) (line search failed)

 * Candidate solution
    Minimizer: [1.96e+00, 1.96e+00, 1.70e+00,  ...]
    Minimum:   2.729345e-08

 * Found with
    Algorithm:     BFGS
    Initial Point: [2.20e+00, 1.00e+00, 2.00e+00,  ...]

 * Convergence measures
    |x - x'|               = 2.28e-06  0.0e+00
    |x - x'|/|x'|          = 1.16e-06  0.0e+00
    |f(x) - f(x')|         = 2.73e-14  0.0e+00
    |f(x) - f(x')|/|f(x')| = 1.00e-06  0.0e+00
    |g(x)|                 = 1.81e-03  1.0e-08

 * Work counters
    Seconds run:   2  (vs limit Inf)
    Iterations:    11
    f(x) calls:    79
    ∇f(x) calls:   79

Conclusion: BFGS consistently reaches a loss about five orders of magnitude lower than ADAM, with less work (fewer function and gradient calls).
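
For reference, a rough sketch (not taken from this PR's test files) of how one such ADAM-vs-BFGS run could look through the new interface, again assuming the `sciml_train(loss, p, opt; maxiters)` form; how each sensitivity method (TrackerAdjoint, ForwardDiffSensitivity, adjoints) is selected is part of the test setup and not shown here.

```julia
# Hypothetical comparison sketch; `loss_n_ode` and `p` come from the script below.
# The initial_stepnorm value is an assumption, not necessarily what the runs above used.
using DiffEqFlux, Flux, Optim

res_adam = DiffEqFlux.sciml_train(loss_n_ode, p, ADAM(0.05), maxiters = 100)
res_bfgs = DiffEqFlux.sciml_train(loss_n_ode, p, BFGS(initial_stepnorm = 0.01))

res_adam.minimum   # ~1e-3 in the runs above
res_bfgs.minimum   # ~1e-8 or lower in the runs above
```

The script below is a separate point: it benchmarks the FastChain-based network against the equivalent Flux Chain.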

```julia
using DiffEqFlux, OrdinaryDiffEq, Optim, Flux, Zygote, BenchmarkTools, Test

u0 = Float32[2.; 0.]
datasize = 30
tspan = (0.0f0,1.5f0)

function trueODEfunc(du,u,p,t)
    true_A = [-0.1 2.0; -2.0 -0.1]
    du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
prob = ODEProblem(trueODEfunc,u0,tspan)
ode_data = Array(solve(prob,Tsit5(),saveat=t))

fastdudt2 = FastChain((x,p) -> x.^3,
             FastDense(2,50,tanh),
             FastDense(50,2))
p = initial_params(fastdudt2) # FastChain returns only the layer chain; get the flat parameter vector explicitly
fast_n_ode = NeuralODE(fastdudt2,tspan,Tsit5(),saveat=t)

function fast_predict_n_ode(p)
  fast_n_ode(u0,p)
end

function fast_loss_n_ode(p)
    pred = fast_predict_n_ode(p)
    loss = sum(abs2,ode_data .- pred)
    loss,pred
end

dudt2 = Chain((x) -> x.^3,
             Dense(2,50,tanh),
             Dense(50,2))
n_ode = NeuralODE(dudt2,tspan,Tsit5(),saveat=t)

function predict_n_ode(p)
  n_ode(u0,p)
end

function loss_n_ode(p)
    pred = predict_n_ode(p)
    loss = sum(abs2,ode_data .- pred)
    loss,pred
end

_p,re = Flux.destructure(dudt2)
@test fastdudt2(ones(2),_p) ≈ dudt2(ones(2))
@test fast_loss_n_ode(p)[1] ≈ loss_n_ode(p)[1]
@test Zygote.gradient((p)->fast_loss_n_ode(p)[1], p)[1] ≈ Zygote.gradient((p)->loss_n_ode(p)[1], p)[1]

@btime Zygote.gradient((p)->fast_loss_n_ode(p)[1], p)
@btime Zygote.gradient((p)->fast_loss_n_ode(p)[1], p)
@btime Zygote.gradient((p)->loss_n_ode(p)[1], p)
@btime Zygote.gradient((p)->loss_n_ode(p)[1], p)
```

```
  27.272 ms (181318 allocations: 16.54 MiB)
  27.328 ms (181318 allocations: 16.54 MiB)
  262.430 ms (677868 allocations: 32.83 MiB)
  260.814 ms (677868 allocations: 32.83 MiB)
```

That's an order-of-magnitude performance improvement from the FastChain layers over the equivalent Flux Chain network.
ChrisRackauckas changed the title from [WIP] Introduce sciml_train to Introduce sciml_train on Feb 2, 2020
ChrisRackauckas merged commit 2bd7091 into master on Feb 2, 2020
ChrisRackauckas deleted the sciml branch on February 2, 2020 18:29
@pkofod

pkofod commented Feb 5, 2020

> Also, is there a reason why the initial stepnorm is so sensitive?

Regarding this: the issue is that before the second iteration we have no second-order information (unless it is provided), so the Hessian approximation is just I. This means BFGS will attempt to take the step -gradient, which can take you into funky regions. This is why we also restrict the initial step in Pumas.

@ChrisRackauckas
Member Author

From what I can tell it almost always needs to be restricted. Would it be good to change the default to 0.01?

@pkofod

pkofod commented Feb 5, 2020

You could. There have been various suggestions, but most descriptions assume the initial step is the full gradient. Alternatively, you can specify a preconditioner, supply curvature information through the initial inverse Hessian approximation, or, yes, set an initial step norm. I never really benchmarked it; I only added the option because we looked at it in the Pumas context :)
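
As a concrete illustration of the knob being discussed, here is a small sketch using Optim.jl directly on a toy objective. The `initial_stepnorm` keyword name is assumed from how it is used with `sciml_train` elsewhere; treat the exact spelling as provisional.

```julia
# Minimal sketch of restricting BFGS's first step in Optim.jl.
using Optim

# Toy objective (Rosenbrock), standing in for an ODE-fitting loss.
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Default: the inverse Hessian approximation starts at I, so the first step is
# -gradient, which can be very large when the gradient is large.
res_default = optimize(rosenbrock, zeros(2), BFGS())

# Restricted: rescale the initial inverse Hessian so the first step stays small.
res_restricted = optimize(rosenbrock, zeros(2), BFGS(initial_stepnorm = 0.01))
```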
