Optimizer module overhaul (apache#396)
iblislin authored Jan 31, 2018
1 parent 157b088 commit 9f4f533
Showing 25 changed files with 772 additions and 584 deletions.
100 changes: 100 additions & 0 deletions NEWS.md
@@ -61,6 +61,9 @@
* `Nadam`
* `RMSProp`
* `SGD`
* `getupdater()`
* `normgrad!()`
* `update!()`

* `AbstractDataProvider`
* `AbstractDataBatch`
@@ -344,6 +347,103 @@
Before: `clip(x, a_min = -4, a_max = 4)`
After: `clip(x, -4, 4)`

### Optimizer

We overhauled the optimizer APIs, introducing breaking changes.
There is a lot of renaming, and the aim is to increase flexibility.
The optimizers are now decoupled from the high-level API, so they can be
used without understanding the implementation details of `fit!`.

See #396.

* All the keyword arguments of optimizers have been renamed.
Now we have more elegant keyword arguments than Python's,
thanks to the good Unicode support in Julia's REPL and editor plugins.
*These are breaking changes, with no deprecation warnings.*

| old | new | comment |
|---------------------------|-----------|--------------------------------|
| `opts.lr` | `η` | type `\eta<tab>` in REPL |
| `opts.momentum` | `μ` | type `\mu<tab>` in REPL |
| `opts.grad_clip` | `∇clip` | type `\nabla<tab>clip` in REPL |
| `opts.weight_decay` | `λ` | type `\lambda<tab>` in REPL |
| `opts.lr_scheduler` | `η_sched` | type `\eta<tab>_sched` in REPL |
| `opts.momentum_scheduler` | `μ_sched` | type `\mu<tab>_sched` in REPL |

For instance, one used to access the learning rate via `SGD().opts.lr`;
now it's `SGD().η`.
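
For example, a minimal sketch of constructing an optimizer with the new keywords
and reading the fields back (the values are illustrative):

```julia
using MXNet

# the new Unicode keyword arguments
opt = mx.SGD(η = 0.1, μ = 0.9, λ = 0.00001)

opt.η  # 0.1 -- formerly `opts.lr`
opt.μ  # 0.9 -- formerly `opts.momentum`
```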

* New keyword argument `scale` for gradient rescaling.

Docstring:
```
If != 0, multiply the gradient with `scale` before updating.
Often chosen to be `1.0 / batch_size`.
If left at the default, a high-level API like `fit!` will set it to
`1.0 / batch_size`, since `fit!` knows the `batch_size`.
```
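
When driving an optimizer by hand (i.e. not through `fit!`), `scale` has to be
set explicitly. A minimal sketch, assuming `scale` is accepted as a constructor
keyword as the docstring above suggests:

```julia
using MXNet

batch_size = 64  # illustrative value

# `fit!` is not involved here, so set the gradient rescaling factor ourselves
opt = mx.SGD(η = 0.1, scale = 1.0 / batch_size)
```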

* The keyword arguments of `NadamScheduler` have been renamed.
*This is a breaking change, with no deprecation warning.*

* Before

```julia
NadamScheduler(; mu0 = 0.99, delta = 0.004, gamma = 0.5, alpha = 0.96)
```

* After

```julia
NadamScheduler(; μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)
```
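
A sketch of using the renamed scheduler; the module path
`mx.Momentum.NadamScheduler` and attaching it via the `μ_sched` keyword are
assumptions based on the renaming table above:

```julia
using MXNet

# constructed with the new Unicode keywords
sched = mx.Momentum.NadamScheduler(μ = 0.99, δ = 0.004, γ = 0.5, α = 0.96)

# presumably attached to the Nadam optimizer through `μ_sched`
opt = mx.Nadam(μ_sched = sched)
```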

* The attribute `optimizer.state` is removed.
`OptimizationState` is now only used by high-level abstractions such as `fit!`.

* `LearningRate` scheduler API changes:

* `get_learning_rate` is removed.
Please use `Base.get` to get the learning rate.

```julia
julia> sched = mx.LearningRate.Exp(.1)
MXNet.mx.LearningRate.Exp(0.1, 0.9, 0)
julia> get(sched)
0.1
julia> update!(sched);
julia> get(sched)
0.09000000000000001
```

* `update!` bumps the scheduler's counter `t`:
```julia
julia> sched.t
1
julia> update!(sched);
julia> sched.t
2
julia> update!(sched);
julia> sched.t
3
```
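
A scheduler is attached to an optimizer through the `η_sched` keyword
(see the MNIST test below); high-level APIs such as `fit!` then call `update!`
on it per epoch or per batch:

```julia
using MXNet

# inverse-decay schedule plugged into SGD via the new `η_sched` keyword
opt = mx.SGD(η_sched = mx.LearningRate.Inv(.25))
get(opt.η_sched)  # current learning rate
```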

* `Momentum` module API changes:

* `get_momentum_scheduler` is removed. Please use `Base.get` instead.

```julia
julia> get(mx.Momentum.Fixed(.9))
0.9
```
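
Similarly, a momentum scheduler can presumably be attached via the `μ_sched`
keyword from the renaming table; a sketch, not verified against the final API:

```julia
using MXNet

opt = mx.SGD(μ_sched = mx.Momentum.Fixed(.9))
get(opt.μ_sched)  # 0.9
```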

----

# v0.3.0 (2017.11.16)
3 changes: 2 additions & 1 deletion README.md
@@ -37,7 +37,8 @@ train_provider, eval_provider = get_mnist_providers(batch_size)
model = mx.FeedForward(mlp, context=mx.cpu())

# optimization algorithm
optimizer = mx.SGD(lr=0.1, momentum=0.9)
# where η is the learning rate and μ is the momentum
optimizer = mx.SGD(η=0.1, μ=0.9)

# fit parameters
mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
16 changes: 16 additions & 0 deletions docs/src/api/optimizer.md
@@ -1,5 +1,21 @@
# Optimizers

Say you have a parameter `W` initialized for your model and
its gradient stored as `∇` (perhaps from the AutoGrad APIs).
Here is a minimal snippet showing how to get your parameter `W` updated by `SGD`.

```@repl
using MXNet
opt = SGD(η = 10)
descend! = getupdater(opt)
W = NDArray(Float32[1, 2, 3, 4]);
∇ = NDArray(Float32[.1, .2, .3, .4]);
descend!(1, ∇, W)
```
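
The integer passed to the updater is a parameter index; the updater keeps
per-parameter optimizer state (momentum buffers, etc.) keyed by it. A sketch of
reusing the same updater for a second parameter, continuing the snippet above
(the index semantics are an assumption):

```julia
b  = NDArray(Float32[1, 1]);
∇b = NDArray(Float32[.5, .5]);

# a different index keeps this parameter's state separate from `W`'s
descend!(2, ∇b, b)
```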

```@autodocs
Modules = [MXNet.mx, MXNet.mx.LearningRate, MXNet.mx.Momentum]
Pages = ["optimizer.jl"]
6 changes: 3 additions & 3 deletions docs/src/tutorial/mnist.md
@@ -100,10 +100,10 @@ help.

The last thing we need to specify is the optimization algorithm (a.k.a.
*optimizer*) to use. We use the basic SGD with a fixed learning rate 0.1
and momentum 0.9:
, momentum 0.9 and weight decay 0.00001:

```julia
optimizer = mx.SGD(lr=0.1, momentum=0.9, weight_decay=0.00001)
optimizer = mx.SGD(η=0.1, μ=0.9, λ=0.00001)
```

Now we can do the training. Here the `n_epoch` parameter specifies that
@@ -205,7 +205,7 @@ on GPU, and train it.
model = mx.FeedForward(lenet, context=mx.gpu())

# optimizer
optimizer = mx.SGD(lr=0.05, momentum=0.9, weight_decay=0.00001)
optimizer = mx.SGD(η=0.05, μ=0.9, λ=0.00001)

# fit parameters
mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
4 changes: 2 additions & 2 deletions docs/src/user-guide/overview.md
@@ -147,10 +147,10 @@ macroexpand(:(@mx.inplace a += b))
As we can see, it translate the `+=` operator to an explicit `add_to!`
function call, which invokes into libmxnet to add the contents of `b`
into `a` directly. For example, the following is the update rule in the
`SGD Optimizer` (both `grad` and `weight` are `NDArray` objects):
`SGD Optimizer` (both gradient `∇` and weight `W` are `NDArray` objects):

```julia
@inplace weight += -lr * (grad_scale * grad + self.weight_decay * weight)
@inplace W .+= -η .* (∇ + λ .* W)
```

Note there is no much magic in `mx.inplace`: it only does a shallow
2 changes: 1 addition & 1 deletion examples/char-lstm/train.jl
@@ -34,7 +34,7 @@ end

#--train
model = mx.FeedForward(lstm, context=context)
optimizer = mx.ADAM(lr=BASE_LR, weight_decay=WEIGHT_DECAY, grad_clip=CLIP_GRADIENT)
optimizer = mx.ADAM(η=BASE_LR, λ=WEIGHT_DECAY, ∇clip=CLIP_GRADIENT)

mx.fit(model, optimizer, data_tr, eval_data=data_val, n_epoch=N_EPOCH,
initializer=mx.UniformInitializer(0.1),
2 changes: 1 addition & 1 deletion examples/cifar10/cifar10.jl
@@ -77,7 +77,7 @@ gpus = [mx.Context(mx.GPU, i) for i = 0:num_gpus-1]
model = mx.FeedForward(softmax, context=gpus)

# optimizer
optimizer = mx.SGD(lr=0.05, momentum=0.9, weight_decay=0.0001)
optimizer = mx.SGD(η=0.05, μ=0.9, λ=0.0001)

# fit parameters
mx.fit(model, optimizer, train_provider, n_epoch=num_epoch, eval_data=test_provider,
2 changes: 1 addition & 1 deletion examples/mnist/lenet-stn.jl
@@ -57,7 +57,7 @@ train_provider, eval_provider = get_mnist_providers(batch_size; flat=false)
model = mx.FeedForward(lenet, context=mx.cpu())

# optimizer
optimizer = mx.ADAM(lr=0.01, weight_decay=0.00001)
optimizer = mx.ADAM(η=0.01, λ=0.00001)

# fit parameters
initializer=mx.XavierInitializer(distribution = mx.xv_uniform, regularization = mx.xv_avg, magnitude = 1)
2 changes: 1 addition & 1 deletion examples/mnist/lenet.jl
@@ -39,7 +39,7 @@ train_provider, eval_provider = get_mnist_providers(batch_size; flat=false)
model = mx.FeedForward(lenet, context=mx.gpu())

# optimizer
optimizer = mx.SGD(lr=0.05, momentum=0.9, weight_decay=0.00001)
optimizer = mx.SGD(η=0.05, μ=0.9, λ=0.00001)

# fit parameters
mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
9 changes: 8 additions & 1 deletion examples/mnist/mlp-test.jl
@@ -72,7 +72,14 @@ end

function test_mnist_mlp()
info("MNIST::SGD")
@test mnist_fit_and_predict(mx.SGD(lr=0.1, momentum=0.9), mx.UniformInitializer(0.01), 2) > 90
@test mnist_fit_and_predict(mx.SGD(η=.2), mx.UniformInitializer(.01), 2) > 90

info("MNIST::SGD::η scheduler")
@test mnist_fit_and_predict(mx.SGD(η_sched=mx.LearningRate.Inv(.25)),
mx.UniformInitializer(.01), 2) > 90

info("MNIST::SGD::momentum μ")
@test mnist_fit_and_predict(mx.SGD(η=.1, μ=.9), mx.UniformInitializer(.01), 2) > 90

info("MNIST::ADAM")
@test mnist_fit_and_predict(mx.ADAM(), mx.NormalInitializer(), 2) > 90
2 changes: 1 addition & 1 deletion examples/mnist/mlp.jl
@@ -36,7 +36,7 @@ train_provider, eval_provider = get_mnist_providers(batch_size)
model = mx.FeedForward(mlp, context=mx.cpu())

# optimizer
optimizer = mx.SGD(lr=0.1, momentum=0.9, weight_decay=0.00001)
optimizer = mx.SGD(η=0.1, μ=0.9, λ=0.00001)

# fit parameters
mx.fit(model, optimizer, train_provider, eval_data=eval_provider, n_epoch=20)
2 changes: 1 addition & 1 deletion examples/regression-example.jl
@@ -55,7 +55,7 @@ net = @mx.chain mx.Variable(:data) =>
model = mx.FeedForward(net, context=mx.cpu())

# set up the optimizer: select one, explore parameters, if desired
#optimizer = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.00001)
#optimizer = mx.SGD(η=0.01, μ=0.9, λ=0.00001)
optimizer = mx.ADAM()

# train, reporting loss for training and evaluation sets
5 changes: 4 additions & 1 deletion src/MXNet.jl
@@ -95,7 +95,10 @@ export AbstractOptimizer,
AdaMax,
Nadam,
RMSProp,
SGD
SGD,
getupdater,
normgrad!,
update!

# io.jl
export AbstractDataProvider,
1 change: 1 addition & 0 deletions src/base.jl
@@ -159,6 +159,7 @@ end
# NTuple{N, Int} passed to libmxnet.
#
# TODO: find a better solution in case this causes issues in the future.
# I made `@_remap` in `ndarray.jl`. (Iblis Lin)
################################################################################
dump_mx_param(val::Any) = string(val)
dump_mx_param(val::Float64) = @sprintf("%.16e", val)
2 changes: 1 addition & 1 deletion src/kvstore.jl
@@ -127,6 +127,6 @@ function set_optimizer(self :: KVStore, optimizer :: AbstractOptimizer)
if ismatch(r"dist", string(get_type(self))) && is_worker
# TODO
else
set_updater(self, get_updater(optimizer))
set_updater(self, getupdater(optimizer))
end
end
24 changes: 16 additions & 8 deletions src/model.jl
@@ -286,7 +286,8 @@ end
kvstore :: Union{Symbol, KVStore} = :local,
force_init :: Bool = false,
callbacks :: Vector{AbstractCallback} = AbstractCallback[],
verbosity :: Int = 3
verbosity :: Int = 3,
η_decay :: Symbol = :epoch,
)

function _invoke_callbacks(self::FeedForward, callbacks::Vector{AbstractCallback},
@@ -309,12 +310,11 @@ end
Alias to [`fit`](@ref).
"""
function train(self :: FeedForward, optimizer :: AbstractOptimizer, data :: AbstractDataProvider; kwargs...)
fit(self, optimizer, data; kwargs...)
end
train(m::FeedForward, opt::AbstractOptimizer, data::AbstractDataProvider; kw...) =
fit(m, opt, data; kw...)

"""
fit(model :: FeedForward, optimizer, data; kwargs...)
fit(model::FeedForward, optimizer, data; kwargs...)
Train the `model` on `data` with the `optimizer`.
@@ -343,6 +343,7 @@
- `1`: Print starting and final messages
- `2`: Print one time messages and a message at the start of each epoch
- `3`: Print a summary of the training and validation accuracy for each epoch
* `η_decay::Symbol`: `:epoch` or `:batch`; decay the learning rate at the end of each epoch or after each batch.
"""
function fit(self::FeedForward, optimizer::AbstractOptimizer, data::AbstractDataProvider;
kwargs...)
@@ -418,10 +419,11 @@ function fit(self::FeedForward, optimizer::AbstractOptimizer, data::AbstractData
aux_arrays = [NDArray[exec.aux_arrays[i] for exec in train_execs] for i = 1:length(aux_names)]

op_state = OptimizationState(batch_size)
optimizer.state = op_state
# set up gradient rescaling if the user did not set it
iszero(optimizer.scale) && (optimizer.scale = 1 / batch_size)

if !update_on_kvstore
updater = get_updater(optimizer)
updater = getupdater(optimizer)
end

if !isa(kvstore, Void)
@@ -481,7 +483,6 @@ function fit(self::FeedForward, optimizer::AbstractOptimizer, data::AbstractData

op_state.curr_iter += 1
op_state.curr_batch += 1
optimizer.state = op_state

# update parameters
for idx = 1:length(param_names)
@@ -514,6 +515,9 @@ function fit(self::FeedForward, optimizer::AbstractOptimizer, data::AbstractData
end
end

# trigger learning rate decay
opts.η_decay == :batch && update!(optimizer.η_sched)

# invoke callbacks after finishing each iteration
_invoke_callbacks(self, opts.callbacks, op_state, AbstractBatchCallback)

@@ -577,6 +581,10 @@ function fit(self::FeedForward, optimizer::AbstractOptimizer, data::AbstractData
copy!(self.aux_params[name], aux_avg)
end
end

# trigger learning rate decay
opts.η_decay == :epoch && update!(optimizer.η_sched)

_invoke_callbacks(self, opts.callbacks, op_state, AbstractEpochCallback; metric=metric)
end # end of all epochs
