
Solver switching support & implementation of Nesterov's accelerated gradient and AdaGrad #741

Closed
qipeng wants to merge 19 commits

Conversation

@qipeng
Contributor

qipeng commented Jul 19, 2014


-  virtual void SnapshotSolverState(SolverState * state);
-  virtual void RestoreSolverState(const SolverState& state);
+  void SnapshotSolverState(SolverState * state);
+  void RestoreSolverState(const SolverState& state);
Contributor

Why make any of these three methods non-virtual? Other solvers might have more or different state to store, or other PreSolve steps.

Contributor

In fact this change does nothing; these methods are all implicitly virtual, and the virtual keyword should be restored (see #590).
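A minimal standalone illustration of this point (simplified signatures and a stand-in hierarchy, not Caffe's actual classes): a method declared virtual in the base class remains virtual in every derived class, so removing the keyword changes nothing but readability.

#include <iostream>

class Solver {
 public:
  virtual ~Solver() {}
  virtual void SnapshotSolverState() { std::cout << "base snapshot\n"; }
};

class NesterovSolver : public Solver {
 public:
  // Implicitly virtual: this overrides Solver::SnapshotSolverState even
  // though the keyword is omitted here.
  void SnapshotSolverState() { std::cout << "Nesterov snapshot\n"; }
};

int main() {
  Solver* solver = new NesterovSolver();
  solver->SnapshotSolverState();  // dynamic dispatch: prints "Nesterov snapshot"
  delete solver;                  // safe: Solver has a virtual destructor
  return 0;
}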

Contributor Author

@jeffdonahue @longjon Thanks for the helpful comments! I've restored the virtual keywords accordingly.

@jeffdonahue
Contributor

Hey @qipeng, thanks for this PR -- this looks like really great work! While I'll be excited to see NAG & AdaGrad in Caffe, I'm inclined to hold any new code to a higher standard than existing code in terms of testing. Would you be willing to write some unit tests for your new solvers, e.g. checking that a known correct update value is computed for some small problems? You could refer to the existing NetTests for ideas on how to design/implement these, but if it would help move things along, I'd be happy to write a couple tests for the existing SGDSolver that you could add on to. (In the meantime, I assume @qipeng will keep his solvers branch publicly available in his fork for anyone who wants to play around with them.)

@kloudkl
Contributor

kloudkl commented Jul 21, 2014

@qipeng, please let me know if you solved #30 and/or #53 for me to decide whether to close those issues.

@qipeng
Contributor Author

qipeng commented Jul 21, 2014

Hi @jeffdonahue, thanks for the comments! Since I'm just starting to use and contribute to this code base, I'm not yet familiar with writing proper test cases for solvers, but I'll definitely try. That said, it would be really helpful if you could spare a small amount of time to write a test for SGDSolver. And yes, I intend to keep my solvers branch public and alive so others in the community can feel free to use them.

@qipeng closed this Jul 21, 2014
@qipeng reopened this Jul 21, 2014
@qipeng
Contributor Author

qipeng commented Jul 21, 2014

Hi @kloudkl, I read both issues and they are really insightful.

As for AdaGrad, my implementation follows Duchi's original paper on AdaGrad, without the fancy variations. But I don't imagine extending it would be difficult if the variants are not radically different; if you could point me to their respective references, I'll try to read them and extend this solver further on my branch.
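For concreteness, a rough C++ sketch of that basic per-parameter update (illustrative only, not the code in this PR; the small delta term added for numerical stability is my own assumption):

#include <cmath>
#include <vector>

// Plain AdaGrad step: accumulate squared gradients per parameter and scale
// the learning rate by the inverse square root of that running sum.
void AdaGradUpdate(std::vector<float>* params,
                   const std::vector<float>& grads,
                   std::vector<float>* history,   // running sum of g^2
                   float base_lr, float delta) {
  for (size_t i = 0; i < params->size(); ++i) {
    (*history)[i] += grads[i] * grads[i];
    (*params)[i] -= base_lr * grads[i] / (std::sqrt((*history)[i]) + delta);
  }
}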

As for NAG, my implementation followed Ilya Sutskever's PhD thesis, which should be the same as the equations you listed in #53, with one exception: I'm working within the Solver framework that Caffe now has, which only allows me to alternate between gradient evaluation and parameter update. Hence, to implement the simplified Nesterov's Accelerated Gradient, I keep the parameters at the look-ahead point: each iteration evaluates the gradient there, steps back by momentum times the previous update, and then over-steps by (1 + momentum) times the new update. Technically the last update is inexact (it over-steps by momentum times that update), but since most functions have tiny gradients (and hence updates) near convergence, and since we usually anneal the learning rate near the minimum, this shouldn't be a huge issue in practice. So unless you feel the urge to correct the last over-step (which could be fixed by adding something like a PostSolve function, tailored for NAG, to the existing Solver framework), I think #53 can be considered solved.
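A rough C++ sketch of that "step back and over-step" bookkeeping (illustrative only, not the PR's actual update code): the stored parameters always sit at the look-ahead point phi_t = x_t + momentum * v_t, which is also where the gradient is evaluated.

#include <vector>

void NesterovUpdate(std::vector<float>* params,       // stored at phi_t = x_t + momentum * v_t
                    const std::vector<float>& grads,  // gradient evaluated at phi_t
                    std::vector<float>* history,      // v_t
                    float lr, float momentum) {
  for (size_t i = 0; i < params->size(); ++i) {
    float v_prev = (*history)[i];                      // v_t
    float v_new = momentum * v_prev - lr * grads[i];   // v_{t+1}
    (*history)[i] = v_new;
    // Step back by momentum * v_t, then over-step by (1 + momentum) * v_{t+1}:
    // phi_{t+1} = phi_t - momentum * v_t + (1 + momentum) * v_{t+1}
    (*params)[i] += (1.0f + momentum) * v_new - momentum * v_prev;
  }
}

After the final iteration the stored parameters are phi_T = x_T + momentum * v_T rather than x_T, which is exactly the small inexactness described above.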

@kloudkl
Contributor

kloudkl commented Jul 21, 2014

Let's be agile. Until the original AdaGrad works correctly, implementing any other variant is a low priority.

}

template Solver<float>* GetSolver(const SolverParameter& param);
template Solver<double>* GetSolver(const SolverParameter& param);

Member

Note that if a template function is defined in the header file, you won't need the above two lines.
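A self-contained sketch of the point, with stand-in types rather than the actual Caffe headers: if the template's definition lives only in a .cpp file, the needed specializations must be instantiated explicitly (as the two quoted lines do); if the definition lives in the header, every including translation unit can instantiate it on demand and the two lines become unnecessary.

struct SolverParameter {};  // stand-in for caffe::SolverParameter
template <typename Dtype>
struct Solver {             // stand-in for caffe::Solver<Dtype>
  explicit Solver(const SolverParameter&) {}
  virtual ~Solver() {}
};

// Imagine this definition sits in a .cpp file rather than a header.
template <typename Dtype>
Solver<Dtype>* GetSolver(const SolverParameter& param) {
  return new Solver<Dtype>(param);
}

// Explicit instantiations: without these, callers in other translation units
// would hit link errors. With the definition in the header, they can be dropped.
template Solver<float>* GetSolver(const SolverParameter& param);
template Solver<double>* GetSolver(const SolverParameter& param);

int main() {
  SolverParameter param;
  Solver<float>* solver = GetSolver<float>(param);
  delete solver;
  return 0;
}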

Contributor Author

Thanks for the comments! It's been addressed in a recent commit.

@bhack
Contributor

bhack commented Jul 30, 2014

@qipeng see also this post

@jeffdonahue mentioned this pull request Aug 4, 2014
@shelhamer
Member

@qipeng this is really nice work that I look forward to merging. Can you rebase and add tests, now that @jeffdonahue has provided an example with the SGDSolver tests in #855? Thanks!

@qipeng
Contributor Author

qipeng commented Aug 21, 2014

@shelhamer I just made a commit including solver unit tests for Nesterov and AdaGrad. Some code is commented out, and here are some explanations:

  1. It's a lot more difficult to verify the history of Nesterov and AdaGrad. These histories cannot be directly (by that I mean easily) computed from the solver update;
  2. For AdaGrad it doesn't make sense to check momentum.

Also, it should be noted that given Caffe's framework, where only one gradient evaluation and one update can be performed in each iteration, I implemented Nesterov in a slightly different way (I call it "step back and over-step" in the code). Basically, the only problem is that after the solver finishes solving, the parameters are at x_t + mom*v_t instead of the x_t defined in the original algorithm. This shouldn't be a huge problem in practice, though, since near the optimum the gradients should be close to zero.

@jeffdonahue
Contributor

Great work @qipeng! I consolidated the tests into one file to avoid duplicating so much code -- I made a PR to your branch.

I'd like to see a couple more things before merging this:

  • You commented out the part of CheckLeastSquaresUpdate that checks the value stored in history_, presumably because you aren't storing the actual update history for these solvers. It's fine if the history in your solvers doesn't match SGD's history, but you should still check that its value is in fact whatever you expect it to be; otherwise the "inductive" principle behind the tests doesn't hold up, since the history value from iter K is assumed to be correct when computing the update value at K+1.
  • Since momentum is not used in the AdaGradSolver, please add something like CHECK_EQ(0, momentum) << "Momentum cannot be used with AdaGrad."; somewhere in the AdaGrad code to die if someone tries to use it.
  • Please add an example for each of the new solvers -- MNIST is fine. The autoencoder is sort of a canonical example of a difficult-to-optimize deep architecture and Sutskever et al. [1] showed NAG worked well for it, so I think that would be a good choice. If it's too hard to get working (ideally better than the SGD example) though, LeNet is fine.

[1] http://www.cs.toronto.edu/~fritz/absps/momentum.pdf

@jeffdonahue
Contributor

oops, sorry about my first TODO bullet point; I just reread comment (1) from your previous post -- I'll think about it a bit.

@qipeng
Contributor Author

qipeng commented Aug 26, 2014

@jeffdonahue Thanks for the comments! I've added sample scripts for MNIST, as well as a sanity check of momentum for AdaGradSolver in the last commit.

@jeffdonahue
Contributor

Thanks @qipeng! Made a quick pass through your added code and made a comment, but will do a more thorough (and quite possibly final!) review later.

@qipeng
Contributor Author

qipeng commented Aug 26, 2014

@jeffdonahue Thanks for the careful review! It turns out that after Caffe's main executable was unified, I forgot to add the solver switch to that code, so I had been testing SGDSolver on that .prototxt file for a while. Now everything should be fixed. :)

@jeffdonahue
Contributor

Thanks for the fixes! This is merged into dev.

I did a bit of grooming on the MNIST autoencoder example, including the existing mnist_autoencoder.prototxt:

- use the new MNIST LMDBs rather than LevelDBs
- test on the training set (now actually possible to do correctly, thanks to the MNIST LMDBs)
- compute both L2 and cross-entropy loss (but still optimize only the cross-entropy loss) at both train and test time

I also changed the existing MNIST autoencoder SGD solver to work much better: it now starts at a learning rate of 1e-2 with a step policy, instead of a fixed 1e-4, and stops at iteration 65K instead of 4 million or whatever it was before. I made the symmetric change to the Nesterov solver (the AdaGrad solver already uses base_lr: 0.01 and adjusts its own learning rate).

I also updated the training scripts to run from the root directory per the new convention.

After all those fixes, I also added random_seed: 1701 to the bottom of all 3 solvers and compared their performance (noting that the initial error numbers for the train & test nets were all the same as expected due to the seed). Here are the final results, with Nesterov doing the best overall (by both error metrics), followed closely by SGD, and with AdaGrad in last place (but also quite close, which IMO is impressive considering I manually set the stepsize for SGD and Nesterov solvers, unlike AdaGrad where only a fixed LR is specified):

AdaGrad:

I0901 13:36:30.007884 24952 solver.cpp:232] Iteration 65000, loss = 64.1627
I0901 13:36:30.007922 24952 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:36:33.019305 24952 solver.cpp:289] Test loss: 63.217
I0901 13:36:33.019356 24952 solver.cpp:302]     Test net output #0: cross_entropy_loss = 63.217 (* 1 = 63.217 loss)
I0901 13:36:33.019773 24952 solver.cpp:302]     Test net output #1: l2_error = 2.40951
I0901 13:36:33.019785 24952 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:36:33.462723 24952 solver.cpp:289] Test loss: 62.9406
I0901 13:36:33.462762 24952 solver.cpp:302]     Test net output #0: cross_entropy_loss = 62.9406 (* 1 = 62.9406 loss)
I0901 13:36:33.462770 24952 solver.cpp:302]     Test net output #1: l2_error = 2.41202

SGD:

I0901 13:35:20.426187 20072 solver.cpp:232] Iteration 65000, loss = 61.5498
I0901 13:35:20.426218 20072 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:35:22.780092 20072 solver.cpp:289] Test loss: 60.8301
I0901 13:35:22.780138 20072 solver.cpp:302]     Test net output #0: cross_entropy_loss = 60.8301 (* 1 = 60.8301 loss)
I0901 13:35:22.780146 20072 solver.cpp:302]     Test net output #1: l2_error = 2.02321
I0901 13:35:22.780153 20072 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:35:23.225303 20072 solver.cpp:289] Test loss: 60.6859
I0901 13:35:23.225347 20072 solver.cpp:302]     Test net output #0: cross_entropy_loss = 60.6859 (* 1 = 60.6859 loss)
I0901 13:35:23.225354 20072 solver.cpp:302]     Test net output #1: l2_error = 2.0505

Nesterov:

I0901 13:36:52.466069 22488 solver.cpp:232] Iteration 65000, loss = 59.9389
I0901 13:36:52.466099 22488 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:36:55.068370 22488 solver.cpp:289] Test loss: 59.3663
I0901 13:36:55.068410 22488 solver.cpp:302]     Test net output #0: cross_entropy_loss = 59.3663 (* 1 = 59.3663 loss)
I0901 13:36:55.068418 22488 solver.cpp:302]     Test net output #1: l2_error = 1.79998
I0901 13:36:55.068425 22488 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:36:55.583389 22488 solver.cpp:289] Test loss: 59.3155
I0901 13:36:55.583426 22488 solver.cpp:302]     Test net output #0: cross_entropy_loss = 59.3155 (* 1 = 59.3155 loss)
I0901 13:36:55.583434 22488 solver.cpp:302]     Test net output #1: l2_error = 1.84289

@jeffdonahue closed this Sep 1, 2014
@shelhamer
Member

@qipeng thanks for the solvers and @jeffdonahue thanks for the grooming.

@jeffdonahue Ideally you should keep the same format for merge commit messages, i.e. "Merge pull request #741 from qipeng/solvers", to make it easier to go back and find the thread for a merge.

@jeffdonahue
Contributor

train_net.cpp is reverted (re-deprecated) in dev. Sorry about the merge commit message; I'll try to keep the GitHub format during manual merges in the future.

@cancan101 mentioned this pull request Sep 18, 2014
@mohomran mentioned this pull request Sep 21, 2014