
Solver switching support & implementation of Nesterov's accelerated gradient and AdaGrad #741

Closed
qipeng wants to merge 19 commits

Conversation

@qipeng
Contributor

qipeng commented Jul 19, 2014


-  virtual void SnapshotSolverState(SolverState * state);
-  virtual void RestoreSolverState(const SolverState& state);
+  void SnapshotSolverState(SolverState * state);
+  void RestoreSolverState(const SolverState& state);
Contributor

Why make any of these three methods non-virtual? Other solvers might have more or different state to store, or other PreSolve steps.

Contributor

In fact this change does nothing; these methods are all implicitly virtual, and the virtual keyword should be restored (see #590).
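A minimal standalone illustration of this point (simplified signatures and a stand-in hierarchy, not Caffe's actual classes): a method declared virtual in the base class remains virtual in every derived class, so removing the keyword changes nothing but readability.

#include <iostream>

class Solver {
 public:
  virtual ~Solver() {}
  virtual void SnapshotSolverState() { std::cout << "base snapshot\n"; }
};

class NesterovSolver : public Solver {
 public:
  // Implicitly virtual: this overrides Solver::SnapshotSolverState even
  // though the keyword is omitted here.
  void SnapshotSolverState() { std::cout << "Nesterov snapshot\n"; }
};

int main() {
  Solver* solver = new NesterovSolver();
  solver->SnapshotSolverState();  // dynamic dispatch: prints "Nesterov snapshot"
  delete solver;                  // safe: Solver has a virtual destructor
  return 0;
}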

Contributor Author

@jeffdonahue @longjon Thanks for the helpful comments! I've restored the virtual keywords accordingly.

@jeffdonahue
Contributor

Hey @qipeng, thanks for this PR -- this looks like really great work! While I'll be excited to see NAG & AdaGrad in Caffe, I'm inclined to hold any new code to a higher standard than existing code in terms of testing. Would you be willing to write some unit tests for your new solvers, e.g. checking that a known correct update value is computed for some small problems? You could refer to the existing NetTests for ideas on how to design/implement these, but if it would help move things along, I'd be happy to write a couple tests for the existing SGDSolver that you could add on to. (In the meantime, I assume @qipeng will keep his solvers branch publicly available in his fork for anyone who wants to play around with them.)

@kloudkl
Contributor

kloudkl commented Jul 21, 2014

@qipeng, please let me know if you solved #30 and/or #53 for me to decide whether to close those issues.

@qipeng
Contributor Author

qipeng commented Jul 21, 2014

Hi @jeffdonahue, thanks for the comments! Since I'm just starting to use and contribute to this code base, I'm not yet familiar with writing proper test cases for solvers, but I'll definitely try. That said, it would be really helpful if you could spare a small amount of time to write a test for SGDSolver. And yes, I intend to keep my solvers branch public and alive so others in the community can feel free to use them.

@qipeng closed this Jul 21, 2014
@qipeng reopened this Jul 21, 2014
@qipeng
Contributor Author

qipeng commented Jul 21, 2014

Hi @kloudkl, I read both issues and they are really insightful.

As for AdaGrad, my implementation follows Duchi's original paper on AdaGrad, without the fancy variations. But I don't imagine extending it would be difficult if the variants are not radically different; if you could point me to their respective references, I'll try to read them and extend this solver further on my branch.
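For concreteness, a rough C++ sketch of that basic per-parameter update (illustrative only, not the code in this PR; the small delta term added for numerical stability is my own assumption):

#include <cmath>
#include <vector>

// Plain AdaGrad step: accumulate squared gradients per parameter and scale
// the learning rate by the inverse square root of that running sum.
void AdaGradUpdate(std::vector<float>* params,
                   const std::vector<float>& grads,
                   std::vector<float>* history,   // running sum of g^2
                   float base_lr, float delta) {
  for (size_t i = 0; i < params->size(); ++i) {
    (*history)[i] += grads[i] * grads[i];
    (*params)[i] -= base_lr * grads[i] / (std::sqrt((*history)[i]) + delta);
  }
}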

As for NAG, my implementation followed Ilya Sutskever's PhD thesis, which should be the same as the equations you listed in #53, with one exception: I'm working within the Solver framework that Caffe now has, which only allows me to alternate between gradient evaluation and parameter update. Hence, to implement the simplified Nesterov's Accelerated Gradient, I keep the parameters at the look-ahead point: each iteration evaluates the gradient there, steps back by momentum times the previous update, and then over-steps by (1 + momentum) times the new update. Technically the last update is inexact (it over-steps by momentum times that update), but since most functions have tiny gradients (and hence updates) near convergence, and since we usually anneal the learning rate near the minimum, this shouldn't be a huge issue in practice. So unless you feel the urge to correct the last over-step (which could be fixed by adding something like a PostSolve function, tailored for NAG, to the existing Solver framework), I think #53 can be considered solved.
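A rough C++ sketch of that "step back and over-step" bookkeeping (illustrative only, not the PR's actual update code): the stored parameters always sit at the look-ahead point phi_t = x_t + momentum * v_t, which is also where the gradient is evaluated.

#include <vector>

void NesterovUpdate(std::vector<float>* params,       // stored at phi_t = x_t + momentum * v_t
                    const std::vector<float>& grads,  // gradient evaluated at phi_t
                    std::vector<float>* history,      // v_t
                    float lr, float momentum) {
  for (size_t i = 0; i < params->size(); ++i) {
    float v_prev = (*history)[i];                      // v_t
    float v_new = momentum * v_prev - lr * grads[i];   // v_{t+1}
    (*history)[i] = v_new;
    // Step back by momentum * v_t, then over-step by (1 + momentum) * v_{t+1}:
    // phi_{t+1} = phi_t - momentum * v_t + (1 + momentum) * v_{t+1}
    (*params)[i] += (1.0f + momentum) * v_new - momentum * v_prev;
  }
}

After the final iteration the stored parameters are phi_T = x_T + momentum * v_T rather than x_T, which is exactly the small inexactness described above.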

@kloudkl
Contributor

kloudkl commented Jul 21, 2014

Let's be agile. Until the original AdaGrad works correctly, implementing any other variant is a low priority.

}

template Solver<float>* GetSolver(const SolverParameter& param);
template Solver<double>* GetSolver(const SolverParameter& param);

Member

Note that if a template function is defined in the header file, you won't need the above two lines.
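A self-contained sketch of the point, with stand-in types rather than the actual Caffe headers: if the template's definition lives only in a .cpp file, the needed specializations must be instantiated explicitly (as the two quoted lines do); if the definition lives in the header, every including translation unit can instantiate it on demand and the two lines become unnecessary.

struct SolverParameter {};  // stand-in for caffe::SolverParameter
template <typename Dtype>
struct Solver {             // stand-in for caffe::Solver<Dtype>
  explicit Solver(const SolverParameter&) {}
  virtual ~Solver() {}
};

// Imagine this definition sits in a .cpp file rather than a header.
template <typename Dtype>
Solver<Dtype>* GetSolver(const SolverParameter& param) {
  return new Solver<Dtype>(param);
}

// Explicit instantiations: without these, callers in other translation units
// would hit link errors. With the definition in the header, they can be dropped.
template Solver<float>* GetSolver(const SolverParameter& param);
template Solver<double>* GetSolver(const SolverParameter& param);

int main() {
  SolverParameter param;
  Solver<float>* solver = GetSolver<float>(param);
  delete solver;
  return 0;
}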

Contributor Author

Thanks for the comments! It's been addressed in a recent commit.

@bhack
Contributor

bhack commented Jul 30, 2014

@qipeng see also this post

@jeffdonahue mentioned this pull request Aug 4, 2014
@shelhamer
Member

@qipeng this is really nice work that I look forward to merging. Can you rebase and add tests, now that @jeffdonahue has provided an example with the SGDSolver tests in #855? Thanks!

@qipeng
Contributor Author

qipeng commented Aug 21, 2014

@shelhamer I just made a commit including solver unit tests for Nesterov and AdaGrad. Some code is commented out, and here are some explanations:

  1. It's a lot more difficult to verify the history of Nesterov and AdaGrad. These histories cannot be directly (by that I mean easily) computed from the solver update;
  2. For AdaGrad it doesn't make sense to check momentum.

Also, it should be noted that given Caffe's framework, where only one gradient evaluation and one update can be performed in each iteration, I implemented Nesterov in a slightly different way (I call it "step back and over-step" in the code). Basically, the only problem is that after the solver finishes solving, the parameters are at x_t + mom*v_t instead of the x_t defined in the original algorithm. This shouldn't be a huge problem in practice, though, since near the optimum the gradients should be close to zero.

@jeffdonahue
Contributor

Great work @qipeng! I consolidated the tests into one file to avoid duplicating so much code -- I made a PR to your branch.

I'd like to see a couple more things before merging this:

  • You commented out the part of CheckLeastSquaresUpdate that checks the value stored in history_, presumably because you aren't storing the actual update history for these solvers. It's fine if the history in your solvers doesn't match SGD's history, but you should still check that its value is in fact whatever you expect it to be; otherwise the "inductive" principle behind the tests doesn't hold up, since the history value from iter K is assumed to be correct when computing the update value at K+1.
  • Since momentum is not used in the AdaGradSolver, please add something like CHECK_EQ(0, momentum) << "Momentum cannot be used with AdaGrad."; somewhere in the AdaGrad code to die if someone tries to use it.
  • Please add an example for each of the new solvers -- MNIST is fine. The autoencoder is sort of a canonical example of a difficult-to-optimize deep architecture and Sutskever et al. [1] showed NAG worked well for it, so I think that would be a good choice. If it's too hard to get working (ideally better than the SGD example) though, LeNet is fine.

[1] http://www.cs.toronto.edu/~fritz/absps/momentum.pdf

@jeffdonahue
Contributor

oops, sorry about my first TODO bullet point; I just reread comment (1) from your previous post -- I'll think about it a bit.

@qipeng
Contributor Author

qipeng commented Aug 26, 2014

@jeffdonahue Thanks for the comments! I've added sample scripts for MNIST, as well as a sanity check of momentum for AdaGradSolver in the last commit.

@jeffdonahue
Contributor

Thanks @qipeng! Made a quick pass through your added code and made a comment, but will do a more thorough (and quite possibly final!) review later.

@qipeng
Contributor Author

qipeng commented Aug 26, 2014

@jeffdonahue Thanks for the careful review! It turns out that after Caffe's main executable was unified, I forgot to add the solver switch to that code, so I had been testing SGDSolver on that .prototxt file for a while. Now everything should be fixed. :)

@jeffdonahue
Contributor

Thanks for the fixes! This is merged into dev.

I did a bit of grooming on the MNIST autoencoder example, including the existing mnist_autoencoder.prototxt:

- use the new MNIST LMDBs rather than LevelDBs
- test on the training set (now actually possible to do correctly, thanks to the MNIST LMDBs)
- compute both L2 and cross-entropy loss (but still optimize only the cross-entropy loss) at both train and test time

I also changed the existing MNIST autoencoder SGD solver to work much better: it now starts at a learning rate of 1e-2 with a step policy, instead of a fixed 1e-4, and stops at iteration 65K instead of 4 million or whatever it was before. I made the symmetric change to the Nesterov solver (the AdaGrad solver already uses base_lr: 0.01 and adjusts its own learning rate).

I also updated the training scripts to run from the root directory per the new convention.

After all those fixes, I also added random_seed: 1701 to the bottom of all 3 solvers and compared their performance (noting that the initial error numbers for the train & test nets were all the same as expected due to the seed). Here are the final results, with Nesterov doing the best overall (by both error metrics), followed closely by SGD, and with AdaGrad in last place (but also quite close, which IMO is impressive considering I manually set the stepsize for SGD and Nesterov solvers, unlike AdaGrad where only a fixed LR is specified):

AdaGrad:

I0901 13:36:30.007884 24952 solver.cpp:232] Iteration 65000, loss = 64.1627
I0901 13:36:30.007922 24952 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:36:33.019305 24952 solver.cpp:289] Test loss: 63.217
I0901 13:36:33.019356 24952 solver.cpp:302]     Test net output #0: cross_entropy_loss = 63.217 (* 1 = 63.217 loss)
I0901 13:36:33.019773 24952 solver.cpp:302]     Test net output #1: l2_error = 2.40951
I0901 13:36:33.019785 24952 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:36:33.462723 24952 solver.cpp:289] Test loss: 62.9406
I0901 13:36:33.462762 24952 solver.cpp:302]     Test net output #0: cross_entropy_loss = 62.9406 (* 1 = 62.9406 loss)
I0901 13:36:33.462770 24952 solver.cpp:302]     Test net output #1: l2_error = 2.41202

SGD:

I0901 13:35:20.426187 20072 solver.cpp:232] Iteration 65000, loss = 61.5498
I0901 13:35:20.426218 20072 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:35:22.780092 20072 solver.cpp:289] Test loss: 60.8301
I0901 13:35:22.780138 20072 solver.cpp:302]     Test net output #0: cross_entropy_loss = 60.8301 (* 1 = 60.8301 loss)
I0901 13:35:22.780146 20072 solver.cpp:302]     Test net output #1: l2_error = 2.02321
I0901 13:35:22.780153 20072 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:35:23.225303 20072 solver.cpp:289] Test loss: 60.6859
I0901 13:35:23.225347 20072 solver.cpp:302]     Test net output #0: cross_entropy_loss = 60.6859 (* 1 = 60.6859 loss)
I0901 13:35:23.225354 20072 solver.cpp:302]     Test net output #1: l2_error = 2.0505

Nesterov:

I0901 13:36:52.466069 22488 solver.cpp:232] Iteration 65000, loss = 59.9389
I0901 13:36:52.466099 22488 solver.cpp:251] Iteration 65000, Testing net (#0) # train set
I0901 13:36:55.068370 22488 solver.cpp:289] Test loss: 59.3663
I0901 13:36:55.068410 22488 solver.cpp:302]     Test net output #0: cross_entropy_loss = 59.3663 (* 1 = 59.3663 loss)
I0901 13:36:55.068418 22488 solver.cpp:302]     Test net output #1: l2_error = 1.79998
I0901 13:36:55.068425 22488 solver.cpp:251] Iteration 65000, Testing net (#1) # test set
I0901 13:36:55.583389 22488 solver.cpp:289] Test loss: 59.3155
I0901 13:36:55.583426 22488 solver.cpp:302]     Test net output #0: cross_entropy_loss = 59.3155 (* 1 = 59.3155 loss)
I0901 13:36:55.583434 22488 solver.cpp:302]     Test net output #1: l2_error = 1.84289

@jeffdonahue closed this Sep 1, 2014
@shelhamer
Member

@qipeng thanks for the solvers and @jeffdonahue thanks for the grooming.

@jeffdonahue Ideally you should keep the same format for merge commit messages, i.e. "Merge pull request #741 from qipeng/solvers", to make it easier to go back and find the thread for a merge.

@jeffdonahue
Contributor

train_net.cpp is reverted (re-deprecated) in dev. Sorry about the merge commit message; I'll try to keep the GitHub format during manual merges in the future.

@cancan101 mentioned this pull request Sep 18, 2014
@mohomran mentioned this pull request Sep 21, 2014