rmsprop causing strange loss of accuracy part way through training #99

genixpro opened this issue Mar 7, 2016 · 8 comments

genixpro commented Mar 7, 2016

I've been using Adagrad normally, but I decided to try Rmsprop to see if it improves accuracy. In our tests, rmsprop seemed to converge faster and to a higher maximum, so we started training our big models using it. However, I have noticed something strange that happens during training. It seems like randomly, the accuracy will suddenly precipitously drop, with loss suddenly shooting up. Sometimes I've even seen "infinity" in our testing results when this happens - as if one of the model parameters got accidentally changed to infinity, causing a cascade of failed calculations. See these results:

This is one of the first rmsprop runs:

decayed learning rate by a factor 0.97 to 0.012665023782736
iteration 6800/17090000, seq_length = 500, loss = 25.60889482, loss/seq_len = 0.02560889, gradnorm = 1.3048e+01. Time Elapsed: 3070 seconds
iteration 6850/17090000, seq_length = 500, loss = 35.99438245, loss/seq_len = 0.03599438, gradnorm = 1.7849e+01. Time Elapsed: 3158 seconds
iteration 6900/17090000, seq_length = 500, loss = 14.20753793, loss/seq_len = 0.01420754, gradnorm = 1.6731e+01. Time Elapsed: 3185 seconds
iteration 6950/17090000, seq_length = 500, loss = 31.02228065, loss/seq_len = 0.03102228, gradnorm = 2.1421e+01. Time Elapsed: 3205 seconds
decayed learning rate by a factor 0.97 to 0.012285073069254
iteration 7000/17090000, seq_length = 500, loss = 126072.68073179, loss/seq_len = 126.07268073, gradnorm = 9.3243e+03. Time Elapsed: 3183 seconds
iteration 7050/17090000, seq_length = 500, loss = 71258.54748077, loss/seq_len = 71.25854748, gradnorm = 9.2335e+03. Time Elapsed: 6792 seconds
iteration 7100/17090000, seq_length = 500, loss = 59993.95191604, loss/seq_len = 59.99395192, gradnorm = 8.9946e+03. Time Elapsed: 3071 seconds
iteration 7150/17090000, seq_length = 500, loss = 80161.97462837, loss/seq_len = 80.16197463, gradnorm = 9.0648e+03. Time Elapsed: 3223 seconds
decayed learning rate by a factor 0.97 to 0.011916520877176
iteration 7200/17090000, seq_length = 500, loss = 62363.37415352, loss/seq_len = 62.36337415, gradnorm = 6.3187e+03. Time Elapsed: 3077 seconds
iteration 7250/17090000, seq_length = 500, loss = 77396.41234885, loss/seq_len = 77.39641235, gradnorm = 6.3629e+03. Time Elapsed: 2930 seconds
iteration 7300/17090000, seq_length = 500, loss = 66974.65153092, loss/seq_len = 66.97465153, gradnorm = 5.9655e+03. Time Elapsed: 2989 seconds
iteration 7350/17090000, seq_length = 500, loss = 34369.91119689, loss/seq_len = 34.36991120, gradnorm = 5.8163e+03. Time Elapsed: 2813 seconds

Notice what happens around iteration 7000. The loss just shoots up all of a sudden. If I check the testing results, the testing loss is "infinity". It goes back to normal in subsequent iterations. At first I thought it was a rare hardware issue, but then a different model did the same thing:

Iteration   Time   Training Loss   Testing Loss    Testing # Correct   Testing # Wrong   Testing # Total   Accuracy
1000        3032   1.998393671     3.460828          8220              140937            149157             5.51
2000        3321   1.506352061     1.13135852      106180               42977            149157            71.19
3000        3389   0.6526988754    0.6081444923    126793               22364            149157            85.01
4000        3382   0.4032474733    0.4583896942    131588               17569            149157            88.22
5000        3075   2.197617545     17.48262351      60603               88554            149157            40.63

In this second example, I can see the point where the loss starts shooting up in the logs. It doesn't appear to be instantaneous - perhaps an error is made in one iteration that slowly cascades until it affects everything.

decayed learning rate by a factor 0.97 to 0.01825346
iteration 4400/17090000, seq_length = 500, loss = 0.38249470, gradnorm = 8.0499e+01. Time Elapsed: 3280 seconds
iteration 4450/17090000, seq_length = 500, loss = 0.37212085, gradnorm = 2.9393e+02. Time Elapsed: 3426 seconds
iteration 4500/17090000, seq_length = 500, loss = 0.36586265, gradnorm = 8.7689e+01. Time Elapsed: 3288 seconds
iteration 4550/17090000, seq_length = 500, loss = 0.35865728, gradnorm = 5.4034e+01. Time Elapsed: 3416 seconds
decayed learning rate by a factor 0.97 to 0.0177058562
iteration 4600/17090000, seq_length = 500, loss = 0.40036575, gradnorm = 7.8565e+01. Time Elapsed: 3327 seconds
iteration 4650/17090000, seq_length = 500, loss = 0.42660431, gradnorm = 2.2500e+02. Time Elapsed: 3309 seconds
iteration 4700/17090000, seq_length = 500, loss = 0.49915671, gradnorm = 4.2741e+03. Time Elapsed: 3237 seconds
iteration 4750/17090000, seq_length = 500, loss = 0.86534878, gradnorm = 3.5756e+03. Time Elapsed: 3251 seconds
decayed learning rate by a factor 0.97 to 0.017174680514
iteration 4800/17090000, seq_length = 500, loss = 1.24005108, gradnorm = 4.3706e+03. Time Elapsed: 3232 seconds
iteration 4850/17090000, seq_length = 500, loss = 1.22130984, gradnorm = 5.6758e+03. Time Elapsed: 3117 seconds
iteration 4900/17090000, seq_length = 500, loss = 6.12171381, gradnorm = 9.2302e+03. Time Elapsed: 3232 seconds
iteration 4950/17090000, seq_length = 500, loss = 11.80134205, gradnorm = 9.0186e+03. Time Elapsed: 3029 seconds
decayed learning rate by a factor 0.97 to 0.01665944009858
iteration 5000/17090000, seq_length = 500, loss = 17.11424646, gradnorm = 6.3805e+03. Time Elapsed: 3075 seconds

You can see loss going down, and then it starts going up again slowly, which isn't totally unusual. But then it quickly spikes and never recovers! We didn't see any "infinities" in this run, but the same curious sudden change in loss is visible. I wouldn't be surprised if there was actually an infinity, but in one of the iterations in between, where we don't record results.

Does anyone have any insight into what might be happening? I haven't ever seen something like this when using Adagrad - only with the models that we train using rmsprop.


simopal6 commented Apr 21, 2016

Same thing here (see attached picture) and it doesn't seem to recover after the drop.
[attached image: rmsprop_issue]


Kaixhin commented May 7, 2016

I'm not sure if this is what is causing your issue but line 52 looks a little odd to me - usually epsilon should be added inside the square root. I've also seen other implementations use a default of 1e-6 instead of 1e-8.
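
For concreteness, here is a rough Torch-style sketch of the two placements being discussed, with dummy tensors standing in for the optimizer state (illustrative only, not the actual optim.rmsprop code):

require 'torch'

-- Dummy state: parameters, current gradient, running mean of squared gradients.
local x    = torch.randn(10)
local dfdx = torch.randn(10)
local m    = torch.cmul(dfdx, dfdx)
local lr, epsilon = 1e-2, 1e-8

-- (a) epsilon added after the sqrt: denominator = sqrt(m) + epsilon
local denomA = torch.sqrt(m):add(epsilon)
x:addcdiv(-lr, dfdx, denomA)

-- (b) epsilon added inside the sqrt: denominator = sqrt(m + epsilon)
local denomB = torch.sqrt(m + epsilon)
x:addcdiv(-lr, dfdx, denomB)

Either way the denominator is bounded below (by epsilon in (a), by sqrt(epsilon) in (b)), which is why the value of epsilon that makes sense differs between the two.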

andreaskoepf (Contributor) commented:

Could somebody who can reliably reproduce the problem please try to use a higher epsilon value, e.g. something between 0.001 and 0.1? This should limit the maximum gain to 1000x or 10x.
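
For anyone who wants to try that, a quick sketch of passing a larger epsilon through the config table; feval and params are placeholders for your own closure and flattened parameters, and the field names assume optim's usual conventions:

require 'optim'

-- Hypothetical training loop with a raised epsilon (0.01 instead of the 1e-8 mentioned above).
local config = {learningRate = 1e-2, alpha = 0.99, epsilon = 0.01}
local state  = {}
for i = 1, 1000 do
  optim.rmsprop(feval, params, config, state)
end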

Especially near a local minimum the gradient becomes small. Since rmsprop scales the gradient, the resulting gradient step might simply become too large.

@Kaixhin Could you explain why you would prefer to add epsilon before the sqrt-operation? I saw in your rmspropm.lua you are using a relatively high epsilon value of 0.01. Has this resolved the issue for you?


andreaskoepf commented Jun 8, 2016

@Kaixhin it is indeed a bit inconsistent that adagrad adds a constant value of 1e-10 after the sqrt while adadelta does it before the sqrt operation.

In the nice blog post An overview of gradient descent optimization algorithms by @sebastianruder, epsilon is added before taking the sqrt (only for Adam is it added afterwards).


Kaixhin commented Jun 8, 2016

@andreaskoepf My bad, I can't think of a reason to use one over the other - either way prevents a divide by zero. But obviously putting it inside or outside changes the value of epsilon that should be used. My reference of using an epsilon of 1e-6 and inside the square root is Lasagne. I don't know how important this value is to optimisation practically.

As for rmspropm it is actually rmsprop with momentum, as introduced in the Graves paper referenced. I've never experimented with changing any of the default values or replacing it with the original rmsprop.

andreaskoepf (Contributor) commented:

@Kaixhin thx for the link to the Lasagne impl. Lasagne has a default exponential moving average factor rho of 0.9, which is in my opinion more reasonable than torch's current analogous alpha default value of 0.99. Lasagne's eps value of 1e-6 inside the sqrt() means that even with a zero gradient the value will never fall below 0.001 for that weight.
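
For comparison, a rough Torch-style sketch of that Lasagne-flavoured update (illustrative only, not a drop-in patch for rmsprop.lua), using dummy tensors:

require 'torch'

local x    = torch.randn(10)   -- parameters (dummy)
local dfdx = torch.randn(10)   -- current gradient (dummy)
local m    = torch.zeros(10)   -- accumulated mean of squared gradients
local lr, rho, eps = 1e-2, 0.9, 1e-6

m:mul(rho):addcmul(1 - rho, dfdx, dfdx)   -- m = rho*m + (1-rho)*dfdx^2
local denom = torch.sqrt(m + eps)         -- eps inside the sqrt
x:addcdiv(-lr, dfdx, denom)

Since eps sits inside the sqrt, the denominator never falls below sqrt(1e-6) = 1e-3, which is the point made above.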


Kaixhin commented Jun 9, 2016

@andreaskoepf If you have some tests where you observe Torch's rmsprop acting strangely, perhaps try using the formula/values from Lasagne. If that works then we should submit a PR to patch this.

denis-bz commented:

On epsilon inside or outside the sqrt:

x / sqrt( x^2 + r^2 )

is a smooth approximation to sign(x), a sigmoid: roughly x / r for small x, +- 1 for large x, and about 0.9 at x = 2r.

(Does magnifying tiny values make sense? It's an anti soft-threshold; intuition, anyone?)
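
To make that shape concrete, a quick plain-Lua check of the expression at a few values, with r = 1e-3 picked only for illustration:

local r = 1e-3
for _, x in ipairs({1e-5, 1e-4, 1e-3, 2e-3, 1e-2, 1e-1}) do
  print(string.format("x = %-6g  x/sqrt(x^2+r^2) = %.3f", x, x / math.sqrt(x*x + r*r)))
end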
