Batch normalization layer with test and examples #1965

ducha-aiki · 2015-02-25T15:14:40Z

Implemented batch normalization layer (see http://arxiv.org/abs/1502.03167) based on @ChenglongChen and @Russell91 code with fixes and improvements.
Also added shuffling pool by @jjkjkj of the input data to the data_layer to not to have same files together in same batch. Tests passes and rebased on master.
For illustration of the effectiveness two examples of CIFAR-10 classifier with sigmoid non-linearity with and without batch normalization.

weiliu89 · 2015-02-25T15:29:52Z

Nice! I guess cifar_baseline is using ReLU instead of Sigmoid? Do you have any training examples of using ReLU + BN? Does it converge much faster than not using BN as stated in the paper?

ChenglongChen · 2015-02-25T16:36:59Z

@ducha-aiki,
Nice work!
I myself have the problem to figure out how to implement per-epoch (or mini-batch) shuffling.

As for the fixed mean and variance for inference, I think we can hack that by using two extra vars to keep track of the (exponential) moving averaging mean or variance, and use those instead of the current batch_mean and batch_variance for normalization in TEST phase.

Besides, the current implementation keeps two "copies" of the blob (buffer_blob and x_norm) which might be a bit memory consuming when using big deep net. It might worth considering switching to for-loop rather than the current BLAS vectorization as @Russell91 did in his init commit.

weiliu89 · 2015-02-25T17:05:15Z

@ChenglongChen

I used different "stage" at the beginning of TEST "phase" to compute mean and variance from a few training mini-batches. Moving average might work during training as mentioned in Section 3.1 in the paper.

ChenglongChen · 2015-02-25T17:27:20Z

@weiliu89,
I haven't totally understood what Caffe's params refer to. I have to check what's the difference between stage and phase. But, it sounds like you are doing what Algorithm 2 in the paper describes (the inference part), right?

weiliu89 · 2015-02-25T17:58:19Z

@ChenglongChen

Yes. I refer to how Caffe handle phase, and add several stages that bn_layer does different thing in different stage. This is the easiest way I can think of to implement Algorithm 2. It seems working well, but I haven't debug it though

sunbaigui · 2015-02-26T09:03:12Z

@ducha-aiki does master branch supports googlenet network now?
why not merge this into dev branch, and experiment on

Figure 4 in the paper.
Has anyone tried these models yet?

ducha-aiki · 2015-02-26T09:20:00Z

@weiliu89, we will add some graphs with ReLU-CIFAR later. For now, BN-model converges very fast, but a bit less accurate than non-BN. If we make a bit deeper model, than cifar-baseline, than BN converges faster and to more accurate one. And thank you for stage-phase suggestion

@sunbaigui, sure, if you sponsor us with some GPU. When I have trained my modification of the GoogLeNet, it took 3 weeks. Even if it would be 7 times faster, it is too much GPU time to spent for us :)
About master and dev - there is no dev anymore (see #1943 ) and master supports GoogLeNet.

@ChenglongChen we will think about loop-based implementation. However, you are welcome to make PR into this branch :)

jjkjkj · 2015-02-26T14:04:22Z

cifar_baseline is example/cifar/train_full.sh
@weiliu89 I could not archive cifar_baseline accuracy with batch normalization, without net modification( simply removing lrn, adding bn before or after every relu in examples/cifar10/cifar10_full_train_test.prototxt and only playing with learning rate)
I think this due to over-simplicity of this net and the fact than lrn for this net is enough normalization already.

I also trained variation of vgg16 on cifar with and without batch normalization
The only difference in architecture, compared to original vgg16, is shrieked down ip layers and adding dropout(not sure this is necessary for bn network)
Also I forgot to turn off PReLU and MSR weight filler in this networks, but I think that this makes little to no difference.

https://gist.github.com/jjkjkj/39e87099e9381e6886a5

For no_bn net base_lr=0.001 causes net to diverge. For bn net lr is first guess, so maybe with bigger lr it will converge faster and better.

@ChenglongChen @weiliu89 @ducha-aiki About test phase. I tested cifar_vgg16 whith range of batch sizes (2-250) in test phase and found very small changes in accuracy (with batch size 2 accuracy is only 1% less than with 250)

ducha-aiki · 2015-02-26T14:15:32Z

@shelhamer @longjon @jeffdonahue Could you please review this PR?

weiliu89 · 2015-02-26T15:47:47Z

I think that current PR doesn't compute mean and variance from training images (or moving mean and variance) during testing phase, but it compute mean and variance from test mini-batch, which I think is not exactly the same as described in the paper. I am not sure how much it affects the test accuracy.

yangyi02 · 2015-02-26T22:22:57Z

@jjkjkj For the cifar experiment, do you try the comparison adding bn before or after every relu? Does that matter?

jjkjkj · 2015-02-26T22:40:56Z

@yangyi02 Yes i tried and found no difference(with examples/cifar10/cifar10_full_train_test.prototxt). As i said i think that this net is bad example for batch normalization.
As for cifar_vgg16, now i'm training net without dropout.

nakosung · 2015-03-02T05:46:32Z

To feed evaluation network mean/var, setting mean = beta, var = (1/gamma)^2 will be OK? (Learned beta, gamma is similar to true mean/var?)

justfortest1 · 2015-03-02T07:56:07Z

@ducha-aiki when batchsize=1 in test phase, can it also work?

ducha-aiki · 2015-03-02T08:42:45Z

@justfortest1 no.

lsy1993311 · 2015-03-10T05:52:25Z

@weiliu89 Could you please share you implementation? I think different stages should be considered.

weiliu89 · 2015-03-10T13:56:38Z

@lsy1993311 My Caffe version is old and I am not familiar with how to upload the code and haven't tested it. The high level idea is to include set_stage(string) and stage() in include/caffe/common.hpp (Refer to set_phase() and phase() in the same file). Then in src/caffe/solver.cpp, I add a function at the beginning of TestAll() which tries to compute mean & std. In the function, I set phase to TRAIN, and include two stages by using set_stage() as described before. The first stage is called "aggregation" which does several iteration of Forward pass to aggregate mean & std from a few mini-batches; the second stage is called "finalize" which compute the final mean & std by dividing the number of mini-batches you have passed. Finally, in batch_norm_layer, I can call Caffe::stage() and implement some additional thing in order to handle different stage (e.g. "aggregate" and "finalize"). I won't go in details into how to do it as it should be trivial.

However I don't have time to really debug this thoroughly. One thing to notice is that what I described above needs to compute mean & std every time I call TestAll() which might not necessary because it costs extra computation during training. On the other hand, you can only call the function in Snapshot() and use moving mean & std during training as described in the paper (can set a different stage for doing this during training).

melgor · 2015-03-17T13:45:51Z

I have been testing that version with my data set using VGG16 model. And it works, speed up converge.
But it does not produce better result than normal version. Even more, batch size on test influence on accuracy.
@weiliu89 propose resolving that problem. I will try to implement it. But this is only mismatch between Caffe code and Paper?

ducha-aiki · 2015-03-17T13:50:06Z

@melgor, as far as I see - yes. It would be great, if you help to implement @weiliu89 solution.

melgor · 2015-03-18T08:09:58Z

I have just found implementation of @ChenglongChen which implement BN with right code in Test Phase. It save the mean and variance in BN layer. It looks like better implementation because it does not need to change Solver code. But it does not calculate mean and variance within all Train data, only update the value of statistic using ex: S_{t+1} = decay * Y_{t+1} + (1 - decay) * S_{t}, where decay is parameter.

What do you think about such implementation? @ChenglongChen, does it work better than mini-batch statistic?

More information here:
https://github.com/ChenglongChen/caffe-windows/blob/master/src/caffe/layers/bn_layer.cpp

ChenglongChen · 2015-03-18T12:54:10Z

Sorry guys. I have been caught up with work this moment, so I don't have time to test it out thoroughly. The use of exponentially weighted moving average (EWMA) is simply due to the fact that BN tends to keep the distribution of activation stable (?).

The algo2 in the paper is a bit complicate:

before the TEST phase, we forward a few mini-batch to compute the mean & var for the 1st BN layer, then we save this mean & var for other round inference (& forward)
we then forward those mini-batch to compute the mean & var for the 2nd BN layer, notice that the normalization part of the 1st BN layer is carried out using mean & var computed in step1 not the mini-batch statistics.
similarly, we perform the above for the rest BN layers.
after computing all the mean & var, we then have the inference BN network.

jjkjkj · 2015-03-25T11:34:04Z

@lsy1993311 As exepcted: slightly faster initial training but strong overfitting(it's natural whet parameters >> dataset). So, BN does not always remove need of dropout.

nakosung · 2015-04-14T14:44:50Z

What if removing BN layers at testing phase? I mean that no normalization/reconstruction will be used during testing.

borisgin · 2015-04-19T14:56:49Z

Hi Dmitro, nice work!
When I tryied to train cifar_bn, I got endless warnings "force_color: ...". Looks like a minor bug, So I just commented these lines in data_layer. cpp.
Another observation: when I put bn layer just before convolutional & Ip layer, I obtained faster convergence then when bn_layer is located after convolutional layer:

ducha-aiki · 2015-04-20T08:58:33Z

@weiliu89, thanks for catch!

@borisgin Hi Boris, thanks for for observation. It is interesting, that is it very architecture dependent: we have tried on other architectures and there was no difference, as stated in original paper. Still lot of place for exploring :)
The force_color bug is fixed now, thanks.

andrei-pokrovsky · 2015-07-31T21:30:52Z

weiliu89>>I think that current PR doesn't compute mean and variance from training images (or moving mean and variance) during testing phase, but it compute mean and variance from test mini-batch, which I think is not exactly the same as described in the paper. I am not sure how much it affects the test accuracy.

FWIW I agree, that's different from the paper description. Also what if the test data contains only one sample (single image inference)?

xuzhm · 2015-08-05T11:44:19Z

bug?? occurs "nan" .........

after debug found that batch_variance_ have negative number .....

AIROBOTAI · 2015-08-22T14:20:47Z

Hi @ducha-aiki, and others, thanks for your excellent work! Here I have a question about your code. Could you please explain to me what is the purpose for the codes starting from the line No. 153 to line NO. 185 in the function "void DataLayer::InternalThreadEntry()" in data_layer.cpp? I just cannot figure out why these codes should be there when the "datum.encoded()" is true. Thx a lot in advance!

ducha-aiki · 2015-08-22T14:31:52Z

@AIROBOTAI it needs for data shuffling. With batch normalization, it is important that network don`t see same images together in batch, so this lines implement shuffling.
As for the encoded datum - it is only mode we use, so we implemented shuffling only for it.

AIROBOTAI · 2015-08-22T16:59:32Z

@ducha-aiki Thanks for your prompt reply! But what should I do if my datum is NOT encoded? I have checked the return value of datum.encoded() to find it to be false. So in this case, those lines of code for shuffling will be jumped over.

ducha-aiki · 2015-08-22T17:02:05Z

@AIROBOTAI then you can regenerate LMDB with encoded key, or add same lines to unencoded branch of if :)

ctensmeyer · 2015-08-22T17:12:43Z

One alternative to true shuffling is to do random skips. The DataLayer has
a parameter called rand_skip that causes the DB cursor to start in a random
position in the DB. It is easy to extend that concept so that the DB skips
a random number of instances every time it advances its cursor. You can do
so by modifying the last part of the InternalThreadEntry() to run the
cursor->next() in a loop for a random number of iterations. This way the
mini-batch is never composed of the same instances and the probability of
two instances being in the same batch is inversely proportional to the size
of the skip. I find in practice that this doesn't slow down DB reads very
much because it is still kind of sequential.

Hope that helps.

On Sat, Aug 22, 2015 at 11:02 AM, Dmytro Mishkin [email protected]
wrote:

@AIROBOTAI https://github.com/AIROBOTAI then you can regenerate LMDB
with encoded key, or add same lines to unencoded branch of if :)

—
Reply to this email directly or view it on GitHub
#1965 (comment).

AIROBOTAI · 2015-08-24T15:01:00Z

@ducha-aiki I have modified some seemingly confusing codes in your code to make it more straight to me. Now it works, thanks again!

AIROBOTAI · 2015-08-24T15:06:00Z

@waldol1 thanks for your suggestion! Your method seems more easy to use than the shuffling pool proposed in this pull request. I'd also like to know whether you have tested the test accuracy using your method and how is the performance. @ducha-aiki what's your comments on this new shuffling method? Thanks for your all!

ctensmeyer · 2015-08-24T16:32:42Z

I haven't tested this shuffling method with regards to BatchNorm, but it
seemed to help a little on an mnist autoencoder that I was testing when the
training set was small.

On Mon, Aug 24, 2015 at 9:06 AM, AIROBOTAI [email protected] wrote:

@waldol1 https://github.com/waldol1 thanks for your suggestion! Your
method seems more easy to use than the shuffling pool proposed in this pull
request. I'd also like to know whether you have tested the test accuracy
using your method and how is the performance. @ducha-aiki
https://github.com/ducha-aiki what's your comments on this new
shuffling method? Thanks for your all!

—
Reply to this email directly or view it on GitHub
#1965 (comment).

talda · 2015-08-30T07:27:26Z

@ducha-aiki @shelhamer - is there still plan to pull this in?
If so two suggestions:

separate the data shuffling into another PR, also find a solution that works for all types of data layers.
keep a running average of batch statistics so this PR could be used for classifying a single image.

bhack · 2015-08-30T11:06:23Z

@talda This is still to young to be reviewed. It is only six months old. :)

ducha-aiki · 2015-08-30T11:15:48Z

@bhack actually, this PR is not needed at all, if believe to Google.
Here are slides http://lsun.cs.princeton.edu/slides/Christian.pdf
Pre-last one says:
"Releasing Pretrained Inception and MultiBox
Academic criticism: Results are hard to reproduce
We will be releasing pretrained Caffe models for:
● GoogLeNet (Inception 5)
● BN-Inception (Inception 6)
● MultiBox-Inception proposal generator (based on
Inception 6)
Contact @Yangqing "

talda · 2015-08-30T12:22:40Z

@bhack too bad Github does not have like/upvote button for comments. I would defiantly upvote your previous comment.

erogol · 2015-08-30T18:07:08Z

@ducha-aiki Can I classify a single image by using that PR's modifications and batch normalization ?

ducha-aiki · 2015-08-30T19:13:27Z

@erogol No.

erogol · 2015-09-03T08:45:56Z

@ducha-aiki I applied moving average and it now works.

ducha-aiki · 2015-10-23T06:43:26Z

Closed, because of #3229 merge.The scale + bias could be taken from #2996

Added bn_layer.[cpp/cu] with corresponding hpp file. Performs batch-normalization with in-place scale/shift. Originally created by ducha-aiki: https://github.com/ducha-aiki ChenglongChen: https://github.com/ChenglongChen Russell91: https://github.com/Russell91 jjkjkj: https://github.com/jjkjkj detailed discussion of this implementation can be found at: BVLC#1965

Added batch normalization layer with test and examples

15117f1

This was referenced Feb 25, 2015

Added batchnorm layer with tests #1891

Closed

Added batchnorm layer. #1867

Closed

make lint happy on test_bn_layer

a144318

ducha-aiki mentioned this pull request Feb 27, 2015

Implement Google Batch Normalization to beat human on ImageNet classification task #1990

Closed

ducha-aiki mentioned this pull request Mar 16, 2015

Has anyone test Batch Normalization technique in general models? #2132

Closed

ChenglongChen mentioned this pull request Mar 18, 2015

BN inference lim0606/ndsb#1

Open

Rebased and remove redundant code

2ca853e

beniz mentioned this pull request Jun 6, 2015

Support for batch normalization jolibrain/deepdetect#7

Closed

4 tasks

lukeyeager mentioned this pull request Sep 1, 2015

Batch Normalization in Caffe? NVIDIA/caffe#29

Closed

seanbell mentioned this pull request Sep 5, 2015

BN Inception #3035

Closed

ducha-aiki mentioned this pull request Oct 7, 2015

Batch Normalization Layer #3161

Closed

cdoersch mentioned this pull request Oct 21, 2015

Yet another batch normalization PR #3229

Merged

ducha-aiki closed this Oct 23, 2015

williford mentioned this pull request Oct 18, 2016

How you implemented BatchNorm? williford/segmentation-caffe#1

Closed

Batch normalization layer with test and examples #1965

Batch normalization layer with test and examples #1965

Conversation

ducha-aiki commented Feb 25, 2015

weiliu89 commented Feb 25, 2015

ChenglongChen commented Feb 25, 2015

weiliu89 commented Feb 25, 2015

ChenglongChen commented Feb 25, 2015

weiliu89 commented Feb 25, 2015

sunbaigui commented Feb 26, 2015

ducha-aiki commented Feb 26, 2015

jjkjkj commented Feb 26, 2015

ducha-aiki commented Feb 26, 2015

weiliu89 commented Feb 26, 2015

yangyi02 commented Feb 26, 2015

jjkjkj commented Feb 26, 2015

nakosung commented Mar 2, 2015

justfortest1 commented Mar 2, 2015

ducha-aiki commented Mar 2, 2015

lsy1993311 commented Mar 10, 2015

weiliu89 commented Mar 10, 2015

melgor commented Mar 17, 2015

ducha-aiki commented Mar 17, 2015

melgor commented Mar 18, 2015

ChenglongChen commented Mar 18, 2015

jjkjkj commented Mar 25, 2015

nakosung commented Apr 14, 2015

borisgin commented Apr 19, 2015

ducha-aiki commented Apr 20, 2015

andrei-pokrovsky commented Jul 31, 2015

xuzhm commented Aug 5, 2015

AIROBOTAI commented Aug 22, 2015

ducha-aiki commented Aug 22, 2015

AIROBOTAI commented Aug 22, 2015

ducha-aiki commented Aug 22, 2015

ctensmeyer commented Aug 22, 2015

AIROBOTAI commented Aug 24, 2015

AIROBOTAI commented Aug 24, 2015

ctensmeyer commented Aug 24, 2015

talda commented Aug 30, 2015

bhack commented Aug 30, 2015

ducha-aiki commented Aug 30, 2015

talda commented Aug 30, 2015

erogol commented Aug 30, 2015

ducha-aiki commented Aug 30, 2015

erogol commented Sep 3, 2015

ducha-aiki commented Oct 23, 2015