Scale and Bias Layers #3591

Merged 2 commits into BVLC:master on Jan 27, 2016

Conversation

jeffdonahue
Contributor

This PR combines @ducha-aiki's ChannelwiseAffineLayer (#2996) with my Scalar (#3021) and Bias (#3550) layers, for appropriate credit, and should have all the advantages of each. (After some discussion we decided to name the scaling part Scale for simplicity.) ScaleLayer alone can now replace ChannelwiseAffineLayer by setting scale_param { bias_term: true }, with a combined GPU kernel that both scales and adds to the input; this should give the performance advantage that @ducha-aiki measured in the discussion in #3229, while still allowing the modularity of separating the two operations when desired.

Both ScaleLayer and BiasLayer can take a single bottom to learn the scale/bias as a parameter, or two bottoms so the scale/bias is taken as an input*. The dimensions of the scale/bias blob may be any subsequence of the dimensions of the first bottom. The operation can be thought of as a (virtual) reshaping and tiling to the shape of the first Blob, followed by elementwise addition/multiplication. The same operations could be performed by composing Reshape, Tile, and Eltwise layers, but in any case except where EltwiseLayer alone suffices, that would be less efficient in memory and performance, often substantially so.
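
For illustration, a minimal prototxt sketch (layer and blob names here are hypothetical) of a ScaleLayer configured this way, i.e. a learned per-channel scale plus bias applied to a single bottom:

```
layer {
  name: "scale1"            # hypothetical layer name
  type: "Scale"
  bottom: "conv1"           # hypothetical input blob, e.g. N x C x H x W
  top: "conv1_scaled"
  scale_param {
    bias_term: true         # also learn a bias and add it along with the scaling
  }
}
```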

@ducha-aiki hopefully this is the best of both worlds in terms of performance, generality, and modularity -- let me know if you have any feedback though. Otherwise we will try to get this reviewed and merged soon.

*I'm happy to see this excess logic simplified/removed if/when @longjon's param_bottom and ParameterLayer work is merged, but for now this is the best way I could think of to address many different use cases for the layers.

ChannelwiseAffineLayer is separated and generalized into BiasLayer and ScaleLayer. The behavior of ChannelwiseAffineLayer can be reproduced by a ScaleLayer with `scale_param { bias_term: true }`.

BiasLayer and ScaleLayer each take 1 or 2 bottoms, with the output having the same shape as the first. The second input -- either another bottom or a learned parameter -- will have its axes (virtually) broadcast and tiled to have the same shape as the first, after which elementwise addition (Bias) or multiplication (Scale) is performed.
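
As a hedged illustration of that broadcasting (all names and shapes below are hypothetical): with a first bottom of shape N x C x H x W and a second input of shape C aligned to axis 1, the scale or bias is applied per channel.

```
# Learned per-channel bias: the (C,)-shaped parameter is broadcast over N, H, W.
layer {
  name: "bias1"
  type: "Bias"
  bottom: "data"            # shape N x C x H x W (hypothetical)
  top: "data_biased"
  bias_param {
    axis: 1                 # align the bias blob with the channel axis
    num_axes: 1             # bias blob has shape (C,)
    filler { value: 0 }     # constant filler: start from zero bias
  }
}

# Same idea with two bottoms: the per-channel scales come from another blob
# instead of a learned parameter.
layer {
  name: "scale_from_bottom"
  type: "Scale"
  bottom: "data"            # shape N x C x H x W
  bottom: "channel_weights" # shape C, broadcast over N, H, W
  top: "data_scaled"
  scale_param { axis: 1 }
}
```
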
@ducha-aiki
Contributor

@jeffdonahue LGTM.

@shelhamer
Member

Thanks @ducha-aiki and @jeffdonahue for the scale + bias layers!

shelhamer merged commit dc831aa into BVLC:master on Jan 27, 2016
@siddharthm83

cool!

@ducha-aiki
Contributor

@shelhamer nice!

@lfrdm

lfrdm commented Jan 29, 2016

Hi guys. Thanks a lot for the great work on the batch normalization layer. To make sure I implement it correctly, per the paper, in my train_val.prototxt: first one computes the normalized batch with a BatchNorm layer (after my ReLUs), and then uses a ScaleLayer with scale_param { bias_term: true } and a BiasLayer to learn the scale and bias of the normalized batch?

@jeffdonahue
Contributor Author

Batch norm is, in the original paper and in typical use, placed before the activation (ReLU or otherwise), not after. You should use ScaleLayer with bias_term: true (should give the best performance), or separately use ScaleLayer (with bias_term: false, the default) followed by BiasLayer. A ScaleLayer with bias_term followed by a BiasLayer would wastefully learn an extra bias (and effectively double its learning rate).
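
For concreteness, a minimal sketch of this arrangement (layer and blob names are hypothetical): BatchNorm followed by a single ScaleLayer with bias_term: true, placed before the ReLU, all computed in place:

```
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"                      # in-place normalization
}
layer {
  name: "conv1/bn/scale"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }   # learned scale (gamma) and shift (beta) in one layer
}
layer {
  name: "conv1/relu"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
```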

@lfrdm

lfrdm commented Jan 29, 2016

Thanks @jeffdonahue for the answer and explanation. I'm curious, though, how the normalization is handled at test time. In the introduction of the paper they refer to using the normalization only on the training batches. Do I have to set use_global_stats differently for the training and testing phases, or is it handled internally by the batch_norm_layer?
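
For reference, one way to make the per-phase behavior explicit is to define the BatchNorm layer twice with include rules, as in the hedged sketch below (names hypothetical); whether this is necessary, or the layer already picks a suitable default from the phase, is exactly the question above.

```
# Hypothetical explicit per-phase setting: batch statistics while training,
# accumulated running averages at test time.
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param { use_global_stats: false }  # TRAIN: normalize with the batch mean/variance
  include { phase: TRAIN }
}
layer {
  name: "conv1/bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param { use_global_stats: true }   # TEST: normalize with the stored global stats
  include { phase: TEST }
}
```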

@cuihenggang

Does anyone have an updated train_val file for the Inception-BN network with the Scale/Bias layers added (for the ILSVRC12 dataset)?
