Add BiasLayer to add two Blobs with broadcasting #3550
Closed
This adds BiasLayer, designed analogously to ScalarLayer (#3021), to add blobs with arbitrary axes broadcast. It could be used together with ScalarLayer to learn the batch norm scale and shift parameters, or independently anywhere in a network to learn a bias without a corresponding multiplication. Even more generally, it can be used to efficiently add two blobs with any number of corresponding axes, which can currently only be accomplished (in the most general case) rather inefficiently: with a pair of `Reshape`s and `Tile`s (to broadcast leading and trailing axes) followed by the `Eltwise` `SUM` operation.

This is currently based on ScalarLayer (for caffe.proto ID sequencing), with the last two commits being the relevant ones -- I'm happy to rebase this without ScalarLayer if we want to merge this before or without that.
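For concreteness, here is a minimal prototxt sketch of the two-bottom broadcasting form; the `Bias` type name and the `bias_param` fields (an `axis` mirroring ScalarLayer's interface) are assumptions for illustration, not confirmed names from this PR.

```
# Broadcast-add a per-channel bias blob to a 4D blob (sketch;
# field names assumed analogous to ScalarLayer's).
layer {
  name: "add_bias"
  type: "Bias"
  bottom: "data"          # shape: N x C x H x W
  bottom: "channel_bias"  # shape: C -- broadcast over N, H, W
  top: "data_biased"
  bias_param { axis: 1 }  # align channel_bias with the C axis of data
}
```

This replaces the Reshape/Tile/Eltwise chain described above with a single layer and no materialized tiled copy of the bias blob.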
Both this and ScalarLayer can either take two bottoms, specifying both inputs to the function, or take a single bottom and learn the second as a parameter.
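A sketch of the single-bottom form, where the bias is learned as a layer parameter (again assuming the `bias_param` fields mirror ScalarLayer's `axis`/`num_axes`/`filler`; treat the exact names as placeholders):

```
# Learn one bias value per channel as a layer parameter (sketch).
layer {
  name: "learned_bias"
  type: "Bias"
  bottom: "conv1"
  top: "conv1_biased"
  param { lr_mult: 1 decay_mult: 0 }  # typically no weight decay on a bias
  bias_param {
    axis: 1                           # start broadcasting at the channel axis
    num_axes: 1                       # learn a 1D bias of length C
    filler { type: "constant" value: 0 }
  }
}
```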
A different approach for learning the BN scale/shift parameters that I haven't looked at yet is in #2996 (by @ducha-aiki), which learns both sets of parameters together. @cdoersch and I and anyone else interested (possibly @longjon and @shelhamer) should take a look at both and evaluate the benefits, with merge priority for any shared functionality given to @ducha-aiki's #2996 as the earlier PR.
Personally, I do like the approach of having layers do as little as possible, which is why for my own work I've taken the route of using two independent layers, roughly as sketched below.
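A rough illustration of that two-layer approach for batch norm scale/shift; the `Scalar` type name and `scalar_param` fields are purely hypothetical here and may differ from what #3021 actually defines:

```
# BatchNorm followed by a learned per-channel scale and shift,
# each as its own layer (sketch; Scalar's interface is assumed).
layer { name: "bn1" type: "BatchNorm" bottom: "conv1" top: "bn1" }
layer {
  name: "bn1_scale" type: "Scalar" bottom: "bn1" top: "bn1"  # in-place
  scalar_param { axis: 1 filler { type: "constant" value: 1 } }
}
layer {
  name: "bn1_shift" type: "Bias" bottom: "bn1" top: "bn1"    # in-place
  bias_param { axis: 1 filler { type: "constant" value: 0 } }
}
```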