Weight Sharing #546

Merged
merged 3 commits on Jun 26, 2014
Conversation

shelhamer
Member

[Original PR notes copied from #500. This PR replaces it.]

This adds the ability to share parameters between layers, which has a number of applications, the canonical one perhaps being recurrent neural network (RNN) training.

To share weights between two or more layers with parameters (currently just InnerProductLayers and ConvolutionLayers), specify the same param for all of these layers. (You can also name the biases with a second param, as with the blobs_lr and weight_decay parameters.) A very simple example of this is in src/caffe/test/test_net.cpp, in the unit test named InitDiffDataSharedWeightsNet:

layers: {
  name: 'innerproduct1'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    bias_term: false
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  bottom: 'data'
  top: 'innerproduct1'
}
layers: {
  name: 'innerproduct2'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    bias_term: false
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  bottom: 'data'
  top: 'innerproduct2'
}

This means layers innerproduct1 and innerproduct2 share the same set of weights, as they've both specified param: 'sharedweights'. In this case they also take the same bottom blob (data), so their outputs, the top blobs innerproduct1 and innerproduct2, should be identical (this is not actually something you'd ever want to do; I do it there just for testing purposes).

Note that in this case we specify only one blob name because we've set bias_term: false; if we didn't have bias_term: false we'd need to specify two params, but probably the second one should be empty unless we actually want to share biases. (Specifying the empty string as a param is equivalent to not specifying a param in my implementation.)

param: 'sharedweights'
param: ''
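
For example, a layer that keeps its (unshared) bias term while sharing only its weights could be written roughly like this (a sketch following the example above; the name innerproduct3 is just illustrative):

layers: {
  name: 'innerproduct3'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  param: ''
  bottom: 'data'
  top: 'innerproduct3'
}

Here bias_term is left at its default (true), so the first param names the shared weights and the empty second param leaves the bias unshared.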

The entire implementation is in Net::Init, Net::AppendParam, and Net::Update. Init figures out which layer actually "owns" the shared param (the first one to list it), and Update adds the non-owned layers' computed diffs into the diff of the owner blob, then only actually performs updates on owned blobs. Memory-wise, all shared blobs point to the same memory location for the parameter's data, but still have separately allocated diff blobs, since learning rate, weight decay, etc. are still handled per parameter by the Solver (which is blissfully unaware that parameters can be shared).

Open to hearing feedback on the interface, implementation, etc. I'm not sure I'm happy with param as the name of the field; I think param_name or something similar would be less ambiguous, but that would be inconsistent with the other per-parameter field blobs_lr (and to be fully consistent it should really be blobs_name, though I strongly prefer the singular here).

shelhamer mentioned this pull request on Jun 26, 2014
@bhack
Contributor

bhack commented Jun 26, 2014

Does this have any relation to the Composite layer with routing capabilities?

@shelhamer
Member Author

@bhack Not that I can see, although I could have missed it since I've only just taken a cursory glance at the Composite layer.

Caffe already understands DAG models by inserting split layers at forks where a top blob is the bottom blob of more than one layer. With shared weights, one can do the same convolutions, inner products, or whatever on different inputs by defining the layer with the same param name where desired. In this way a composite layer is not needed.
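
For instance, the same convolution can be run on two different inputs like this (a rough sketch; the layer and blob names and the convolution settings are just illustrative):

layers: {
  name: 'conv_level0'
  type: CONVOLUTION
  convolution_param {
    num_output: 16
    kernel_size: 3
    bias_term: false
  }
  param: 'shared_conv_weights'
  bottom: 'pyramid_level0'
  top: 'conv_level0'
}
layers: {
  name: 'conv_level1'
  type: CONVOLUTION
  convolution_param {
    num_output: 16
    kernel_size: 3
    bias_term: false
  }
  param: 'shared_conv_weights'
  bottom: 'pyramid_level1'
  top: 'conv_level1'
}

Both layers read and update a single set of filters.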

That said, it would be nice to have a shorthand in our model definitions with this kind of structure instead of redefining the shared layer over and over with different bottoms and tops. A multiresolution model is a good example: the same convolutions should be done at every level of the pyramid, and it would be more concise to not write this down exhaustively.

The Composite layer raises another point: Caffe could execute layers at the same topological depth simultaneously. At the moment execution is totally serial.

shelhamer changed the title from "weight sharing" to "Weight Sharing" on Jun 26, 2014
@shelhamer
Member Author

I made the changes I suggested at #500 (comment).

@jeffdonahue please review -- if you don't like my follow-up renaming commits, feel free to drop them.

This is otherwise ready for merge IMHO.

@bhack
Contributor

bhack commented Jun 26, 2014

Yes, with shared weights I think we cover almost everything. I agree with you that the composite layer, besides parallel execution, also allows a simpler network notation in YAML, which could probably be adopted in some form in the Caffe protobuf.

@jeffdonahue
Contributor

Cool, thanks for the rebase and name cleanup!

LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
}
}
// Now, update the owned parameters.
Contributor

@shelhamer, if I understood correctly, you add up all the diffs into the owner_diff and then update the parameters accordingly, right? Does that imply that if the layer owning this blob does not contribute to the loss, the diffs computed by the other layers that use this blob but don't own it will not be used?

Contributor

We must either change the ownership from the first layer that mentions the param to the first layer that both mentions the param and participates in the loss, or we need to fix layer_need_backward_[layer_id] accordingly. Is my concern valid?


@ashafaei Was this fixed in the current version?

Member Author

@ashafaei @abhi2610 This was never actually an issue, although it was worth raising since you need to know how the Net, Solver, and Blobs cooperate. The loss / backward logic only decides whether backward is computed for a layer or not. Net::Update() is always called by the solver: all of the shared weight params accumulate their diffs into the owner's diff in the loop at 478, and then Blob::Update() is always called for every weight owner, as in the loop at 501. This does bring up the question of why Blob::Update() is unconditional when it could be skipped, but that's another matter. Thanks @jeffdonahue for the discussion.

@robertsdionne

I see that this pull request adds weight sharing, which recurrent nets will need; however, Caffe takes fixed protocol buffer net descriptors. What would be the best way to implement a mechanism to allow dynamic repetition of sets of layers for input sequences?

For instance, if I want to train a recurrent net for part of speech tagging, for a given sentence with n words, I'll have the following inputs and the following outputs:

inputs
r_0: the initial recurrent state
x_1: the word vector for the first word
x_2: the second word vector
...
x_n: the nth word vector

outputs
t_1: the tag assigned to the first word
t_2: the second tag
...
t_n: the nth tag
r_n+1: the final recurrent state

If the recurrent part has several layers that are repeated for each input word, such as a dropout layer, a concatenation layer, an inner product layer for the recurrent output, a rectified linear layer for the recurrent output, an inner product layer for the classification output, and a softmax layer for the classification output, you can group these into virtual layers that take the following inputs and outputs:

inputs
r_i-1: the previous recurrent state
x_i: the current word vector

outputs
t_i: the current tag
r_i: the current recurrent state

Then, for a given sentence, you could chain together several of these modules made of several layers to feed the recurrent state from one to the next. The repeated configuration could be described by a protocol buffer.
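
With the weight sharing from this PR, two consecutive timesteps of such a module could be unrolled roughly like this (a sketch only; the layer and blob names and dimensions are hypothetical, and the dropout, classification, and softmax layers are omitted for brevity):

# timestep 1: combine r_0 and x_1, then compute the next recurrent state
layers: {
  name: 'concat1'
  type: CONCAT
  bottom: 'r_0'
  bottom: 'x_1'
  top: 'concat1'
}
layers: {
  name: 'recurrent_ip1'
  type: INNER_PRODUCT
  inner_product_param { num_output: 100 bias_term: false }  # state size is illustrative
  param: 'recurrent_weights'
  bottom: 'concat1'
  top: 'r_1'
}
layers: {
  name: 'relu1'
  type: RELU
  bottom: 'r_1'
  top: 'r_1'
}
# timestep 2: the same module again, reusing 'recurrent_weights'
layers: {
  name: 'concat2'
  type: CONCAT
  bottom: 'r_1'
  bottom: 'x_2'
  top: 'concat2'
}
layers: {
  name: 'recurrent_ip2'
  type: INNER_PRODUCT
  inner_product_param { num_output: 100 bias_term: false }
  param: 'recurrent_weights'
  bottom: 'concat2'
  top: 'r_2'
}
layers: {
  name: 'relu2'
  type: RELU
  bottom: 'r_2'
  top: 'r_2'
}

Each additional word repeats the same layers with new blob names but the same param name, which is exactly the repetition a shorthand or generator would have to produce.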

Finally, the entire thing could be seen as one giant virtual layer with a very wide input and a very wide output, the concatenations of the initial recurrent state followed by all the word vectors, and the concatenations of all the tag probabilities followed by the final recurrent state.

This example only works for sequence classification. Recurring over a tree structure would need a different approach.

I've drawn up what I have in mind here:
https://drive.google.com/file/d/0B5t1j58WWjsiYTU1eHBWNFRnLXM/edit?usp=sharing

Another issue is that caffe seems to send training and test data through the layers as matrices of examples rather than individually, for performance reasons. So, with sentences of varying lengths it would probably be best to group them by word count and send these minibatches through appropriately instantiated recurrent nets.

Also, does caffe already have any support for converting text to word vectors for training data?

mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014