Weight Sharing #546

Merged
merged 3 commits on Jun 26, 2014
Conversation

shelhamer
Member

[Original PR notes copied from #500. This PR replaces it.]

This adds the ability to share parameters between layers, which has a number of applications, the canonical one perhaps being recurrent neural network (RNN) training.

To share weights between two or more layers with parameters (currently just InnerProductLayers and ConvolutionLayers), specify the same param for all of these layers. (You can also name the biases with a second param, as with the blobs_lr and weight_decay parameters.) A very simple example of this is in src/caffe/test/test_net.cpp, in the unit test named InitDiffDataSharedWeightsNet:

layers: {
  name: 'innerproduct1'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    bias_term: false
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  bottom: 'data'
  top: 'innerproduct1'
}
layers: {
  name: 'innerproduct2'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    bias_term: false
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  bottom: 'data'
  top: 'innerproduct2'
}

This means layers innerproduct1 and innerproduct2 share the same set of weights, as they've both specified param: 'sharedweights'. In this case they also take the same bottom blob (data), so their outputs, the top blobs innerproduct1 and innerproduct2, should be identical (this is not actually something you'd ever want to do; I do it there just for testing purposes).

Note that in this case we specify only one blob name because we've set bias_term: false; if we didn't have bias_term: false we'd need to specify two params, but probably the second one should be empty unless we actually want to share biases. (Specifying the empty string as a param is equivalent to not specifying a param in my implementation.)

param: 'sharedweights'
param: ''
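
For example, a layer that keeps its (unshared) bias term while sharing only its weights could be written roughly like this (a sketch following the example above; the name innerproduct3 is just illustrative):

layers: {
  name: 'innerproduct3'
  type: INNER_PRODUCT
  inner_product_param {
    num_output: 10
    weight_filler {
      type: 'gaussian'
      std: 10
    }
  }
  param: 'sharedweights'
  param: ''
  bottom: 'data'
  top: 'innerproduct3'
}

Here bias_term is left at its default (true), so the first param names the shared weights and the empty second param leaves the bias unshared.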

The entire implementation is in Net::Init, Net::AppendParam, and Net::Update. Init figures out which layer actually "owns" the shared param (the first one to list it), and Update adds the non-owned layers' computed diffs into the diff of the owner blob, then only actually performs updates on owned blobs. Memory-wise, all shared blobs point to the same memory location for the parameter's data, but still have separately allocated diff blobs, since learning rate, weight decay, etc. are still handled per parameter by the Solver (which is blissfully unaware that parameters can be shared).

Open to hearing feedback on the interface, implementation, etc. I'm not sure I'm happy with param as the name of the field; I think param_name or something similar would be less ambiguous, but that would be inconsistent with the other per-parameter field blobs_lr (and to be fully consistent it should really be blobs_name, though I strongly prefer the singular here).

shelhamer mentioned this pull request on Jun 26, 2014
@bhack
Contributor

bhack commented Jun 26, 2014

Does this have any relation to the Composite layer with routing capabilities?

@shelhamer
Member Author

@bhack Not that I can see, although I could have missed it since I've only just taken a cursory glance at the Composite layer.

Caffe already understands DAG models by inserting split layers at forks where a top blob is the bottom blob of more than one layer. With shared weights, one can do the same convolutions, inner products, or whatever on different inputs by defining the layer with the same param name where desired. In this way a composite layer is not needed.
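
For instance, the same convolution can be run on two different inputs like this (a rough sketch; the layer and blob names and the convolution settings are just illustrative):

layers: {
  name: 'conv_level0'
  type: CONVOLUTION
  convolution_param {
    num_output: 16
    kernel_size: 3
    bias_term: false
  }
  param: 'shared_conv_weights'
  bottom: 'pyramid_level0'
  top: 'conv_level0'
}
layers: {
  name: 'conv_level1'
  type: CONVOLUTION
  convolution_param {
    num_output: 16
    kernel_size: 3
    bias_term: false
  }
  param: 'shared_conv_weights'
  bottom: 'pyramid_level1'
  top: 'conv_level1'
}

Both layers read and update a single set of filters.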

That said, it would be nice to have a shorthand in our model definitions with this kind of structure instead of redefining the shared layer over and over with different bottoms and tops. A multiresolution model is a good example: the same convolutions should be done at every level of the pyramid, and it would be more concise to not write this down exhaustively.

The Composite layer raises another point: Caffe could execute layers at the same topological depth simultaneously. At the moment execution is totally serial.

shelhamer changed the title from "weight sharing" to "Weight Sharing" on Jun 26, 2014
@shelhamer
Member Author

I made the changes I suggested at #500 (comment).

@jeffdonahue please review -- if you don't like my follow-up renaming commits, feel free to drop them.

This is otherwise ready for merge IMHO.

@bhack
Contributor

bhack commented Jun 26, 2014

Yes, with shared weights I think we cover almost everything. I agree with you that the composite layer, besides parallel execution, also allows a simpler network notation in YAML, which could probably be adopted in some form in the Caffe protobuf.

@jeffdonahue
Contributor

Cool, thanks for the rebase and name cleanup!

LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
}
}
// Now, update the owned parameters.
Contributor

@shelhamer, if I understood correctly, you add up all the diffs into the owner_diff and then update the parameters accordingly, right? Does that imply that if the layer owning this blob does not contribute to the loss, the diffs computed by the other layers that use this blob but don't own it will not be used?

Contributor

We must either change the ownership from the first layer that mentions the param to the first layer that both mentions the param and participates in the loss, or we need to fix layer_need_backward_[layer_id] accordingly. Is my concern valid?


@ashafaei Was this fixed in the current version?

Member Author

@ashafaei @abhi2610 This was never actually an issue, although it was worth raising since you need to know how the Net, Solver, and Blobs cooperate. The loss / backward logic only decides whether backward is computed for a layer or not. Net::Update() is always called by the solver: all of the shared weight params accumulate their diffs into the owner's diff in the loop at 478, and then Blob::Update() is always called for every weight owner, as in the loop at 501. This does bring up the question of why Blob::Update() is unconditional when it could be skipped, but that's another matter. Thanks @jeffdonahue for the discussion.

@robertsdionne

I see that this pull request adds weight sharing, which recurrent nets will need; however, Caffe takes fixed protocol buffer net descriptors. What would be the best way to implement a mechanism to allow dynamic repetition of sets of layers for input sequences?

For instance, if I want to train a recurrent net for part of speech tagging, for a given sentence with n words, I'll have the following inputs and the following outputs:

inputs
r_0: the initial recurrent state
x_1: the word vector for the first word
x_2: the second word vector
...
x_n: the nth word vector

outputs
t_1: the tag assigned to the first word
t_2: the second tag
...
t_n: the nth tag
r_n+1: the final recurrent state

If the recurrent part has several layers that are repeated for each input word, such as a dropout layer, a concatenation layer, an inner product layer for the recurrent output, a rectified linear layer for the recurrent output, an inner product layer for the classification output, and a softmax layer for the classification output, you can group these into virtual layers that take the following inputs and outputs:

inputs
r_i-1: the previous recurrent state
x_i: the current word vector

outputs
t_i: the current tag
r_i: the current recurrent state

Then, for a given sentence, you could chain together several of these modules made of several layers to feed the recurrent state from one to the next. The repeated configuration could be described by a protocol buffer.
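
With the weight sharing from this PR, two consecutive timesteps of such a module could be unrolled roughly like this (a sketch only; the layer and blob names and dimensions are hypothetical, and the dropout, classification, and softmax layers are omitted for brevity):

# timestep 1: combine r_0 and x_1, then compute the next recurrent state
layers: {
  name: 'concat1'
  type: CONCAT
  bottom: 'r_0'
  bottom: 'x_1'
  top: 'concat1'
}
layers: {
  name: 'recurrent_ip1'
  type: INNER_PRODUCT
  inner_product_param { num_output: 100 bias_term: false }  # state size is illustrative
  param: 'recurrent_weights'
  bottom: 'concat1'
  top: 'r_1'
}
layers: {
  name: 'relu1'
  type: RELU
  bottom: 'r_1'
  top: 'r_1'
}
# timestep 2: the same module again, reusing 'recurrent_weights'
layers: {
  name: 'concat2'
  type: CONCAT
  bottom: 'r_1'
  bottom: 'x_2'
  top: 'concat2'
}
layers: {
  name: 'recurrent_ip2'
  type: INNER_PRODUCT
  inner_product_param { num_output: 100 bias_term: false }
  param: 'recurrent_weights'
  bottom: 'concat2'
  top: 'r_2'
}
layers: {
  name: 'relu2'
  type: RELU
  bottom: 'r_2'
  top: 'r_2'
}

Each additional word repeats the same layers with new blob names but the same param name, which is exactly the repetition a shorthand or generator would have to produce.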

Finally, the entire thing could be seen as one giant virtual layer with a very wide input and a very wide output, the concatenations of the initial recurrent state followed by all the word vectors, and the concatenations of all the tag probabilities followed by the final recurrent state.

This example only works for sequence classification. Recurring over a tree structure would need a different approach.

I've drawn up what I have in mind here:
https://drive.google.com/file/d/0B5t1j58WWjsiYTU1eHBWNFRnLXM/edit?usp=sharing

Another issue is that caffe seems to send training and test data through the layers as matrices of examples rather than individually, for performance reasons. So, with sentences of varying lengths it would probably be best to group them by word count and send these minibatches through appropriately instantiated recurrent nets.

Also, does caffe already have any support for converting text to word vectors for training data?

mitmul pushed a commit to mitmul/caffe that referenced this pull request Sep 30, 2014