Improve / Fix Weight Sharing #1211
Comments
I just pushed a unit test for resuming from saved weights (4dc5bd0). It passes as expected, but fails when cherry-picked onto 8dac339, from before #594 was merged. Glad this was magically fixed, thanks @longjon!
Would you consider tied weights as well? I have tried to implement them myself, but with the current weight sharing scheme it seemed too complicated.
@ducha-aiki What is the difference between tied weights and shared weights? @shelhamer I can look into dying if fillers are defined where parameters are shared, if you tell me what the "Caffe way of dying" is (LOG(FATAL) and then what?).
@rodrigob Tied weights are used in autoencoders: if the encoder weights are W, then the decoder weights are W^T, i.e. the transpose.
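For concreteness, the usual single-layer autoencoder setup (the notation here is purely illustrative, with s some nonlinearity):

$$h = s(W x + b), \qquad \hat{x} = s(W^\top h + c)$$

so the decoder reuses the encoder's W transposed instead of learning a separate weight matrix; only W, b, and c are learned.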
While blobs can be shared permissively, so that they only need the same total number of elements rather than the same dimensions, that doesn't cover W, W^T pairs: permissively sharing the blob into an inner product layer with the input and output dimensions swapped just reinterprets the same memory, which is not the transpose.
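A tiny concrete example of why reinterpreting the same buffer is not a transpose (the values keep their row-major order instead of being rearranged):

$$W = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \;\rightarrow\; \text{reinterpreted as } 3 \times 2: \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix} \;\neq\; W^\top = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$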
@shelhamer But the weights are in a different order in the transposed matrix. I will check again, but when I tried it, it did not work.
Yeah, it would not work for pairs of inner product layers where the weights are transposed (using permissive sharing would probably give very bad results). It would require a little bit of additional implementation -- probably the easiest would be to add a "transposed weights" option to the inner product layer so that the layer pair could use the same weight matrix.
@jeffdonahue That part is easy. The real problem is the diffs, since they differ not only in shape but in number of elements.
What? Why would the diffs be a different number of elements? I think I'm missing something...
@jeffdonahue Because size of diff == size of output.
Right, the encode1 weights are 1000x784 (producing 1000D outputs from 784D inputs) and the decode1 weights have the transposed dimension, 784x1000 (producing 784D outputs from 1000D inputs). The weight gradients are the same dimension by definition.
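Spelling out the shapes (ignoring biases and nonlinearities, purely as an illustration): for the encoder $h = W x$ with $W \in \mathbb{R}^{1000 \times 784}$, the weight gradient $\frac{\partial L}{\partial W} = \frac{\partial L}{\partial h} x^\top$ is $1000 \times 784$. For the decoder $\hat{x} = W^\top h$, the gradient with respect to its own $784 \times 1000$ weight matrix is $\frac{\partial L}{\partial \hat{x}} h^\top$, which is $784 \times 1000$; its transpose, $h \left(\frac{\partial L}{\partial \hat{x}}\right)^\top$, is the $1000 \times 784$ contribution that would be accumulated into the shared $\frac{\partial L}{\partial W}$. So the weight diffs do line up once the decoder's term is transposed; only the top (output) diffs differ in size.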
We should keep #1659 in mind too.
Mocha has TiedInnerProductLayer (docs: http://mochajl.readthedocs.org/en/latest/user-guide/layers/computation-layer.html#TiedInnerProductLayer, source: https://github.com/pluskid/Mocha.jl/blob/master/src/layers/tied-inner-product.jl). I guess Caffe could be similar, along the lines of @jeffdonahue's suggestion to add a "transposed weights" option to the inner product layer.
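A rough prototxt sketch of what that option might look like; the transpose field and its semantics are assumptions here, not something InnerProductLayer supports as of this discussion. The idea is that the decoder keeps its weights in the encoder's (num_output x input_dim) layout and applies W^T in its forward pass, so the blob named "tied_w" can be shared strictly:

```
# Encoder: 784 -> 1000; weight blob shape (1000, 784).
layer {
  name: "encode1"
  type: "InnerProduct"
  bottom: "data"
  top: "encode1"
  param { name: "tied_w" }   # shared by name with decode1 below
  inner_product_param {
    num_output: 1000
    weight_filler { type: "xavier" }
  }
}
# Decoder: 1000 -> 784; with the assumed transpose option it would store
# its weights in the same (1000, 784) layout and multiply by the transpose,
# so strict sharing of "tied_w" would just work.
layer {
  name: "decode1"
  type: "InnerProduct"
  bottom: "encode1"
  top: "decode1"
  param { name: "tied_w" }
  inner_product_param {
    num_output: 784
    transpose: true          # hypothetical field, per the suggestion above
  }
}
```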
Do we have an update on these? Shared weights are very important for recurrent nets.
Hi, do we have an update on the 7th problem mentioned above? "Only the owner should initialize weights. Currently unnecessary work and memory is expended filling all weights, and then these are discarded to share with the weight owners." I am currently running into a memory problem with multiple FC layers that share weights, and I believe it is because, even though the weights are shared between those FC layers, each one is still initialized and takes extra memory when the network is created. Any idea on a workaround for this would be greatly appreciated! Thanks!
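For reference, a minimal sketch of how the sharing itself is declared, by giving the parameter blobs the same param names (recent prototxt syntax; the layer names, bottoms, and sizes are made up, and both inputs must have the same dimension so the shared 4096x4096 weight blob fits both layers):

```
layer {
  name: "fc1"
  type: "InnerProduct"
  bottom: "feat_a"                            # assumed 4096-D input
  top: "fc1"
  param { name: "fc_shared_w" lr_mult: 1 }
  param { name: "fc_shared_b" lr_mult: 2 }
  inner_product_param {
    num_output: 4096
    weight_filler { type: "xavier" }
  }
}
layer {
  name: "fc2"
  type: "InnerProduct"
  bottom: "feat_b"                            # assumed 4096-D input
  top: "fc2"
  param { name: "fc_shared_w" lr_mult: 1 }    # same names => blobs shared with fc1
  param { name: "fc_shared_b" lr_mult: 2 }
  inner_product_param {
    num_output: 4096
    # Per the problem quoted above, this layer's own blobs are still
    # allocated and filled at net creation before being replaced by the
    # owner's, which is the wasted work/memory being discussed.
  }
}
```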
Weight sharing as-is relies on a weight owner with which shared layers share their parameter blobs. This poses a few problems in relation to loss, loading and saving parameters, and weight initialization that are listed here for addressing.
@jeffdonahue @longjon