"Parallel" option for training? Parallel adapter outputs required (without interacting with each other). #223
Comments
Hey @leejayyoon, I think you should be able to implement all of what you are trying to achieve out of the box. Just to clarify: what do you mean by "non-differentiable"? Do you want to freeze the parameters of one adapter and train the parameters of another adapter? In general I think you will want to use the Parallel functionality. If you add two adapters, the Parallel block runs both of them side by side without them interacting. Given that you want a slightly more complicated setup, you probably want something like the setup sketched below, to only train the parameters of one adapter at a time. @calpt and @hSterz will be able to help you more on the implementation side.
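A minimal sketch of such a setup, assuming the adapter-transformers package (which extends transformers with adapter support); the adapter and head names are placeholders, and exact import paths may differ between versions:

```python
from transformers import AutoModelWithHeads
from transformers.adapters.composition import Parallel

model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Two adapters (with matching prediction heads) sharing the same BERT body.
model.add_adapter("adapter1")
model.add_adapter("adapter2")
model.add_classification_head("adapter1", num_labels=2)
model.add_classification_head("adapter2", num_labels=2)

# Freeze everything except adapter1 so that only its parameters are trained.
model.train_adapter("adapter1")

# Run both adapters side by side on every forward pass; each produces its own
# representation without interacting with the other.
model.set_active_adapters(Parallel("adapter1", "adapter2"))
```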
Hi @JoPfeiff, thank you for your prompt response! By non-differentiable, I meant freezing, just as you have interpreted. I see. I just wasn't sure whether the Parallel functionality should be used, as the last line of this link says it is only intended for inference. Is this not the case anymore?
I think the only reason is that it hasn't been tested in training scenarios yet. It should work though.
OK, good to hear that from you. However, I would suggest adding a unit test for this in the near future to provide some assurance! What's the best way to do this? The straightforward way I can think of is training two adapters independently and checking whether Parallel produces the same adapters.
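A rough sketch of the comparison being proposed, assuming identical data ordering, seeds, and hyperparameters in both training runs; the helper below is hypothetical (not part of the library) and simply matches adapter parameters by name, which is how adapter-transformers names them by default:

```python
import torch

def adapter_weights_match(model_a, model_b, adapter_name, atol=1e-6):
    """Compare only the parameters that belong to the given adapter."""
    params_a = dict(model_a.named_parameters())
    params_b = dict(model_b.named_parameters())
    adapter_keys = [k for k in params_a if f".{adapter_name}." in k]
    return all(
        torch.allclose(params_a[k], params_b[k], atol=atol) for k in adapter_keys
    )

# Intended usage (models and training routine not shown): train "adapter1"
# once on its own and once inside a Parallel block, then:
# assert adapter_weights_match(model_single, model_parallel, "adapter1")
```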
@JoPfeiff One last clarification question! In your example of freezing adapters 1 & 2, does everything else stay frozen during training as well? Reading the code makes it clear, but I wanted confirmation to be sure.
Checking in again to get an answer to this question! Thank you for your time in advance! @JoPfeiff
I looked into it, and the Parallel block should work for training out-of-the-box. It only does not work in the current version if you want to use the Trainer class: the output with parallel adapters is currently just a list of the outputs of the parallel heads without a combined loss, but the Trainer class requires such a loss.
@hSterz Thanks for the reply. I am not sure if I fully understood what you meant. Do you mean that each adapter in the parallel block produces its own loss that can be backpropagated separately?
Yes, currently there is only a separate backpropagatable loss for each adapter in the parallel block (which can't be handled by the Trainer class). But from what I understand this could be sufficient for your case.
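Continuing the sketch from earlier in the thread, a bare-bones loop that handles the per-head losses itself instead of going through the Trainer class. This assumes the parallel forward pass returns one output per head, each exposing its own loss when labels are provided, and that train_dataloader is a placeholder DataLoader of tokenized batches with labels:

```python
import torch

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

for batch in train_dataloader:
    outputs = model(**batch)                  # one output per parallel head
    loss = sum(out.loss for out in outputs)   # combine the per-head losses manually
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```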
@hSterz I see, thanks for the prompt response! (I somehow missed this notification in my email.) It's nice that there isn't any expected problem. For additional background: I would be using it in the AllenNLP framework; I don't know whether that would change the Trainer story.
Yes, we are planning to add unit tests for parallel blocks. I am not that familiar with the AllenNLP framework. From a quick look, its trainer (like the GradientDescentTrainer) seems to have a similar problem to the adapter Trainer because the model output has no aggregated loss attribute.
@hSterz Thanks for patiently answering my questions. 👍 Cheers!
The unit tests are now merged into the master branch.
Thank you @hSterz 👍
@hSterz I actually skimmed through some of the commits you made. Looking at test_adapter_compositions.py, it seems that not only has Parallel gone through unit tests, but you can now also backpropagate to multiple adapters together. Is this correct?
Yes, a new MultiHeadOutput class was added which contains the sum of the individual losses of the heads. That allows us to backpropagate multiple parallel adapters together.
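Since the combined output now carries an aggregated loss, the standard Trainer setup should also work for a parallel block. A minimal hedged sketch, reusing the model from the earlier sketch and assuming a tokenized train_dataset placeholder:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out", num_train_epochs=1)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```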
Hello,
Thanks for this nice framework 👍. I might be asking for something that isn't possible yet, but I wanted to at least try asking!
I am trying to feed the outputs of two BERT-based models into a subsequent NN.
This requires loading two BERT models; however, the memory consumption becomes too high if I do that. To remedy this, I was wondering if I could do something like "Parallel" at training time.
(FYI, I am not trying to dynamically drop the first few layers; I am simply trying to create two BERT forward paths with lower memory consumption.)
I understand that active adapters can be switched by set_active_adapters(). (Actually, could you confirm if my understanding is correct?)
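For reference, a minimal sketch of that switching behaviour, assuming the adapter-transformers API and placeholder adapter names; set_active_adapters() selects which adapter is used on the next forward pass, so both adapters can share one BERT, just not within the same pass:

```python
# Two forward passes over the same shared BERT, switching the active adapter.
model.set_active_adapters("adapter1")
output_1 = model(**batch)   # representation produced through adapter1

model.set_active_adapters("adapter2")
output_2 = model(**batch)   # second pass, now through adapter2
```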
But this doesn't seem to fit my purpose since, in my case, I need both adapters to output independent representations based on their respective adapters.
Is there any way I can make the adapters not interact with each other on the forward path while not loading the original BERT parameters twice?
Any ideas perhaps? :)