
"Parallel" option for training? Parallel adapter outputs required (without interacting with each other). #223

Closed
leejayyoon opened this issue Aug 24, 2021 · 18 comments · Fixed by #226
Labels: question (Further information is requested)

@leejayyoon

Hello,

Thanks for this nice framework 👍. I might be asking something that isn't yet possible, but I wanted to at least try asking!

I am trying to feed two BERT-based models' outputs into a subsequent NN.
This requires loading two BERT models; however, the memory consumption becomes too high if I do that. To remedy this, I was wondering if I could do something like "Parallel" at training time.
(FYI, I am not trying to dynamically drop the first few layers; I am simply trying to create two BERT forward paths with less memory consumption.)

I understand that active adapters can be switched by set_active_adapters().
(Actually, could you confirm if my understanding is correct?)
But this doesn't seem to fit my purpose, as I need both adapters to output independent representations based on their respective adapters.

Is there any way I can make the adapters not interact with each other on the forward path while not loading the original BERT parameters twice?

  • Making this question even more complex, I also need to make one adapter's parameters non-differentiable while still requiring them in the forward pass.
    Any ideas perhaps? :)
@leejayyoon added the question label Aug 24, 2021
@JoPfeiff
Member

Hey @leejayyoon,

I think you should be able to implement all of what you are trying to achieve out of the box.

Just to clarify: what do you mean by "non-differentiable"? Do you want to freeze the parameters of one adapter and train the parameters of another adapter?

In general I think you will want to use the Parallel functionality. If you add two adapters, the Parallel functionality will loop through the "grouped" batch, and only pass the examples through their respective adapters. The output representations will correspondingly be completely independent of the respective other adapter.

Given that you want a slightly more complicated setup, you probably want something like this:
model.active_adapters = ac.Parallel(adapter1, adapter2)
During training you can then iterate through

# only finetune adapter1, freezing adapter 2
model.train_adapter(adapter1)
...
# only finetune adapter2, freezing adapter 1
model.train_adapter(adapter2)

to only train the parameters of one adapter at a time.
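
For reference, a minimal end-to-end sketch of that alternating setup could look roughly like the following. The adapter names are placeholders, and re-applying the Parallel block after each train_adapter call is only a precaution, since depending on the library version train_adapter may also change the active adapter setup.

import transformers.adapters.composition as ac
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.add_adapter("adapter1")
model.add_adapter("adapter2")

# Run both adapters side by side in every forward pass.
model.active_adapters = ac.Parallel("adapter1", "adapter2")

# Phase 1: train only adapter1; the base model and adapter2 stay frozen.
model.train_adapter("adapter1")
model.active_adapters = ac.Parallel("adapter1", "adapter2")  # re-apply, in case train_adapter changed it
# ... training steps for adapter1 ...

# Phase 2: train only adapter2; adapter1 is now frozen.
model.train_adapter("adapter2")
model.active_adapters = ac.Parallel("adapter1", "adapter2")
# ... training steps for adapter2 ...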

@calpt and @hSterz will be able to help you more on the implementation side.

@leejayyoon
Author

leejayyoon commented Aug 24, 2021

Hi @JoPfeiff,

Thank you for your prompt response!

By non-differentiable, I meant freezing just as you have interpreted.

I see. I just wasn't sure whether the Parallel functionality should be used, because the last line of this link says:

Note that the Parallel block is only intended for inference, not for training adapters.

Is this not the case anymore?

@JoPfeiff
Member

I think the only reason is that it hasn't been tested in training scenarios yet. It should work though.

@leejayyoon
Author

leejayyoon commented Aug 24, 2021

OK, good to hear that from you. However, I would suggest adding a unit test for this in the near future to provide some assurance!

What's the best way to do this? The straightforward way I can think of is training two adapters independently and checking whether Parallel produces the same adapters.
(It would require "identical" initialization rather than just the same random seed? Or do you have a way to initialize to some fixed value for testing?)
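
As an aside, one hypothetical way to get that "identical" initialization in a test would be to reset the random seed right before each adapter is added. This is only a sketch; it assumes the adapter weights are initialized from the global torch RNG and that the adapter name appears in the parameter names.

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Re-seeding before each add_adapter call should give both adapters
# identical initial weights, assuming initialization draws from the global RNG.
torch.manual_seed(42)
model.add_adapter("adapter1")
torch.manual_seed(42)
model.add_adapter("adapter2")

# Sanity check (assumes the adapter name shows up in the parameter names).
params_1 = [p for n, p in model.named_parameters() if ".adapter1." in n]
params_2 = [p for n, p in model.named_parameters() if ".adapter2." in n]
assert all(torch.equal(a, b) for a, b in zip(params_1, params_2))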

@leejayyoon
Author

@JoPfeiff One last clarification question!

In your example of freezing adapters 1 & 2, does model.train_adapter(adapter_name) only freeze the pretrained model's (e.g. BERT) parameters and not affect the rest of the computation graph at all?

Reading the code makes it seem clear, but I wanted confirmation to be sure.
In train_adapter() --> self.freeze_model(True), i.e. this line of code:
https://github.com/Adapter-Hub/adapter-transformers/blob/82a3d80a98d33610f38746ff72344c9d2fd66336/src/transformers/adapters/model_mixin.py#L515
I am assuming self.base_model refers to the pretrained-model part.

@leejayyoon
Author

leejayyoon commented Aug 27, 2021

@calpt and @hSterz

Would you have any answers to my question above? Also, do you plan to add unit tests for the Parallel functionality at some point?

@leejayyoon
Author

Checking in again to get an answer to this question! Thank you in advance for your time! @JoPfeiff

@JoPfeiff One last clarification question!

In your example of freezing adapters 1 & 2, does model.train_adapter(adapter_name) only freeze the pretrained model's (e.g. BERT) parameters and not affect the rest of the computation graph at all?

Reading the code makes it seem clear, but I wanted confirmation to be sure.
In train_adapter() --> self.freeze_model(True), i.e. this line of code:
https://github.com/Adapter-Hub/adapter-transformers/blob/82a3d80a98d33610f38746ff72344c9d2fd66336/src/transformers/adapters/model_mixin.py#L515
I am assuming self.base_model refers to the pretrained-model part.

@JoPfeiff
Member

model.train_adapter([ada1, ada2]) first freezes all parameters in the module base_model and then reactivates all adapters in the list, in this case ada1 and ada2.
base_model includes all transformer weights as well as the embedding layer. It does not include the prediction head.
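
A quick way to confirm which weights end up trainable after such a call is to check the requires_grad flags directly. A small sketch (the adapter names are just placeholders):

model.train_adapter(["ada1", "ada2"])

# Everything still requiring gradients should belong to the two adapters
# (plus any prediction head, since that is not part of base_model).
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print("\n".join(trainable))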

@hSterz linked a pull request Aug 31, 2021 that will close this issue
@hSterz
Member

hSterz commented Aug 31, 2021

I looked into it, and the Parallel block should work for training out of the box. The only exception is the Trainer class, which does not work in the current version: the output with parallel adapters is currently just a list of the outputs of the parallel heads without a combined loss, but the Trainer class requires such a loss.
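
Concretely, if you stay outside the Trainer class, a plain PyTorch loop can pick out the loss of whichever head you are currently training. The following is only a sketch: it reuses the model, the ac import, and the adapter names from the earlier sketch, and assumes one prediction head per adapter, an existing dataloader, and that each per-head output carries its own loss when labels are passed.

import torch

# Freeze everything except adapter1, then keep both adapters active in parallel.
model.train_adapter("adapter1")
model.active_adapters = ac.Parallel("adapter1", "adapter2")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

for batch in dataloader:
    outputs = model(**batch)   # assumed: one output object per parallel head
    loss = outputs[0].loss     # loss of the head paired with adapter1
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()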

@leejayyoon
Author

@hSterz Thanks for the reply.

I am not sure I fully understood what you meant.
Are you saying there needs to be an extra modification to the Parallel block for it to have a backpropagatable loss function?
For my usage, I don't need to update the two adapters simultaneously; rather, I need to update them in an alternating fashion.

I looked into it, and the Parallel block should work for training out of the box. The only exception is the Trainer class, which does not work in the current version: the output with parallel adapters is currently just a list of the outputs of the parallel heads without a combined loss, but the Trainer class requires such a loss.

@hSterz
Member

hSterz commented Sep 1, 2021

Yes, currently there is only a separate backpropagatable loss for each adapter in the Parallel block (which can't be handled by the Trainer class). But from what I understand, this could be sufficient for your case.

@leejayyoon
Author

@hSterz I see. Thanks for the prompt response! (I somehow missed the notification in my email.)

It's good that no problems are expected.
I asked this earlier, but are you planning any unit tests for training with the Parallel block?

For additional background: I would be using this within the AllenNLP framework; I don't know whether that would change the trainer story.

@hSterz
Member

hSterz commented Sep 5, 2021

Yes, we are planning to add unit tests for Parallel blocks.

I am not that familiar with the AllenNLP framework. From a quick look, its trainer (like the GradientDescentTrainer) seems to have a similar problem to the adapter trainer, because the model output has no aggregated loss attribute.

@leejayyoon
Author

@hSterz Thanks for patiently answering my questions. 👍
I'll keep an eye out for the unit tests. Even better if you can post an update here.

cheers!

@hSterz
Member

hSterz commented Sep 14, 2021

The unit tests are now merged into the master branch.

@leejayyoon
Author

Thank you @hSterz 👍

@leejayyoon
Author

@hSterz I actually skimmed through some of the commits you made. Looking at test_adapter_compositions.py, it seems that not only has Parallel gone through unit tests, but you can now also backpropagate to multiple adapters together. Is this correct?

@hSterz
Member

hSterz commented Sep 15, 2021

Yes, a new MultiHeadOutput class was added which contains the sum of the individual losses of the heads. That allows us to backpropagate through multiple parallel adapters together.
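
For example, with that summed loss a single backward pass can update both parallel adapters at once. The following is only a sketch; it assumes the combined output exposes the summed loss as a loss attribute and reuses the model, ac import, adapter names, and batch from the sketches above.

# Unfreeze both adapters and keep them active in parallel.
model.train_adapter(["adapter1", "adapter2"])
model.active_adapters = ac.Parallel("adapter1", "adapter2")

outputs = model(**batch)   # assumed: combined output of the parallel heads
outputs.loss.backward()    # summed loss -> gradients flow into both adapters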
