
"Parallel" option for training? Parallel adapter outputs required (without interacting with each other). #223

Closed
leejayyoon opened this issue Aug 24, 2021 · 18 comments · Fixed by #226
Labels: question (Further information is requested)

@leejayyoon

Hello,

Thanks for this nice framework 👍. I might be asking something that isn't yet possible, but I wanted to at least try asking!

I am trying to feed two BERT-based models' outputs into a subsequent NN.
This requires loading two BERT models; however, the memory consumption becomes too high if I do that. To remedy this, I was wondering if I could do something like "Parallel" at training time.
(FYI, I am not trying to dynamically drop the first few layers; I am simply trying to create two BERT forward paths with less memory consumption.)

I understand that active adapters can be switched by set_active_adapters().
(Actually, could you confirm if my understanding is correct?)
But this doesn't seem to fit my purpose, as I need both adapters to output independent representations based on their respective adapters.

Is there any way I can make the adapters not interact with each other on the forward path while not loading the original BERT parameters twice?

  • Making this question even more complex, I also need to make one adapter's parameters non-differentiable while still requiring them in the forward pass.
    Any ideas perhaps? :)
@leejayyoon added the question label Aug 24, 2021
@JoPfeiff
Member

Hey @leejayyoon,

I think you should be able to implement all of what you are trying to achieve out of the box.

Just to clarify: what do you mean by "non-differentiable"? Do you want to freeze the parameters of one adapter and train the parameters of another adapter?

In general I think you will want to use the Parallel functionality. If you add two adapters, the Parallel functionality will loop through the "grouped" batch, and only pass the examples through their respective adapters. The output representations will correspondingly be completely independent of the respective other adapter.

Given that you want a slightly more complicated setup, you probably want something like this:
model.active_adapters = ac.Parallel(adapter1, adapter2)
During training you can then iterate through

# only finetune adapter1, freezing adapter 2
model.train_adapter(adapter1)
...
# only finetune adapter2, freezing adapter 1
model.train_adapter(adapter2)

to only train the parameters of one adapter at a time.
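
For reference, a minimal end-to-end sketch of that alternating setup could look roughly like the following. The adapter names are placeholders, and re-applying the Parallel block after each train_adapter call is only a precaution, since depending on the library version train_adapter may also change the active adapter setup.

import transformers.adapters.composition as ac
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
model.add_adapter("adapter1")
model.add_adapter("adapter2")

# Run both adapters side by side in every forward pass.
model.active_adapters = ac.Parallel("adapter1", "adapter2")

# Phase 1: train only adapter1; the base model and adapter2 stay frozen.
model.train_adapter("adapter1")
model.active_adapters = ac.Parallel("adapter1", "adapter2")  # re-apply, in case train_adapter changed it
# ... training steps for adapter1 ...

# Phase 2: train only adapter2; adapter1 is now frozen.
model.train_adapter("adapter2")
model.active_adapters = ac.Parallel("adapter1", "adapter2")
# ... training steps for adapter2 ...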

@calpt and @hSterz will be able to help you more on the implementation side.

@leejayyoon
Author

leejayyoon commented Aug 24, 2021

Hi @JoPfeiff,

Thank you for your prompt response!

By non-differentiable, I meant freezing just as you have interpreted.

I see. I just wasn't sure whether the Parallel functionality should be used, because the last line of this link says:

Note that the Parallel block is only intended for inference, not for training adapters.

Is this not the case anymore?

@JoPfeiff
Member

I think the only reason is that it hasn't been tested in training scenarios yet. It should work though.

@leejayyoon
Author

leejayyoon commented Aug 24, 2021

OK, good to hear that from you. However, I would suggest adding a unit test for this in the near future to provide some assurance!

What's the best way to do this? The straightforward way I can think of is training two adapters independently and checking whether Parallel produces the same adapters.
(It would require "identical" initialization rather than just the same random seed? Or do you have a way to initialize to some fixed value for testing?)
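
As an aside, one hypothetical way to get that "identical" initialization in a test would be to reset the random seed right before each adapter is added. This is only a sketch; it assumes the adapter weights are initialized from the global torch RNG and that the adapter name appears in the parameter names.

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Re-seeding before each add_adapter call should give both adapters
# identical initial weights, assuming initialization draws from the global RNG.
torch.manual_seed(42)
model.add_adapter("adapter1")
torch.manual_seed(42)
model.add_adapter("adapter2")

# Sanity check (assumes the adapter name shows up in the parameter names).
params_1 = [p for n, p in model.named_parameters() if ".adapter1." in n]
params_2 = [p for n, p in model.named_parameters() if ".adapter2." in n]
assert all(torch.equal(a, b) for a, b in zip(params_1, params_2))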

@leejayyoon
Author

@JoPfeiff One last clarification question!

In your example of freezing adapters 1 & 2, does model.train_adapter(adapter_name) only freeze the pretrained model's (e.g. BERT) parameters and not affect the rest of the computation graph at all?

Reading the code makes it seem clear, but I wanted confirmation to be sure.
In train_adapter() --> self.freeze_model(True), i.e. this line of code:
https://github.com/Adapter-Hub/adapter-transformers/blob/82a3d80a98d33610f38746ff72344c9d2fd66336/src/transformers/adapters/model_mixin.py#L515
I am assuming self.base_model refers to the pretrained-model part.

@leejayyoon
Author

leejayyoon commented Aug 27, 2021

@calpt and @hSterz

Would you have any answers to my question above? Also, do you plan to add unit tests for the Parallel functionality at some point?

@leejayyoon
Author

Checking in again to get an answer to this question! Thank you in advance for your time! @JoPfeiff

@JoPfeiff One last clarification question!

In your example of freezing adapters 1 & 2, does model.train_adapter(adapter_name) only freeze the pretrained model's (e.g. BERT) parameters and not affect the rest of the computation graph at all?

Reading the code makes it seem clear, but I wanted confirmation to be sure.
In train_adapter() --> self.freeze_model(True), i.e. this line of code:
https://github.com/Adapter-Hub/adapter-transformers/blob/82a3d80a98d33610f38746ff72344c9d2fd66336/src/transformers/adapters/model_mixin.py#L515
I am assuming self.base_model refers to the pretrained-model part.

@JoPfeiff
Member

model.train_adapter([ada1, ada2]) first freezes all parameters in the module base_model and then reactivates all adapters in the list, in this case ada1 and ada2.
base_model includes all transformer weights as well as the embedding layer. It does not include the prediction head.
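
A quick way to confirm which weights end up trainable after such a call is to check the requires_grad flags directly. A small sketch (the adapter names are just placeholders):

model.train_adapter(["ada1", "ada2"])

# Everything still requiring gradients should belong to the two adapters
# (plus any prediction head, since that is not part of base_model).
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print("\n".join(trainable))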

@hSterz linked a pull request Aug 31, 2021 that will close this issue
@hSterz
Member

hSterz commented Aug 31, 2021

I looked into it, and the Parallel block should work for training out of the box. The only exception is the Trainer class, which does not work in the current version: the output with parallel adapters is currently just a list of the outputs of the parallel heads without a combined loss, but the Trainer class requires such a loss.
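
Concretely, if you stay outside the Trainer class, a plain PyTorch loop can pick out the loss of whichever head you are currently training. The following is only a sketch: it reuses the model, the ac import, and the adapter names from the earlier sketch, and assumes one prediction head per adapter, an existing dataloader, and that each per-head output carries its own loss when labels are passed.

import torch

# Freeze everything except adapter1, then keep both adapters active in parallel.
model.train_adapter("adapter1")
model.active_adapters = ac.Parallel("adapter1", "adapter2")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

for batch in dataloader:
    outputs = model(**batch)   # assumed: one output object per parallel head
    loss = outputs[0].loss     # loss of the head paired with adapter1
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()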

@leejayyoon
Author

@hSterz Thanks for the reply.

I am not sure I fully understood what you meant.
Are you saying there needs to be an extra modification to the Parallel block for it to have a backpropagatable loss function?
For my usage, I don't need to update the two adapters simultaneously; rather, I need to update them in an alternating fashion.

I looked into it, and the Parallel block should work for training out of the box. The only exception is the Trainer class, which does not work in the current version: the output with parallel adapters is currently just a list of the outputs of the parallel heads without a combined loss, but the Trainer class requires such a loss.

@hSterz
Member

hSterz commented Sep 1, 2021

Yes, currently there is only a separate backpropagatable loss for each adapter in the Parallel block (which can't be handled by the Trainer class). But from what I understand, this could be sufficient for your case.

@leejayyoon
Author

@hSterz I see. Thanks for the prompt response! (I somehow missed the notification in my email.)

It's good that no problems are expected.
I asked this earlier, but are you planning any unit tests for training with the Parallel block?

For additional background: I would be using this within the AllenNLP framework; I don't know whether that would change the trainer story.

@hSterz
Member

hSterz commented Sep 5, 2021

Yes, we are planning to add unit tests for Parallel blocks.

I am not that familiar with the AllenNLP framework. From a quick look, its trainer (like the GradientDescentTrainer) seems to have a similar problem to the adapter trainer, because the model output has no aggregated loss attribute.

@leejayyoon
Author

@hSterz Thanks for patiently answering my questions. 👍
I'll keep an eye out for the unit tests. Even better if you can post an update here.

cheers!

@hSterz
Member

hSterz commented Sep 14, 2021

The unit tests are now merged into the master branch.

@leejayyoon
Author

Thank you @hSterz 👍

@leejayyoon
Author

@hSterz I actually skimmed through some of the commits you made. Looking at test_adapter_compositions.py, it seems that not only has Parallel gone through unit tests, but you can now also backpropagate to multiple adapters together. Is this correct?

@hSterz
Member

hSterz commented Sep 15, 2021

Yes, a new MultiHeadOutput class was added which contains the sum of the individual losses of the heads. That allows us to backpropagate through multiple parallel adapters together.
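
For example, with that summed loss a single backward pass can update both parallel adapters at once. The following is only a sketch; it assumes the combined output exposes the summed loss as a loss attribute and reuses the model, ac import, adapter names, and batch from the sketches above.

# Unfreeze both adapters and keep them active in parallel.
model.train_adapter(["adapter1", "adapter2"])
model.active_adapters = ac.Parallel("adapter1", "adapter2")

outputs = model(**batch)   # assumed: combined output of the parallel heads
outputs.loss.backward()    # summed loss -> gradients flow into both adapters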
