Add Support for Electra #400
Conversation
Hey @amitkumarj441, thanks a lot for working on this! While there are still a few tests failing, I did a partial review of your changes and left some comments. All in all, it looks very good.
Besides fixing the missing tests, please have a look at our contribution guide for the required documentation steps for a new model. Let me know if anything is unclear or you need any assistance from our side!
self.add_prediction_head(head, overwrite_ok=overwrite_ok)


class ElectraModelWithHeads(ElectraAdapterModel):
The model classes of the form XModelWithHeads are deprecated, so we don't want to add those classes for newly supported architectures. ElectraAdapterModel should be used for all cases. Please remove this class.
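For illustration, a minimal sketch of how the flex-head model is used once ElectraAdapterModel is in place, assuming the usual adapter-transformers API (add_adapter, add_classification_head, set_active_adapters); the checkpoint name and task name are only examples:

from transformers.adapters import ElectraAdapterModel

# Load the base model; prediction heads are added dynamically instead of
# instantiating a dedicated ...ModelWithHeads class.
model = ElectraAdapterModel.from_pretrained("google/electra-base-discriminator")

# Add a task adapter plus a matching classification head and activate them.
model.add_adapter("sst2")
model.add_classification_head("sst2", num_labels=2)
model.set_active_adapters("sst2")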
@@ -401,6 +401,16 @@
    },
    "layers": {"classifier"},
},
# Electra
Since Electra also provides other task-specific model classes (e.g. ElectraForTokenClassification) in its modeling file, it would be great to also have conversions for those here.
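A hedged sketch of what one such additional entry might look like, mirroring the structure of the map shown in the diff above; the head_type value and layer names are assumptions and would need to be checked against ElectraForTokenClassification:

# Assumed entry, following the pattern of the existing conversion entries
"ElectraForTokenClassification": {
    "config": {
        "head_type": "tagging",
    },
    "layers": {"classifier"},
},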
from ..model_mixin import InvertibleAdaptersMixin, ModelAdaptersMixin


# For backwards compatibility, ElectraSelfOutput inherits directly from AdapterLayer
# For backwards compatibility, ElectraSelfOutput inherits directly from AdapterLayer
super().__init__("mh_adapter", None)


# For backwards compatibility, ElectraOutput inherits directly from AdapterLayer
# For backwards compatibility, ElectraOutput inherits directly from AdapterLayer
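Taken together, the two mixins would presumably follow the BERT pattern in this code base; below is a sketch under that assumption (class names and the import path are guesses, not taken from the PR):

from ..layer import AdapterLayer


# For backwards compatibility, ElectraSelfOutput inherits directly from AdapterLayer
class ElectraSelfOutputAdaptersMixin(AdapterLayer):
    """Adds adapters to the ElectraSelfOutput module."""

    def __init__(self):
        super().__init__("mh_adapter", None)


# For backwards compatibility, ElectraOutput inherits directly from AdapterLayer
class ElectraOutputAdaptersMixin(AdapterLayer):
    """Adds adapters to the ElectraOutput module."""

    def __init__(self):
        super().__init__("output_adapter", None)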
Thanks @calpt for reviewing this PR. I will make changes as suggested soon.
Any update soon?
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.query = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config)
self.key = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config)
self.value = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config)
In BERT, there is the attn_key param:
self.query = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config, attn_key="q")
self.key = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config, attn_key="k")
self.value = LoRALinear(config.hidden_size, self.all_head_size, "selfattn", config, attn_key="v")
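For context (an assumption about the purpose of attn_key, not stated in this thread): the key tells each LoRALinear which attention projection it wraps, so LoRA weights are only injected into the matrices selected in the adapter config, e.g.:

from transformers.adapters import ElectraAdapterModel, LoRAConfig

model = ElectraAdapterModel.from_pretrained("google/electra-base-discriminator")
# Only the query and value projections receive LoRA weights here; the attn_key
# set on each LoRALinear is what makes this selection possible.
model.add_adapter("lora_sst2", config=LoRAConfig(r=8, alpha=16, attn_matrices=["q", "v"]))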
@@ -295,6 +306,7 @@ def forward(
    # if encoder bi-directional self-attention `past_key_value` is always `None`
    past_key_value = (key_layer, value_layer)

key_layer, value_layer, attention_mask = self.prefix_tuning(key_layer, value_layer, attention_mask)
key_layer, value_layer, attention_mask = self.prefix_tuning(key_layer, value_layer, hidden_states, attention_mask)
Missing hidden_states param, maybe?
Closing in favor of #583.
Supports the architecture proposed in ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.