activation function in BERTIntermediate #17

Merged · 9 commits · Nov 13, 2018
Conversation

lukovnikov (Contributor)

The activation was previously hardcoded to gelu because the pretrained BERT models use gelu. This change makes BERTIntermediate read its activation function from the config, accepting either a callable or one of the strings "gelu", "relu", or "swish".
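A minimal sketch of the shape of this change, assuming a config object carrying hidden_size, intermediate_size, and a hidden_act field that is either a string or a callable; the names here are illustrative rather than copied from the merged commit:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def gelu(x):
    # erf-based GELU, as used by the original BERT implementation
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * torch.sigmoid(x)

# String keys from the config map to activation callables
ACT2FN = {"gelu": gelu, "relu": F.relu, "swish": swish}

class BERTIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
        # hidden_act may be a string key ("gelu", "relu", "swish") or a callable
        if isinstance(config.hidden_act, str):
            self.intermediate_act_fn = ACT2FN[config.hidden_act]
        else:
            self.intermediate_act_fn = config.hidden_act

    def forward(self, hidden_states):
        return self.intermediate_act_fn(self.dense(hidden_states))
```

Keeping the string-to-function map in one dict means new activations can be added without touching the module itself.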

thomwolf merged commit 8513741 into huggingface:master on Nov 13, 2018
thomwolf (Member)

Looks good, thanks for that!

qwang70 pushed a commit to DRL36/pytorch-pretrained-BERT that referenced this pull request Mar 2, 2019
activation function in BERTIntermediate
@HongyanJiao HongyanJiao mentioned this pull request Sep 19, 2019
stevezheng23 added a commit to stevezheng23/transformers that referenced this pull request Mar 24, 2020
fix at issues in roberta/bert modeling
amathews-amd referenced this pull request in ROCm/transformers Aug 6, 2021
jlamypoirier added a commit to jlamypoirier/transformers that referenced this pull request Apr 4, 2023
jameshennessytempus pushed a commit to jameshennessytempus/transformers that referenced this pull request Jun 1, 2023
jonb377 pushed a commit to jonb377/hf-transformers that referenced this pull request Nov 3, 2023
Summary:
This pull request adds 2D SPMD sharding, sharding both weights and activations. The sharding strategy is as follows.

Let's say we have a 2D mesh (data, model) with data x model == num_devices:
1. input: (data, None, model)
2. embedding: (model, data)
3. attn QKV: (data, model)
4. attn O: (model, data)
5. mlp gate, up: (model, data)
6. mlp down: (data, model)
7. activation: (data, None, model)
Currently you can specify the model dimension with the new --spmd_2d_sharding option; the data dimension is then calculated automatically (num_devices / model). See the sketch after the TODO below.

TODO: maybe we should have another option to specify whether or not we should shard the activations/inputs or shard them differently.
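As a rough illustration of the strategy above, here is a minimal sketch using torch_xla's SPMD API. The model value and the tensors being sharded are assumptions for the example, not the commit's actual code:

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()  # enable the SPMD execution mode

num_devices = xr.global_runtime_device_count()
model = 4                      # assumed value of --spmd_2d_sharding
data = num_devices // model    # data dimension derived automatically
mesh = xs.Mesh(np.arange(num_devices), (data, model), ('data', 'model'))

device = xm.xla_device()

# Weights: e.g. an embedding table sharded (model, data)
embedding = torch.nn.Embedding(32000, 4096).to(device)
xs.mark_sharding(embedding.weight, mesh, ('model', 'data'))

# Activations: a (batch, seq, hidden) tensor sharded (data, None, model)
hidden_states = torch.zeros(8, 512, 4096, device=device)
xs.mark_sharding(hidden_states, mesh, ('data', None, 'model'))
```

Alternating the weight and activation axes this way keeps each matmul's contracting dimension sharded, so partial results are combined with collectives instead of replicating full weights.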