Mixtral enablement. #120
Conversation
…t's moving. But the outputs don't make sense yet because the weights are not loaded yet.
…verter with qkv fusion.
…or loading pth file.
quantization
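For context on the "converter with qkv fusion" commit above: fusion here means concatenating the separate query/key/value projection weights into one tensor at conversion time, so the model can run a single matmul for all three projections. A minimal hypothetical sketch; the `fuse_qkv` helper and key names are illustrative, not the PR's actual converter code:

```python
import torch

def fuse_qkv(state_dict: dict, num_layers: int) -> dict:
    """Hypothetical sketch: replace per-layer wq/wk/wv weights with one fused wqkv tensor."""
    fused = dict(state_dict)
    for i in range(num_layers):
        prefix = f"layers.{i}.attention."
        wq = fused.pop(prefix + "wq.weight")
        wk = fused.pop(prefix + "wk.weight")
        wv = fused.pop(prefix + "wv.weight")
        # Concatenate along the output dimension so a single matmul produces q, k, and v,
        # which the model later splits back apart using the known q/k/v sizes.
        fused[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv], dim=0)
    return fused
```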
Please make sure the name is mixtral and not mistral. We might add Mistral 7B (the non-MoE version) later, so it would be confusing.
README.md
Outdated
## Run weight safetensor convert

```bash
export input_ckpt_dir=Original llama weights directory
export output_ckpt_dir=The output directory
-export model_name="llama-3" # or "llama-2", "gemma"
+export model_name="llama-3" # or "llama-2", "gemma", "mistral"
```
change this to mixtral
Thanks. I was confused about the name initially, and that's why there are mixes of mistral and mixtral. I also changed everything to Mixtral. Done.
torch.empty(config.num_experts, config.intermediate_size, config.dim)
)

def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
I had a change to use different logic for longer seqlen and I pushed it to your branch; was that lost in the merge?
Same question for the quantized change.
This is the original model. Your changes are in model.py.
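For readers following the thread: the hunk above declares the stacked per-expert weights (shape `num_experts × intermediate_size × dim`) and the conditional feed-forward's `forward(x, expert_indices)`. Below is a minimal illustrative sketch of how such a routed forward can work, in the style of gpt-fast's MoE layer; the class name, the w2/w3 shapes, and the per-token expert count are assumptions, not necessarily the exact code in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor

class ConditionalFeedForward(nn.Module):
    """Illustrative MoE expert FFN: applies only the experts chosen for each token."""

    def __init__(self, num_experts: int, intermediate_size: int, dim: int):
        super().__init__()
        # Each projection is stacked across experts into one tensor.
        self.w1 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))
        self.w2 = nn.Parameter(torch.empty(num_experts, dim, intermediate_size))
        self.w3 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))

    def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
        # x: [T, dim]; expert_indices: [T, A], where A is experts activated per token.
        w1 = self.w1[expert_indices]  # [T, A, intermediate_size, dim]
        w3 = self.w3[expert_indices]  # [T, A, intermediate_size, dim]
        w2 = self.w2[expert_indices]  # [T, A, dim, intermediate_size]
        # Gated SwiGLU per selected expert, then project back to dim.
        x1 = F.silu(torch.einsum("ti,taoi->tao", x, w1))
        x3 = torch.einsum("ti,taoi->tao", x, w3)
        return torch.einsum("tao,taio->tai", x1 * x3, w2)  # [T, A, dim]
```

The "different logic for longer seqlen" mentioned above presumably refers to the fact that gathering expert weights per token like this is reasonable during decode (a handful of tokens), while for long prefill sequences it is usually cheaper to loop over experts and process only the tokens routed to each one.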
Thanks for adding Mixtral, the code is clean and overall looks good!
"layers.{}.attention.wk.weight": "layers.{}.attention.wk.weight", | ||
"layers.{}.attention.wv.weight": "layers.{}.attention.wv.weight", | ||
"layers.{}.attention.wo.weight": "layers.{}.attention.wo.weight", | ||
"layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1", |
Looks like only these weight names are different; can we store only the differing names in the map?
Good point, removed
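To make the suggestion concrete: one way to store only the differing names is to keep a small override map and fall back to the identity for everything else. A hypothetical sketch; the helper and variable names are illustrative, not the PR's code:

```python
# Only the keys whose names actually differ between the checkpoint and the model
# (such as the block_sparse_moe entry quoted above) need an explicit entry.
_WEIGHT_NAME_OVERRIDES = {
    "layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1",
    # ...any other differing keys would be listed here...
}

def rename_weight(name: str, layer: int) -> str:
    """Return the model-side name for a checkpoint key, defaulting to the identity."""
    for src_tpl, dst_tpl in _WEIGHT_NAME_OVERRIDES.items():
        if name == src_tpl.format(layer):
            return dst_tpl.format(layer)
    # Keys such as layers.{}.attention.wq/wk/wv/wo keep the same name on both sides.
    return name
```

This keeps the map short and makes it obvious which names the Mixtral conversion actually rewrites.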
…tral checkpoints.
The Mixtral 8x7B model is working for both offline and online, in bf16 and int8. Let's get this in first so we can parallelize the work. Will add tests in the coming PRs.