Mixtral enablement. #120

Merged
wang2yn84 merged 17 commits into main from mixtral on Jun 11, 2024
Conversation

wang2yn84 (Collaborator):

The Mixtral 8x7B model is working for both offline and online modes, in bf16 and int8. Let's get this in first so we can parallelize the work. Tests will be added in the coming PRs.

qihqi (Collaborator) commented on Jun 10, 2024:

Please make sure the name is mixtral and not mistral. We might add Mistral 7B (the non-MoE version) later, so the name would be confusing otherwise.

README.md (Outdated)

## Run weight safetensor convert

```diff
 export input_ckpt_dir=Original llama weights directory
 export output_ckpt_dir=The output directory
-export model_name="llama-3" # or "llama-2", "gemma"
+export model_name="llama-3" # or "llama-2", "gemma", "mistral"
```

Collaborator:

Change this to mixtral.

wang2yn84 (Author):

Thanks. I was confused about the name initially, which is why there was a mix of mistral and mixtral. I've changed everything to Mixtral. Done.

```python
torch.empty(config.num_experts, config.intermediate_size, config.dim)
)

def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
```

Collaborator:

I had a change that uses different logic for longer sequence lengths, which I pushed to your branch; was that lost in the merge?

Collaborator:

Also the quantized change.

wang2yn84 (Author):

This is the original model. Your changes are in model.py.
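
As a minimal sketch of what the conditional (mixture-of-experts) feed-forward signature above typically implements, assuming a class name ConditionalFeedForward, stacked w1/w2/w3 expert weights, and a SwiGLU-style einsum formulation (all illustrative assumptions, not necessarily this PR's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


class ConditionalFeedForward(nn.Module):
    """Gathers per-token expert weights and applies a SwiGLU-style FFN (sketch)."""

    def __init__(self, num_experts: int, intermediate_size: int, dim: int):
        super().__init__()
        # One stacked weight tensor per projection, indexed by expert id.
        self.w1 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))
        self.w2 = nn.Parameter(torch.empty(num_experts, dim, intermediate_size))
        self.w3 = nn.Parameter(torch.empty(num_experts, intermediate_size, dim))

    def forward(self, x: Tensor, expert_indices: Tensor) -> Tensor:
        # x: [tokens, dim]; expert_indices: [tokens, experts_per_token]
        w1 = self.w1[expert_indices]  # [tokens, k, intermediate, dim]
        w2 = self.w2[expert_indices]  # [tokens, k, dim, intermediate]
        w3 = self.w3[expert_indices]  # [tokens, k, intermediate, dim]
        x1 = F.silu(torch.einsum("td,tkid->tki", x, w1))
        x3 = torch.einsum("td,tkid->tki", x, w3)
        # Project back to model dim; output is [tokens, k, dim], one row per chosen expert.
        return torch.einsum("tki,tkdi->tkd", x1 * x3, w2)
```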

qihqi self-requested a review on June 10, 2024.
FanhaiLu1 (Collaborator):

Thanks for adding Mixtral, the code is clean and overall looks good!

"layers.{}.attention.wk.weight": "layers.{}.attention.wk.weight",
"layers.{}.attention.wv.weight": "layers.{}.attention.wv.weight",
"layers.{}.attention.wo.weight": "layers.{}.attention.wo.weight",
"layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1",
Collaborator:

It looks like only these weight names are different; can we store only the differing names in the map?

wang2yn84 (Author):

Good point, removed.
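
A small sketch of the suggestion above: store only the checkpoint keys whose model-side name actually differs, and fall back to an identity mapping otherwise. Only the w1 line appears in the diff above; the w2/w3 entries and the helper name map_weight_name are illustrative assumptions.

```python
# Sketch of the reviewer's suggestion; only names that differ are stored.
# The w2/w3 entries and the helper name are illustrative assumptions.
_WEIGHT_NAME_OVERRIDES = {
    "layers.{}.block_sparse_moe.w1": "layers.{}.block_sparse_moe.cond_ffn.w1",
    "layers.{}.block_sparse_moe.w2": "layers.{}.block_sparse_moe.cond_ffn.w2",
    "layers.{}.block_sparse_moe.w3": "layers.{}.block_sparse_moe.cond_ffn.w3",
}


def map_weight_name(checkpoint_name: str) -> str:
    """Return the model-side name for a checkpoint key, defaulting to identity."""
    return _WEIGHT_NAME_OVERRIDES.get(checkpoint_name, checkpoint_name)
```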

wang2yn84 merged commit d6bf068 into main on Jun 11, 2024.
4 checks passed.
qihqi deleted the mixtral branch on July 15, 2024.