[Kernel] W8A16 Int8 inside FusedMoE #7415
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
- Comment /ready on the PR 🚀
How do I convert models to ExpertsInt8?
Hey, just FYI: there are some ongoing efforts to extend Marlin to support W4A16 and W8A16. Right now the kernels load GPTQ models, but we could really connect them to any model type. We should run benchmarks against these as well when deciding which kernel to use.
You would just need to run vLLM with the ExpertsInt8 quantization option (a rough sketch is below).
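As a rough sketch only (the exact flag value is cut off in the comment above): this assumes the method is registered under the name experts_int8 and uses Jamba purely as an illustrative model, so treat both as assumptions rather than confirmed details.

```python
# Sketch only: "experts_int8" and the model name are assumptions,
# not taken from the truncated comment above.
from vllm import LLM, SamplingParams

# No offline conversion step is assumed here: weights are quantized and
# scales extracted on startup.
llm = LLM(model="ai21labs/Jamba-v0.1", quantization="experts_int8")

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```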
I understand, I was trying to benchmark this method against the PR #7079. @robertgshaw2-neuralmagic, do you think this is a blocker for merging this PR? We can have the two options available.
Force-pushed from a2fb828 to 2564979.
It seems that #6502 is a similar PR as well.
Force-pushed from 8b00352 to c12635c.
I tested this pull request on deepseek-v2-chat-236B. It indeed allows more concurrency.
Thanks for the quick changes! I think these are my last round of comments
Thank you for the thorough review! I've changed the terms.
CI failures seem to be unrelated to this PR.
else:
    raise ValueError(
        f"Shard id must be in [0,1,2] but got {shard_id}")
weight_loader(param, loaded_weight, weight_name, shard_id,
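For context on the snippet above, here is a hypothetical sketch of how integer shard ids are often mapped onto fused MoE weight buffers. It is not the PR's actual implementation; the names (`w13_weight`, `w2_weight`, `load_expert_shard`) and the exact shard-id convention are invented for illustration.

```python
import torch

def load_expert_shard(w13_weight: torch.Tensor, w2_weight: torch.Tensor,
                      loaded_weight: torch.Tensor, expert_id: int,
                      shard_id: int, intermediate_size: int) -> None:
    """Hypothetical helper: copy one checkpoint shard into the fused MoE
    buffers. Assumed convention: 0 -> gate proj (first half of w13),
    1 -> down proj (w2), 2 -> up proj (second half of w13)."""
    if shard_id == 0:
        w13_weight[expert_id, :intermediate_size].copy_(loaded_weight)
    elif shard_id == 1:
        w2_weight[expert_id].copy_(loaded_weight)
    elif shard_id == 2:
        w13_weight[expert_id, intermediate_size:].copy_(loaded_weight)
    else:
        # Same guard as in the snippet above.
        raise ValueError(f"Shard id must be in [0,1,2] but got {shard_id}")
```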
Just an FYI, we're updating/expanding the weight loading for Fused MoE layers in #7527.
LGTM! Ditto with Dipika that it'd be good to make this work with the MoE Parameter refactor eventually
Thanks! I'll rebase; maybe that will resolve the CI issues.
Signed-off-by: Alvant <[email protected]>
🚀 The feature, motivation and pitch
Additional feature for the fused_moe Triton kernel to support W8A16 with Int8, called ExpertsInt8. It supports Ampere / Ada Lovelace / Hopper.
It is based on symmetric per-column, per-expert Int8 quantization, with the weights cast to FP16/BF16 before the matmul inside the fused_moe kernel (compute_type in FP16/BF16).
Quantization and scale extraction are done on startup (takes about 1 min on Jamba).
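To make the scheme concrete, here is a small sketch of what symmetric per-column, per-expert Int8 quantization with dequantization to FP16/BF16 before the matmul could look like. The function names, tensor shapes, and the exact column convention are assumptions for illustration, not the PR's kernel code.

```python
import torch

def quantize_expert_weights(weight: torch.Tensor):
    """Symmetric per-column, per-expert Int8 quantization (illustrative sketch).

    weight: [num_experts, out_features, in_features] in FP16/BF16.
    Returns Int8 weights plus one FP32 scale per (expert, output column).
    """
    max_abs = weight.abs().amax(dim=-1, keepdim=True).float()       # [E, out, 1]
    scales = (max_abs / 127.0).clamp_(min=1e-8)
    q = torch.clamp(torch.round(weight.float() / scales), -127, 127).to(torch.int8)
    return q, scales.squeeze(-1)                                     # [E, out]

def dequant_matmul(x: torch.Tensor, q: torch.Tensor, scales: torch.Tensor,
                   expert_id: int) -> torch.Tensor:
    """Conceptually what the fused kernel does per expert: cast Int8 back to
    the activation dtype, apply the per-column scale, then matmul in FP16/BF16
    (the kernel's compute_type)."""
    w = q[expert_id].to(x.dtype) * scales[expert_id].to(x.dtype).unsqueeze(-1)
    return x @ w.t()
```

One scale per output column per expert keeps the quantization error of each channel independent while adding only a small amount of extra metadata.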
We've run quality benchmarks on Jamba, and they show no quality degradation:
Performance:
E2E latency in seconds on requests with prompt length = 1024, decode length = 128:
Advantages: