Marlin symmetric quantization and inference #320
Conversation
Perplexity results:
- Symmetric AWQ Marlin model:
- Zero Point AWQ model:
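
For reference, below is a minimal sketch of how a perplexity comparison like this could be reproduced on WikiText-2 with a Hugging Face causal LM. The exact evaluation script used for the numbers above is not shown in this thread; the model/tokenizer loading step is left abstract, and the windowed scoring here is an approximation.

```python
import torch
from datasets import load_dataset

@torch.inference_mode()
def wikitext2_perplexity(model, tokenizer, seqlen=2048, device="cuda"):
    # Concatenate the raw test split into one long token stream.
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    # Score non-overlapping windows and average the negative log-likelihoods.
    nlls = []
    for i in range(0, ids.shape[1] - seqlen, seqlen):
        chunk = ids[:, i : i + seqlen]
        loss = model(chunk, labels=chunk).loss  # mean NLL over the window
        nlls.append(loss * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen)).item()
```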
Updated with new Marlin.
Perf bench: GEMM vs. Marlin vs. ExLlamaV2, at batch size 1 and batch size 8.
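
A rough sketch of how a throughput comparison like the one above could be run across backends. `load_quantized_model(backend)` is a placeholder, not a real function in this repo, and the actual benchmark setup behind the posted numbers may differ.

```python
import time
import torch

@torch.inference_mode()
def tokens_per_second(model, input_ids, new_tokens=128, warmup=2, iters=5):
    # Warm up so kernel compilation / caching does not skew the timing.
    for _ in range(warmup):
        model.generate(input_ids, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model.generate(input_ids, max_new_tokens=new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters * new_tokens * input_ids.shape[0] / elapsed

# Example loop over the configurations in the benchmark above:
# for batch_size in (1, 8):
#     for backend in ("gemm", "marlin", "exllamav2"):
#         model = load_quantized_model(backend)  # placeholder loader
#         ids = torch.randint(0, 32000, (batch_size, 64), device="cuda")
#         print(backend, batch_size, f"{tokens_per_second(model, ids):.1f} tok/s")
```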
Nice work. Might the architecture of the GPU impact things?
@vince62s I'd say "definitely", given that the kernel has many PTX assembly blocks and a hard constraint on architecture; see the kernel's repo: https://github.com/IST-DASLab/marlin
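
As an illustration, a guard like the following could be used to gate the Marlin path on GPU architecture. The exact minimum is defined by the Marlin kernels themselves; compute capability >= 8.0 (Ampere or newer) is assumed here.

```python
import torch

def marlin_is_supported() -> bool:
    # Marlin's PTX targets newer architectures; compute capability >= 8.0
    # (Ampere) is assumed here as the cutoff, per the kernel's repo.
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8
```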
Looking forward to seeing numbers at even higher batch sizes (32/64), which might be reasonable for seq len 1024/2048, once Marlin is optimized for Hopper.
Looks good to me! Fixed a small bug with the workspace after the latest update to Marlin. Nice to have a refactor of the Quantizer as well.
@casper-hansen Awesome! Apologies for not cleaning up the PR myself 😅, thanks for taking care of it 🙏
Great work adapting Marlin! I'm currently looking to do the same -- that is, adapt optimized inference kernels for different quantization formats. Roughly, what are the major changes that need to be made to adapt a quantization format in order to use a kernel such as Marlin? Specifically, how do the quantized weights, scales, and zeros need to be preprocessed in order to conform to the required layout for Marlin? E.g., starting from 4-bit quantized weights packed ...
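
Not an authoritative answer, but roughly the steps are: unpack the 4-bit values from their int32 storage, convert the zero-point representation to a symmetric one, and repack into the tile layout the Marlin kernel expects. Below is a hedged Python sketch of those steps; `awq_to_marlin_sketch` and the sequential nibble order are illustrative assumptions, not the actual AutoAWQ/Marlin code, and the final Marlin permutation is deliberately left out.

```python
import torch

def awq_to_marlin_sketch(qweight: torch.Tensor,  # [K, N // 8] int32, packed 4-bit weights
                         qzeros: torch.Tensor,   # [K // group_size, N // 8] int32, packed zero points
                         scales: torch.Tensor,   # [K // group_size, N] fp16 group scales
                         group_size: int = 128):
    """Illustrative preprocessing: unpack -> make symmetric -> repack for Marlin."""
    # 1) Unpack eight 4-bit values from each int32.
    #    NOTE: AWQ interleaves nibbles in a specific order; a plain sequential
    #    order is assumed here purely for illustration.
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    w = ((qweight.unsqueeze(-1) >> shifts) & 0xF).reshape(qweight.shape[0], -1)  # [K, N]
    z = ((qzeros.unsqueeze(-1) >> shifts) & 0xF).reshape(qzeros.shape[0], -1)    # [K // G, N]

    # 2) Symmetric Marlin has no zero points, so fold them out. This is only
    #    lossless if the checkpoint was quantized symmetrically (zero point == 8),
    #    which is presumably what this PR's symmetric AWQ path produces.
    w_sym = w - z.repeat_interleave(group_size, dim=0)  # values in a signed int4 range

    # 3) Permute/re-tile into the layout Marlin's kernel expects and repack into
    #    int32. The exact permutation lives in the Marlin repo / its repack
    #    kernel; it is deliberately omitted here.
    return w_sym, scales
```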
With @casper-hansen 🤗. Experimental, still needs cleanup.