Releases · casper-hansen/AutoAWQ
v0.2.6
What's Changed
- Cohere Support by @TechxGenus in #457
- Add phi3 support by @pprp in #481 (see the quantization sketch after this list)
- Support Weight-Only quantization on CPU device with QBits backend by @PenghuiCheng in #437
- Fix typo by @wanyaworld in #486
- Add updates + sponsorship by @casper-hansen in #495
- Update README.md by @casper-hansen in #497
- Update doc by @imba-tjd in #499
- add support for Openbmb/MiniCPM by @LDLINGLINGLING in #504
- Update RunPod support by @casper-hansen in #514
- add deepseek v2 support by @TechxGenus in #508
- Fix NaN problem in Qwen2-72B quantization by @baoyf4244 in #519
- Qwen NaN fix by @baoyf4244 in #522
- fix deepseek v2 input feat by @TechxGenus in #524
- Batched quantization by @casper-hansen in #516
- Fix step size when computing clipping by @casper-hansen in #531
- Pin torch version to 2.3.1 by @devin-ai-integration in #542
- Revert "Pin torch version to 2.3.1 (#542)" by @casper-hansen in #547
- CLI example + Runpod launch script by @casper-hansen in #548
- Print warning if AutoAWQ cannot load extensions by @casper-hansen in #515
- Remove progress bars by @casper-hansen in #550
- Add test for chunked methods by @casper-hansen in #551
- Llama with inputs_embeds only (LLaVA-v1.5 bug fixed) and LLaVA-v1.6 support by @WanBenLe in #471
- Better CLI + RunPod Script by @casper-hansen in #552
- Release 026 by @casper-hansen in #546
- pin torch==2.3.1 by @casper-hansen in #554
- Remove ROCm build and only build for PyPI by @casper-hansen in #555
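
All of the newly supported architectures above (Phi-3, Cohere, MiniCPM, DeepSeek V2) go through the same quantization entry point as existing models. A minimal sketch of the standard flow, where the model path and output directory are illustrative placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths: any architecture supported in this release
# (e.g. Phi-3, Cohere, MiniCPM, DeepSeek V2) follows the same flow.
model_path = "microsoft/Phi-3-mini-4k-instruct"
quant_path = "phi-3-mini-awq"

# Standard 4-bit AWQ settings used throughout the AutoAWQ examples.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the unquantized model, then run AWQ calibration.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized weights and tokenizer for later inference.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```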
New Contributors
- @pprp made their first contribution in #481
- @PenghuiCheng made their first contribution in #437
- @wanyaworld made their first contribution in #486
- @imba-tjd made their first contribution in #499
- @LDLINGLINGLING made their first contribution in #504
- @baoyf4244 made their first contribution in #519
- @devin-ai-integration made their first contribution in #542
- @WanBenLe made their first contribution in #471
Full Changelog: v0.2.5...v0.2.6
v0.2.5
What's Changed
- Fix fused models for tf >= 4.39 by @TechxGenus in #418
- FIX: Add safeguards for static cache + llama on transformers latest by @younesbelkada in #401
- Pin: lm_eval==0.4.1 by @casper-hansen in #426
- Implement `apply_clip` argument to `quantize()` by @casper-hansen in #427 (see the sketch after this list)
- Workaround: illegal memory access by @casper-hansen in #421
- Add download_kwargs for load model (#302) by @Roshiago in #399
- add starcoder2 support by @shaonianyr in #406
- Add StableLM support by @Isotr0py in #410
- Fix starcoder2 fused norm by @TechxGenus in #442
- Update generate example to llama 3 by @casper-hansen in #448
- [BUG] Fix github action documentation build by @suparious in #449
- Fix path by @casper-hansen in #451
- FIX: 'awq_ext' is not defined error by @younesbelkada in #465
- FIX: Fix multiple generations for new HF cache format by @younesbelkada in #444
- support max_memory to specify mem usage for each GPU by @laoda513 in #460
- Bump to 0.2.5 by @casper-hansen in #468
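
Three of the changes above (#427, #399, #460) surface as optional arguments on the loading and quantization entry points. A hedged sketch of how they compose, assuming the AutoAWQ API at this release; the argument names come from the PR titles, while the repo ids and values are illustrative placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Loading a quantized model:
# - max_memory (#460) caps memory usage per device,
# - download_kwargs (#399) is forwarded to the Hugging Face download machinery.
model = AutoAWQForCausalLM.from_quantized(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # illustrative quantized repo
    max_memory={0: "12GiB", "cpu": "30GiB"},
    download_kwargs={"revision": "main"},
)

# Quantizing with the clipping search disabled:
# apply_clip=False (#427) skips the weight-clipping step during calibration.
model_path = "mistralai/Mistral-7B-Instruct-v0.1"  # illustrative base model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config, apply_clip=False)
```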
New Contributors
- @Roshiago made their first contribution in #399
- @shaonianyr made their first contribution in #406
- @Isotr0py made their first contribution in #410
- @suparious made their first contribution in #449
- @laoda513 made their first contribution in #460
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
- Add Gemma Support by @TechxGenus in #393
- Pin transformers>=4.35.0,<=4.38.2 by @casper-hansen in #408
- Bump to v0.2.4 by @casper-hansen in #409
New Contributors
- @TechxGenus made their first contribution in #393
Full Changelog: v0.2.3...v0.2.4
v0.2.3
What's Changed
- New optimized kernels by @casper-hansen in #365
- Fix double bias by @casper-hansen in #383
- x_max -> x_mean and w_max -> w_mean name changes and some comments by @OscarSavolainenDR in #378
New Contributors
- @OscarSavolainenDR made their first contribution in #378
Full Changelog: v0.2.2...v0.2.3
v0.2.2
What's Changed
- Support Fused Mixtral on multi-GPU by @casper-hansen in #352
- Add multi-GPU benchmark of Mixtral by @casper-hansen in #353
- Remove MoE Triton kernels by @casper-hansen in #355
- Bump to 0.2.2 by @casper-hansen in #356
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- Avoid downloading ROCm by @casper-hansen in #347
- ENH / FIX: Few enhancements and fix for mixed-precision training by @younesbelkada in #348
- Fix triton dependency by @casper-hansen in #350
- Bump to 0.2.1 by @casper-hansen in #351
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- AWQ: Move the AWQ kernels to a separate repository by @casper-hansen in #279
- Add CPU-loaded multi-GPU quantization by @xNul in #289
- GGUF compatible quantization (2, 3, 4 bit / any bit) by @casper-hansen in #285
- Exllama kernels support by @IlyasMoutawwakil in #313
- Cleanup requirements by @casper-hansen in #295
- Torch only inference + any-device quantization by @casper-hansen in #319
- Up to 60% faster context processing by @casper-hansen in #316
- Evaluation: Add more evals by @casper-hansen in #283
- Fixes a breaking change in autoawq by @younesbelkada in #325
- AMD ROCM Support by @IlyasMoutawwakil in #315
- Marlin symmetric quantization and inference by @IlyasMoutawwakil in #320 (see the sketch after this list)
- Add qwen2 by @JustinLin610 in #321
- Fix n_samples by @casper-hansen in #326
- PEFT compatible GEMM by @casper-hansen in #324
- [PEFT] Fix PEFT batch size > 1 by @younesbelkada in #338
- v0.2.0 by @casper-hansen in #330
- Fix ROCm build by @casper-hansen in #342
- Fix dependency by @casper-hansen in #343
- Fix importlib by @casper-hansen in #344
- Fix workflow by @casper-hansen in #345
- Fix typo in setup.py by @casper-hansen in #346
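
Kernel choice in AutoAWQ is expressed through the `version` field of `quant_config` (GEMM vs. GEMV); the sketch below assumes the Marlin kernels from #320 plug into the same field and, per the PR title ("Marlin symmetric quantization"), require symmetric quantization (`zero_point=False`). Treat the exact option string as an assumption rather than a confirmed value:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # illustrative base model

# Symmetric 4-bit config targeting the Marlin kernels (#320).
# zero_point=False and version="Marlin" are assumptions inferred from
# the PR title, not a confirmed signature.
quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("mistral-7b-awq-marlin")
```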
New Contributors
- @xNul made their first contribution in #289
- @IlyasMoutawwakil made their first contribution in #313
- @JustinLin610 made their first contribution in #321
Full Changelog: v0.1.8...v0.2.0
v0.1.8
What's Changed
- Fix MPT by @casper-hansen in #206
- Add config to Base model by @casper-hansen in #207
- Add Qwen model by @Sanster in #182
- Robust quantization for Catcher by @casper-hansen in #209
- New scaling to improve perplexity by @casper-hansen in #216
- Benchmark hf generate by @casper-hansen in #237
- Fix position ids by @casper-hansen in #215
- Pass `model_init_kwargs` to `check_and_get_model_type` function by @rycont in #232
- Fixed an issue where the Qwen model had too much error after quantization by @jundolc in #243
- Load on CPU to avoid OOM by @casper-hansen in #236
- Update README.md by @casper-hansen in #245
- [core] Make AutoAWQ fused modules compatible with HF transformers by @younesbelkada in #244 (see the sketch after this list)
- [core] Fix quantization issues with transformers==4.36.0 by @younesbelkada in #249
- FEAT: Add possibility of skipping modules when quantizing by @younesbelkada in #248
- Fix quantization issue with transformers >= 4.36.0 by @younesbelkada in #264
- Mixtral: Mixture of Experts quantization by @casper-hansen in #251
- Fused rope theta by @casper-hansen in #270
- FEAT: add llava to autoawq by @younesbelkada in #250
- Add Baichuan2 Support by @AoyuQC in #247
- Set default rope_theta on LlamaLikeBlock by @casper-hansen in #271
- Update news and models supported by @casper-hansen in #272
- Add vLLM async example by @casper-hansen in #273
- Bump to v0.1.8 by @casper-hansen in #274
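
The fused-module work in #244 is exposed at load time. A minimal sketch of loading with fused layers enabled and generating through the standard transformers-style interface, assuming the AutoAWQ API of this release (the repo id is illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/zephyr-7B-beta-AWQ"  # illustrative quantized repo

# fuse_layers=True enables the fused attention/MLP modules that #244
# made compatible with the HF transformers generation utilities.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Generate with the usual transformers-style API (GPU assumed).
tokens = tokenizer("What is AWQ quantization?", return_tensors="pt").input_ids.cuda()
out = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```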
New Contributors
- @Sanster made their first contribution in #182
- @rycont made their first contribution in #232
- @jundolc made their first contribution in #243
- @AoyuQC made their first contribution in #247
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- Build older cuda wheels by @casper-hansen in #158
- Exclude download of CUDA wheels by @casper-hansen in #159
- New benchmarks in README by @casper-hansen in #160
- Fix typo in benchmark command by @casper-hansen in #161
- Yi support by @casper-hansen in #167
- Make sure to delete dummy model by @casper-hansen in #180
- Fix CUDA error: invalid argument by @casper-hansen in #179
- New logic for passing past_key_value by @younesbelkada in #177
- Reset cache on new generation by @casper-hansen in #178
- Adaptive batch sizing by @casper-hansen in #181
- Pass arguments to AutoConfig by @s4rduk4r in #97
- Fix cache util logic by @casper-hansen in #186
- Fix multi-GPU loading and inference by @casper-hansen in #190
- [core] Replace `QuantLlamaMLP` with `QuantFusedMLP` by @younesbelkada in #188
- [core] Add `is_hf_transformers` flag by @younesbelkada in #195
- Fixed multi-GPU quantization by @casper-hansen in #196
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- Pseudo dequantize function by @casper-hansen in #127
- CUDA 11.8.0 and 12.1.1 build by @casper-hansen in #128
- AwqConfig class by @casper-hansen in #132
- Fix init quant by @casper-hansen in #136
- Update readme by @casper-hansen in #137
- Benchmark info by @casper-hansen in #138
- Bump to v0.1.6 by @casper-hansen in #139
- CUDA 12 release by @casper-hansen in #140
- Revert to previous version by @casper-hansen in #141
- Fix performance regression by @casper-hansen in #148
- [core / attention] Fix fused attention generation with newest transformers version by @younesbelkada in #146
- Fix condition when rolling cache by @casper-hansen in #150
- Default to safetensors for quantized models by @casper-hansen in #151
- Create fused LlamaLikeModel by @casper-hansen in #152
Full Changelog: v0.1.5...v0.1.6