MODEL REQUESTS #69

robertgshaw2-neuralmagic · 2024-08-08T13:35:27Z

Please comment here any model requests for:

Models to be added to https://huggingface.co/neuralmagic
Example scripts within llm-compressor

The text was updated successfully, but these errors were encountered:

BlackSamorez · 2024-08-08T14:10:50Z

A gemma-2-27b-it in 8 bits for both a100 and h100 would be nice.
I tried to produce them myself but the resulting checkpoints return NaNs when loaded into vLLM.

robertgshaw2-neuralmagic · 2024-08-08T14:12:23Z

A gemma-2-27b-it in 8 bits for both a100 and h100 would be nice. I tried to produce them myself but the resulting checkpoints return NaNs when loaded into vLLM.

Thanks - looking for fp8 for H100 and int8 for A100?

BlackSamorez · 2024-08-08T14:18:53Z

A gemma-2-27b-it in 8 bits for both a100 and h100 would be nice. I tried to produce them myself but the resulting checkpoints return NaNs when loaded into vLLM.

Thanks - looking for fp8 for H100 and int8 for A100?

Exactly!

robertgshaw2-neuralmagic · 2024-08-08T14:20:32Z

A gemma-2-27b-it in 8 bits for both a100 and h100 would be nice. I tried to produce them myself but the resulting checkpoints return NaNs when loaded into vLLM.

Thanks - looking for fp8 for H100 and int8 for A100?

Exactly!

Can you share more about the issue you were seeing?

BlackSamorez · 2024-08-08T15:02:15Z

I'm getting empty generations and unserializeable logits, indicating NaNs in model outputs.
I used practically the same recipe as in the Llama-3.1-70b-Instruct-FP8 quant

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

robertgshaw2-neuralmagic · 2024-08-08T15:17:22Z

I'm getting empty generations and unserializeable logits, indicating NaNs in model outputs. I used practically the same recipe as in the Llama-3.1-70b-Instruct-FP8 quant

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

Could be a FlashInfer issue. Ill work on an example for you

edgan8 · 2024-08-08T21:55:05Z

Hi @robertgshaw2-neuralmagic , could we get an update to https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8 ? The main model https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 had its tokenizer updated recently and it would be great to incorporate these into the quantized model.

Syst3m1cAn0maly · 2024-08-10T07:33:00Z

Hi !
A phi-3-vision would be very nice in FP8 (ideally with k/v scales)
Thanks in advance !

robertgshaw2-neuralmagic · 2024-08-11T22:41:50Z

Hi @robertgshaw2-neuralmagic , could we get an update to https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8 ? The main model https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 had its tokenizer updated recently and it would be great to incorporate these into the quantized model.

Absolutely @Lin-K76 - could you update this when you have a chance this week

robertgshaw2-neuralmagic · 2024-08-11T22:43:37Z

Hi ! A phi-3-vision would be very nice in FP8 (ideally with k/v scales) Thanks in advance !

We can take a look at this, adding support for Vision models is on our roadmap but we need to try it out a bit more.

robertgshaw2-neuralmagic · 2024-08-12T00:40:07Z

I'm getting empty generations and unserializeable logits, indicating NaNs in model outputs. I used practically the same recipe as in the Llama-3.1-70b-Instruct-FP8 quant

recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
"""

@BlackSamorez - I made a couple examples with gemma2 for you (#78)

Note: gemma2 has been a bit unstable in vllm due to the soft capping on the logits. We are stabilizing this as part of the current release process.

Here's install instructions on the vllm side:

export VLLM_VERSION=0.5.4
pip install [https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl](https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-$%7BVLLM_VERSION%7D-cp38-abi3-manylinux1_x86_64.whl)
pip install lm_eval==0.4.3
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.4-cp310-cp310-linux_x86_64.whl

Eval fp16:

MODEL=google/gemma-2-27b-it
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"

vllm (pretrained=google/gemma-2-27b-it,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.864|±  |0.0217|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0228|

Eval fp8 (made with the script):

MODEL=gemma-2-27b-it-FP8-Dynamic
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"

vllm (pretrained=gemma-2-27b-it-FP8-Dynamic,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.856|±  |0.0222|
|     |       |strict-match    |     5|exact_match|↑  |0.852|±  |0.0225|

The strict-match scores (the one that matters) is not impacted. This shows the fp8 quantization is working.

We will push a model up to the hub later this week once we have a chance to QA it.

Lin-K76 · 2024-08-12T04:04:30Z

Hi @robertgshaw2-neuralmagic , could we get an update to https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8 ? The main model https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 had its tokenizer updated recently and it would be great to incorporate these into the quantized model.

Hi, the new model is now live at https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8.

robertgshaw2-neuralmagic · 2024-08-12T04:06:28Z

Thanks @Lin-K76 !

yzlnew · 2024-08-12T11:45:42Z

Qwen2 series in marlin24 format. I'm having trouble generating model (0.5B and 72B) with proper output, getting NaN logits. Config in #54.

Oneshot with 2:4 sparse or GPTQ alone is fine, but not both. Do I need to change my calibration dataset or GPTQ config?

robertgshaw2-neuralmagic · 2024-08-12T13:45:11Z

Qwen2 series in marlin24 format. I'm having trouble generating model (0.5B and 72B) with proper output, getting NaN logits. Config in #54.

Oneshot with 2:4 sparse or GPTQ alone is fine, but not both. Do I need to change my calibration dataset or GPTQ config?

Thanks @yzlnew, I will take a look.

My suggestion though would be to use the W8A8 (int8 on ampere / fp8 on hopper) for production use cases as this will give you the best recovery and performance right now.

We are still working on making sparsity better. I will work on a demo for you later this week though :)

supa-thibaud · 2024-08-20T08:09:57Z

the Hermes 3 70b in int4 could be very great!

halexan · 2024-08-22T03:10:54Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !

How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !

Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

robertgshaw2-neuralmagic · 2024-08-22T14:43:38Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !

How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !

Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.

We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

sigjhl · 2024-08-23T07:50:45Z

Hi, can I please ask for a gemma-2-27b-int8? It's a good fit for 48GB cards and I'd love to run it with vLLM. Many quantization methods seem broken for this model unfortunately... would really appreciate it!

fengyang95 · 2024-08-24T10:58:13Z

the Hermes 3 70b in int4 could be very great!

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.

We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

DeepSeek-Coder-V2-Instruct in W4A16 would be great! Looking forward to your model release.

fengyang95 · 2024-08-24T12:39:17Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.

We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2 to w4a16, but the following error occurred.
ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM.
Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

robertgshaw2-neuralmagic · 2024-08-24T16:23:01Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?

Also - note that quantization support for MoEs is still under construction in vllm.

halexan · 2024-08-25T03:41:53Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.

We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

Do you mean this PR #7766 ? for W4A16 ? @robertgshaw2-neuralmagic

fengyang95 · 2024-08-25T15:31:28Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?

Also - note that quantization support for MoEs is still under construction in vllm.

I see, I forgot to set trust_remote_code=True.

robertgshaw2-neuralmagic · 2024-08-25T18:30:13Z

#7766

yes

halexan · 2024-08-26T02:52:04Z

#7766

yes

So, What's the difference between #7415 and #7766?

fengyang95 · 2024-08-26T05:06:14Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?

Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

robertgshaw2-neuralmagic · 2024-08-28T16:10:10Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?

I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.

When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?

Can you try this example here with sequential_update:

https://github.com/vllm-project/llm-compressor/blob/main/examples/big_models_with_accelerate/multi_gpu_int8_sequential_update.py

You'll need to install from source for this

fengyang95 · 2024-08-29T02:17:21Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?

I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.
When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?

Can you try this example here with sequential_update:

https://github.com/vllm-project/llm-compressor/blob/main/examples/big_models_with_accelerate/multi_gpu_int8_sequential_update.py

You'll need to install from source for this
Yes, I used sequential_update=True. Here is my code.
If this is not set, it will use more GPU memory and cause OOM.

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import argparse
from typing import Dict, Union

import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModelForCausalLM
import flash_attn

print(flash_attn.__version__)


def custom_offload_device_map(
        model_stub: str,
        max_memory_per_gpu: Union[str, int],
        max_memory_gpu0: Union[str, int],
        num_gpus: int = 1,
        offload_buffers:bool=False,
        **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.

    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param num_gpus: number of gpus to utilize
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory

    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers
        )
        del dummy_model

    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default=None)
    parser.add_argument("--dataset-dir", type=str, default=None)
    parser.add_argument("--save-dir", type=str, default=None)
    # parser.add_argument("", type=str, default="auto")
    parser.add_argument("--max-memory-per-gpu", type=str, default="35GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="35GB")
    parser.add_argument("--device-map", type=str, default='auto')
    parser.add_argument("--num-samples", type=int, default=512)
    parser.add_argument("--offload-buffers",type=bool,default=False)
    args = parser.parse_args()

    from datasets import load_dataset

    NUM_CALIBRATION_SAMPLES = args.num_samples
    MAX_SEQUENCE_LENGTH = 2048

    # Load dataset.
    ds = load_dataset(args.dataset_dir, split="train_sft")
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

    tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)


    # Preprocess the data into the format the model is trained with.
    def preprocess(example):
        return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False, )}


    ds = ds.map(preprocess)


    # Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it).
    def tokenize(sample):
        return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True,
                         add_special_tokens=False)


    ds = ds.map(tokenize, remove_columns=ds.column_names)

    from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    # Configure the quantization algorithm to run.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=True)
    num_gpus = 8

    if args.device_map == "cpu":
        device_map = "cpu"
    else:
        device_map = custom_offload_device_map(
            args.model_id, max_memory_per_gpu=args.max_memory_per_gpu, max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=num_gpus, trust_remote_code=True, torch_dtype=torch.bfloat16,offload_buffers=args.offload_buffers

        )

    model = SparseAutoModelForCausalLM.from_pretrained(
        args.model_id, trust_remote_code=True, device_map=device_map, torch_dtype=torch.bfloat16,
    )
    # Apply quantization.
    oneshot(
        model=model, dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(args.save_dir, save_compressed=True)
    tokenizer.save_pretrained(args.save_dir)

fengyang95 · 2024-08-30T02:26:24Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?

I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.
When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?

Can you try this example here with sequential_update:

https://github.com/vllm-project/llm-compressor/blob/main/examples/big_models_with_accelerate/multi_gpu_int8_sequential_update.py

You'll need to install from source for this

How should I load a w4a16 version of deepseek-v2 by vllm that was compressed using llm-compressor?
I used quantization=compressed-tensors, but it throws an error:

File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in init
assert self.quant_method is not None

robertgshaw2-neuralmagic · 2024-08-30T14:02:54Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?

I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.
When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?

Can you try this example here with sequential_update:

https://github.com/vllm-project/llm-compressor/blob/main/examples/big_models_with_accelerate/multi_gpu_int8_sequential_update.py

You'll need to install from source for this

How should I load a w4a16 version of deepseek-v2 by vllm that was compressed using llm-compressor? I used quantization=compressed-tensors, but it throws an error:

File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in init assert self.quant_method is not None

Release v0.5.6 will support it. Need this PR: vllm-project/vllm#7766

fengyang95 · 2024-08-30T15:41:58Z

neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great !
How about DeepSeek-Coder-V2-Instruct in W8A8(INT8) ? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great !
Or any instructions help me to quantinize DeepSeek-Coder-V2-Instruct to W8A8(INT8) ?

Hello! Currently in vllm, we only support FP8 inference for MoE models.
We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap.

I tried to quantize deepseek-coder-v2, but the following error occurred. ValueError: Unrecognized configuration class <class 'transformers_modules.deepseek_7b.configuration_deepseek.DeepseekV2Config'> for this kind of AutoModel: SparseAutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, Gemma2Config, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, Mamba2Config, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NemotronConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.

What is your transformers version?
Also - note that quantization support for MoEs is still under construction in vllm.

I tried to quantize deepseek-v2 to w4a16 (using A100 80G * 8, 1800G memory), but it suddenly gets killed when running to "INFO - Preparing model.layers.58 for compression".

This usually means you’re running out of CPU memory. This is a big model … how much CPU RAM and GPU RAM do you have?

I tried quantizing deepseek-coder-v2-instruct using 8 A100 80G GPUs. To avoid OOM, I set memory_limits to 35G.
When it reached the 32nd layer during quantization, the speed suddenly slowed down. I suspect that this portion of the parameters was loaded to the CPU, causing the slowdown. But why is it even slower than loading everything to the CPU?

Can you try this example here with sequential_update:

https://github.com/vllm-project/llm-compressor/blob/main/examples/big_models_with_accelerate/multi_gpu_int8_sequential_update.py

You'll need to install from source for this

How should I load a w4a16 version of deepseek-v2 by vllm that was compressed using llm-compressor? I used quantization=compressed-tensors, but it throws an error:
File "/usr/local/lib/python3.9/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 192, in init assert self.quant_method is not None

Release v0.5.6 will support it. Need this PR: vllm-project/vllm#7766

Is this PR still in progress? Do you have an estimated timeline?

fengyang95 · 2024-09-10T12:39:10Z

@robertgshaw2-neuralmagic I use this framework with 512 data points to calibrate the quantized deepseek-v2.5 model. The output result is "!!". Are there any tricks for quantizing this model?
Here is my script:

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import argparse
from typing import Dict, Union
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModelForCausalLM
import flash_attn
from datasets import load_dataset

print(flash_attn.__version__)


def custom_offload_device_map(
        model_stub: str,
        max_memory_per_gpu: Union[str, int],
        max_memory_gpu0: Union[str, int],
        num_gpus: int = 1,
        offload_buffers: bool = False,
        **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.

    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param num_gpus: number of gpus to utilize
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory

    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers
        )
        del dummy_model

    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default="/opt/tiger/deepseek_http/models--deepseek-ai--DeepSeek-V2.5")
    parser.add_argument("--dataset-dir", type=str,
                        default="/opt/tiger/deepseek_http/datasets--HuggingFaceH4--ultrachat_200k")
    parser.add_argument("--max-memory-per-gpu", type=str, default="52GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="52GB")
    parser.add_argument("--device-map", type=str, default='auto')
    parser.add_argument("--num-samples", type=int, default=512)
    parser.add_argument("--offload-buffers",  action='store_true')
    parser.add_argument("--max-model-len", type=int, default=8192)
    parser.add_argument("--sequential-update", action='store_true')
    parser.add_argument("--dataset-split", type=str, default='train_sft')
    args = parser.parse_args()

    # Select calibration dataset.
    DATASET_ID = args.dataset_dir
    DATASET_SPLIT = args.dataset_split

    MAX_SEQUENCE_LENGTH = args.max_model_len
    NUM_CALIBRATION_SAMPLES = args.num_samples
    # Load dataset and preprocess.
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)


    def preprocess(example):
        if 'messages' in example:
            messages = example['messages']
        elif 'input' in example and 'output' in example:
            messages = [
                {
                    "role": "user",
                    "content": example['input']
                },
                {
                    "role": "assistant",
                    "content": example['output']
                }
            ]
        else:
            raise ValueError("in valid example")
        return {
            "text": tokenizer.apply_chat_template(
                messages,
                tokenize=False,
            )
        }


    ds = ds.map(preprocess)


    # Tokenize inputs.
    def tokenize(sample):
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )


    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # define a llmcompressor recipe for W8A8 quantization
    recipe = GPTQModifier(
        targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=args.sequential_update
    )

    if args.device_map == "cpu":
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True
        )
    else:
        device_map = custom_offload_device_map(
            model_stub=args.model_id,
            max_memory_per_gpu=args.max_memory_per_gpu,
            max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=8,
            offload_buffers=args.offload_buffers,
            trust_remote_code=True
        )
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
        )

    SAVE_DIR = args.model_id + '-W4A16'

    oneshot(
        model=model, dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(SAVE_DIR, save_compressed=True,
                          skip_compression_stats=True)
    tokenizer.save_pretrained(SAVE_DIR)

robertgshaw2-neuralmagic · 2024-09-10T15:59:15Z

@robertgshaw2-neuralmagic I use this framework with 512 data points to calibrate the quantized deepseek-v2.5 model. The output result is "!!". Are there any tricks for quantizing this model? Here is my script:

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import argparse
from typing import Dict, Union
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModelForCausalLM
import flash_attn
from datasets import load_dataset

print(flash_attn.__version__)


def custom_offload_device_map(
        model_stub: str,
        max_memory_per_gpu: Union[str, int],
        max_memory_gpu0: Union[str, int],
        num_gpus: int = 1,
        offload_buffers: bool = False,
        **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.

    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param num_gpus: number of gpus to utilize
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory

    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers
        )
        del dummy_model

    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default="/opt/tiger/deepseek_http/models--deepseek-ai--DeepSeek-V2.5")
    parser.add_argument("--dataset-dir", type=str,
                        default="/opt/tiger/deepseek_http/datasets--HuggingFaceH4--ultrachat_200k")
    parser.add_argument("--max-memory-per-gpu", type=str, default="52GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="52GB")
    parser.add_argument("--device-map", type=str, default='auto')
    parser.add_argument("--num-samples", type=int, default=512)
    parser.add_argument("--offload-buffers",  action='store_true')
    parser.add_argument("--max-model-len", type=int, default=8192)
    parser.add_argument("--sequential-update", action='store_true')
    parser.add_argument("--dataset-split", type=str, default='train_sft')
    args = parser.parse_args()

    # Select calibration dataset.
    DATASET_ID = args.dataset_dir
    DATASET_SPLIT = args.dataset_split

    MAX_SEQUENCE_LENGTH = args.max_model_len
    NUM_CALIBRATION_SAMPLES = args.num_samples
    # Load dataset and preprocess.
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

    tokenizer = AutoTokenizer.from_pretrained(args.model_id)


    def preprocess(example):
        if 'messages' in example:
            messages = example['messages']
        elif 'input' in example and 'output' in example:
            messages = [
                {
                    "role": "user",
                    "content": example['input']
                },
                {
                    "role": "assistant",
                    "content": example['output']
                }
            ]
        else:
            raise ValueError("in valid example")
        return {
            "text": tokenizer.apply_chat_template(
                messages,
                tokenize=False,
            )
        }


    ds = ds.map(preprocess)


    # Tokenize inputs.
    def tokenize(sample):
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )


    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # define a llmcompressor recipe for W8A8 quantization
    recipe = GPTQModifier(
        targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=args.sequential_update
    )

    if args.device_map == "cpu":
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True
        )
    else:
        device_map = custom_offload_device_map(
            model_stub=args.model_id,
            max_memory_per_gpu=args.max_memory_per_gpu,
            max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=8,
            offload_buffers=args.offload_buffers,
            trust_remote_code=True
        )
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
        )

    SAVE_DIR = args.model_id + '-W4A16'

    oneshot(
        model=model, dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(SAVE_DIR, save_compressed=True,
                          skip_compression_stats=True)
    tokenizer.save_pretrained(SAVE_DIR)

Thanks @fengyang95 - @dsikka is looking into this

dsikka · 2024-09-10T16:47:09Z

Hey @fengyang95 - investigating this issue. Will update once fixed.
Thanks!

dsikka · 2024-09-12T02:30:30Z

Hi @fengyang95 - can you share the code you're using which generates !!!?

We have also added this example which you can follow:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w8a8.py
You can swap the model to the lager model and the scheme to W4A16.

You'll need to use the latest main to pull in a fix that was needed for deepseek_v2

fengyang95 · 2024-09-12T03:49:48Z

pull in a fix that was needed for deepseek_v

python3 -m vllm.entrypoints.openai.api_server --model DeepSeek-V2.5-W4A16 ---served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 16384 --port $PORT0 --gpu-memory-utilization 0.9 --quantization compressed-tensors --force-eager

fengyang95 · 2024-09-12T03:50:38Z

python3 -m vllm.entrypoints.openai.api_server --model DeepSeek-V2.5-W4A16 ---served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 16384 --port $PORT0 --gpu-memory-utilization 0.9 --quantization compressed-tensors --force-eager

Hi @fengyang95 - can you share the code you're using which generates !!!?

We have also added this example which you can follow: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w8a8.py You can swap the model to the lager model and the scheme to W4A16.

You'll need to use the latest main to pull in a fix that was needed for deepseek_v2

Thank you, I'll try it right away.

fengyang95 · 2024-09-13T05:55:27Z

Hi @fengyang95 - can you share the code you're using which generates !!!?

We have also added this example which you can follow: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w8a8.py You can swap the model to the lager model and the scheme to W4A16.

You'll need to use the latest main to pull in a fix that was needed for deepseek_v2

Hi @dsikka , I followed your suggestion to ignore the gate parameter and updated the code. However, the quantized model still outputs "!!!". Have you tested this on DeepSeek-v2.5?

dsikka · 2024-09-13T15:32:57Z

Hi @fengyang95 there was a bug in vLLM which has now been fixed on main. Do you mind trying it again?
We have also added a W4A16 end-to-end example: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py

fengyang95 · 2024-09-13T15:52:17Z

Hi @fengyang95 there was a bug in vLLM which has now been fixed on main. Do you mind trying it again? We have also added a W4A16 end-to-end example: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantizing_moe/deepseek_moe_w4a16.py

I'll try it asap

TheAhmadOsman · 2024-09-17T08:16:50Z

@dsikka

I am getting the following error while trying to run https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/vllm/vllm/worker/model_runner_base.py", line 112, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner.py", line 1547, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
    hidden_states, residual = layer(positions, hidden_states,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
    final_hidden_states = self.experts(
                          ^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
    final_hidden_states = self.quant_method.apply(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 285, in apply
    return fused_marlin_moe(
           ^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 150, in fused_marlin_moe
    assert hidden_states.dtype == torch.float16
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 242, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 576, in from_engine_args
    engine = cls(
             ^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 471, in __init__
    self.engine = self._engine_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 260, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 331, in __init__
    self._initialize_kv_caches()
  File "/vllm/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner.py", line 1219, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner_base.py", line 126, in _wrapper
    raise type(err)(
AssertionError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240917-022954.pkl): 
ERROR 09-17 02:30:01 api_server.py:203] RPCServer process died before responding to readiness probe
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown                                                                                        
  warnings.warn('resource_tracker: There appear to be %d '

The command I ran:
vllm serve nm-testing/DeepSeek-V2.5-W4A16 --tensor-parallel-size 8 --gpu-memory-utilization 0.96 --max-model-len 131072 --dtype auto --quantization compressed-tensors --trust-remote-code

dsikka · 2024-09-17T13:30:14Z

Hi @TheAhmadOsman - the current kernel supports float16. Could you pass that in for dtype?

TheAhmadOsman · 2024-09-18T04:03:51Z

@dsikka running vllm serve nm-testing/DeepSeek-V2.5-W4A16 --tensor-parallel-size 8 --gpu-memory-utilization 0.96 --max-model-len 131072 --dtype float16 --quantization compressed-tensors --trust-remote-code I get the following error:

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner.py", line 1590, in execute_model
    hidden_or_intermediate_states = model_executable(
                                    ^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
    hidden_states, residual = layer(positions, hidden_states,
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
    final_hidden_states = self.experts(
                          ^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
    final_hidden_states = self.quant_method.apply(
                          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 285, in apply
    return fused_marlin_moe(
           ^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 171, in fused_marlin_moe
    sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
    ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
  File "/vllm/vllm/_custom_ops.py", line 32, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/_custom_ops.py", line 800, in moe_align_block_size
    torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
  File "/vllm/venv/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
    return self_._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 242, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 576, in from_engine_args
    engine = cls(
             ^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 471, in __init__
    self.engine = self._engine_class(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/engine/async_llm_engine.py", line 260, in __init__
    super().__init__(*args, **kwargs)
  File "/vllm/vllm/engine/llm_engine.py", line 331, in __init__
    self._initialize_kv_caches()
  File "/vllm/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
    num_blocks = self._run_workers("determine_num_available_blocks", )
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner.py", line 1236, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
    raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240917-230234.pkl): CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.```

TheAhmadOsman · 2024-09-18T04:14:30Z

@dsikka I just noticed that config.json file on the model's page has "torch_dtype": "bfloat16", unlike other models you uploaded which have it as float16. Also, I believe the end-to-end example you posted earlier also has dtype assigned as bfloat16.

TheAhmadOsman · 2024-09-24T06:55:19Z

@dsikka any thoughts? I'd appreciate any pointers

dsikka · 2024-09-27T15:02:11Z

Hi @TheAhmadOsman - the following command worked for me. Do you mind trying it?

vllm serve nm-testing/DeepSeek-V2.5-W4A16 --tensor-parallel-size 4 --max-model-len 1024 --dtype float16  --trust-remote-code

There was a bug introduced in vllm recently so you'll have to wait until the following bug fix lands:
vllm-project/vllm#8906

mphilippnv · 2024-10-08T23:56:39Z

I see a lot of Deepseek 2.5 discussion here. I'm very interested in an FP8 version so we can deploy on some H100's optimally. Appreciate all the work this team has done!

Syst3m1cAn0maly · 2024-10-09T06:21:37Z

Hi !
I see that some vision models are now supported for FP8 quantization, is Phi 3.5 Vision supported yet ?
And if so, what would be the recipe to apply ?

Thanks in advance

HelloCard · 2024-10-13T11:49:43Z

for byroneverson/internlm2_5-20b-chat-abliterated, can you quant it to w8a8?
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
python setup.py install
and edit transformers's file to add internlm2's files, I have done this way to quant solar-pro-preview-instruct to w8a8 successful.
and edit /root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py to make any trust_remote_code is True.

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch
MODEL_ID = "/root/autodl-tmp/internlm2_5-20b-chat-abliterated"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)





from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048

# Load and preprocess the dataset
ds = load_dataset("/root/autodl-tmp/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)






from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.88),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], sequential_update=True),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

root@autodl-container-542540859a-97df04fc:~/autodl-tmp# python3 quant.py 
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:11<00:00,  1.75it/s]
2024-10-13T19:41:24.999390+0800 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 4, distributed training: True, 16-bits training: False
2024-10-13T19:41:24.999932+0800 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=4,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/runs/Oct13_19-41-24_autodl-container-542540859a-97df04fc,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=./output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=[SmoothQuantModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, smoothing_strength=0.88, mappings=None, ignore=None, num_calibration_steps=None, calibration_function=None, hooks_=None, resolved_mappings_=None, scales_=None), GPTQModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, sequential_update=True, targets='Linear', sequential_targets=None, block_size=128, quantize=True, dampening_frac=0.01, config_groups=None, ignore=['lm_head'], disable_quantization_observer_epoch=None, num_calibration_steps=None, scheme='W8A8', model=None, layer_compressors_=None, compressible_layers_=None, quantization_modifier_=None)],
recipe_args=None,
remove_unused_columns=True,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2024-10-13T19:41:25.908747+0800 | _check_create_state | INFO - State created for compression lifecycle
2024-10-13T19:41:25.910896+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-13T19:41:25.912380+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-13T19:41:25.940224+0800 | one_shot | INFO - *** One Shot ***
2024-10-13T19:41:25.945235+0800 | from_modifiers | INFO - Creating recipe from modifiers
2024-10-13T19:41:25.980732+0800 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
2024-10-13T19:41:25.980834+0800 | _infer_mappings_from_model | INFO - No SmoothQuantModifier.mappings provided, inferring from model...
2024-10-13T19:41:25.980886+0800 | get_layer_mappings_from_architecture | INFO - Architecture InternLM2ForCausalLM not found in mappings. Using default mappings: [LayerMap(balance_layers=['re:.*q_proj', 're:.*k_proj', 're:.*v_proj'], smooth_layers='re:.*input_layernorm'), LayerMap(balance_layers=['re:.*gate_proj', 're:.*up_proj'], smooth_layers='re:.*post_attention_layernorm')]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 72, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 178, in _resolve_mappings
    to_smooth_layers = get_layers(to_smooth, model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*input_layernorm'] in module InternLM2ForCausalLM(
  (model): InternLM2Model(
    (tok_embeddings): Embedding(92544, 6144, padding_idx=2)
    (layers): ModuleList(
      (0-47): 48 x InternLM2DecoderLayer(
        (attention): InternLM2Attention(
          (wqkv): Linear(in_features=6144, out_features=8192, bias=False)
          (wo): Linear(in_features=6144, out_features=6144, bias=False)
          (rotary_emb): InternLM2DynamicNTKScalingRotaryEmbedding()
        )
        (feed_forward): InternLM2MLP(
          (w1): Linear(in_features=6144, out_features=16384, bias=False)
          (w3): Linear(in_features=6144, out_features=16384, bias=False)
          (w2): Linear(in_features=16384, out_features=6144, bias=False)
          (act_fn): SiLU()
        )
        (attention_norm): InternLM2RMSNorm()
        (ffn_norm): InternLM2RMSNorm()
      )
    )
    (norm): InternLM2RMSNorm()
  )
  (output): Linear(in_features=6144, out_features=92544, bias=False)
)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/autodl-tmp/quant.py", line 49, in <module>
    oneshot(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 364, in main
    stage_runner.one_shot()
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/session_mixin.py", line 416, in one_shot
    apply(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/modifier.py", line 119, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 126, in on_initialize
    self.resolved_mappings_ = self._resolve_mappings(state.model)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: Error resolving mappings for given architecture.Please refer to the README at /root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/README.md for more information.

still failed, sad.

samos123 · 2024-10-13T20:16:44Z

Would the fp8 models published by Neural Magic work with tpu_int8 quantization in vLLM?

This is the error I get:

ValueError: Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (tpu_int8).

Should I try to publish int8 models and would those possibly work with compressend-tensors?

error I get when letting it choose quantizaiton method:

ValueError: compressed-tensors quantization is currently not supported in TPU Backend.

mgoin · 2024-10-14T15:24:35Z

@Syst3m1cAn0maly we have a Phi 3.5 vision models that was quantized to FP8, you can try it here https://huggingface.co/nm-testing/Phi-3.5-vision-instruct-FP8-dynamic

@samos123 for INT8 backends you must use INT8 models, maybe you can try some of the w8a8 models here https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415

samos123 · 2024-10-14T16:08:41Z

@mgoin I tried the w8a8 model as well but there is no TPU support for compressed tensors just yet, so it didn't work. I got an error saying exactly that.

The PR by @robertgshaw2-neuralmagic should fix this though: vllm-project/vllm#9301

robertgshaw2-neuralmagic · 2024-10-15T22:06:20Z

error I get when letting it choose quantizaiton method:

TPUs do not support fp8 quantization for acceleration. So we are focusing on:

w4a16
w8a16
w8a8 int8

Ideally will have something ready for next release. TBD on perf

robertgshaw2-neuralmagic · 2024-10-15T22:07:07Z

for byroneverson/internlm2_5-20b-chat-abliterated, can you quant it to w8a8? git clone https://github.com/vllm-project/llm-compressor.git cd llm-compressor python setup.py install and edit transformers's file to add internlm2's files, I have done this way to quant solar-pro-preview-instruct to w8a8 successful. and edit /root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py to make any trust_remote_code is True.

from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import torch
MODEL_ID = "/root/autodl-tmp/internlm2_5-20b-chat-abliterated"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)





from datasets import load_dataset

NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048

# Load and preprocess the dataset
ds = load_dataset("/root/autodl-tmp/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)






from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Configure the quantization algorithms
recipe = [
    SmoothQuantModifier(smoothing_strength=0.88),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"], sequential_update=True),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save the compressed model
SAVE_DIR = MODEL_ID.split("/")[1] + "-W8A8-Dynamic-Per-Token"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

root@autodl-container-542540859a-97df04fc:~/autodl-tmp# python3 quant.py 
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:11<00:00,  1.75it/s]
2024-10-13T19:41:24.999390+0800 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 4, distributed training: True, 16-bits training: False
2024-10-13T19:41:24.999932+0800 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=4,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/runs/Oct13_19-41-24_autodl-container-542540859a-97df04fc,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=./output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=[SmoothQuantModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, smoothing_strength=0.88, mappings=None, ignore=None, num_calibration_steps=None, calibration_function=None, hooks_=None, resolved_mappings_=None, scales_=None), GPTQModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, sequential_update=True, targets='Linear', sequential_targets=None, block_size=128, quantize=True, dampening_frac=0.01, config_groups=None, ignore=['lm_head'], disable_quantization_observer_epoch=None, num_calibration_steps=None, scheme='W8A8', model=None, layer_compressors_=None, compressible_layers_=None, quantization_modifier_=None)],
recipe_args=None,
remove_unused_columns=True,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2024-10-13T19:41:25.908747+0800 | _check_create_state | INFO - State created for compression lifecycle
2024-10-13T19:41:25.910896+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-13T19:41:25.912380+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-13T19:41:25.940224+0800 | one_shot | INFO - *** One Shot ***
2024-10-13T19:41:25.945235+0800 | from_modifiers | INFO - Creating recipe from modifiers
2024-10-13T19:41:25.980732+0800 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
2024-10-13T19:41:25.980834+0800 | _infer_mappings_from_model | INFO - No SmoothQuantModifier.mappings provided, inferring from model...
2024-10-13T19:41:25.980886+0800 | get_layer_mappings_from_architecture | INFO - Architecture InternLM2ForCausalLM not found in mappings. Using default mappings: [LayerMap(balance_layers=['re:.*q_proj', 're:.*k_proj', 're:.*v_proj'], smooth_layers='re:.*input_layernorm'), LayerMap(balance_layers=['re:.*gate_proj', 're:.*up_proj'], smooth_layers='re:.*post_attention_layernorm')]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 72, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 178, in _resolve_mappings
    to_smooth_layers = get_layers(to_smooth, model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*input_layernorm'] in module InternLM2ForCausalLM(
  (model): InternLM2Model(
    (tok_embeddings): Embedding(92544, 6144, padding_idx=2)
    (layers): ModuleList(
      (0-47): 48 x InternLM2DecoderLayer(
        (attention): InternLM2Attention(
          (wqkv): Linear(in_features=6144, out_features=8192, bias=False)
          (wo): Linear(in_features=6144, out_features=6144, bias=False)
          (rotary_emb): InternLM2DynamicNTKScalingRotaryEmbedding()
        )
        (feed_forward): InternLM2MLP(
          (w1): Linear(in_features=6144, out_features=16384, bias=False)
          (w3): Linear(in_features=6144, out_features=16384, bias=False)
          (w2): Linear(in_features=16384, out_features=6144, bias=False)
          (act_fn): SiLU()
        )
        (attention_norm): InternLM2RMSNorm()
        (ffn_norm): InternLM2RMSNorm()
      )
    )
    (norm): InternLM2RMSNorm()
  )
  (output): Linear(in_features=6144, out_features=92544, bias=False)
)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/autodl-tmp/quant.py", line 49, in <module>
    oneshot(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 364, in main
    stage_runner.one_shot()
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/session_mixin.py", line 416, in one_shot
    apply(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/modifier.py", line 119, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 126, in on_initialize
    self.resolved_mappings_ = self._resolve_mappings(state.model)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: Error resolving mappings for given architecture.Please refer to the README at /root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/README.md for more information.

still failed, sad.

Hey @rahul-tuli - could you provide some guidance on the SmoothQuant mappings here?

HelloCard · 2024-10-26T04:51:06Z

can't quant CohereForAI/aya-expanse-8b to W8A8, used latest llmcompressor_dev-0.2.0.dev0 code and python setup.py installed:

root@autodl-container-61bb4e82b9-21d3700f:~/autodl-tmp# python3 quant.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
2024-10-26T12:40:05.631257+0800 | main | WARNING - Process rank: 0, device: cuda:0, n_gpu: 2, distributed training: True, 16-bits training: False
2024-10-26T12:40:05.631989+0800 | main | INFO - Training/evaluation parameters TrainingArguments(
_n_gpu=2,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
clear_sparse_session=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_oneshot=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.NO,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./output/runs/Oct26_12-40-05_autodl-container-61bb4e82b9-21d3700f,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_kwargs={},
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
no_cuda=False,
num_train_epochs=3.0,
oneshot_device=cuda:0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
optim_target_modules=None,
output_dir=./output,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
recipe=[SmoothQuantModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, smoothing_strength=0.8, mappings=None, ignore=None, num_calibration_steps=None, calibration_function=None, hooks_=None, resolved_mappings_=None, scales_=None), GPTQModifier(index=None, group=None, start=None, end=None, update=None, initialized_structure_=False, initialized_=False, finalized_=False, started_=False, ended_=False, sequential_update=True, targets='Linear', sequential_targets=None, block_size=128, quantize=True, dampening_frac=0.01, config_groups=None, ignore=['lm_head'], disable_quantization_observer_epoch=None, num_calibration_steps=None, scheme='W8A8', model=None, layer_compressors_=None, compressible_layers_=None, quantization_modifier_=None)],
recipe_args=None,
remove_unused_columns=True,
report_to=['tensorboard'],
restore_callback_states_from_checkpoint=False,
resume_from_checkpoint=None,
run_name=./output,
run_stages=False,
save_compressed=True,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
skip_memory_metrics=True,
split_batches=None,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torch_empty_cache_steps=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_liger_kernel=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
2024-10-26T12:40:06.181665+0800 | _check_create_state | INFO - State created for compression lifecycle
2024-10-26T12:40:06.184258+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-26T12:40:06.185970+0800 | pre_initialize_structure | INFO - Compression lifecycle structure pre-initialized for 0 modifiers
2024-10-26T12:40:06.281241+0800 | one_shot | INFO - *** One Shot ***
2024-10-26T12:40:06.285714+0800 | from_modifiers | INFO - Creating recipe from modifiers
cannot import name 'forward_quantize' from 'compressed_tensors.quantization.lifecycle.forward' (/root/miniconda3/lib/python3.12/site-packages/compressed_tensors/quantization/lifecycle/forward.py)
2024-10-26T12:40:06.307288+0800 | _check_compile_recipe | INFO - Recipe compiled and 1 modifiers created
2024-10-26T12:40:06.307409+0800 | _infer_mappings_from_model | INFO - No SmoothQuantModifier.mappings provided, inferring from model...
2024-10-26T12:40:06.307464+0800 | get_layer_mappings_from_architecture | INFO - Architecture CohereForCausalLM not found in mappings. Using default mappings: [LayerMap(balance_layers=['re:.*q_proj', 're:.*k_proj', 're:.*v_proj'], smooth_layers='re:.*input_layernorm'), LayerMap(balance_layers=['re:.*gate_proj', 're:.*up_proj'], smooth_layers='re:.*post_attention_layernorm')]
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 72, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 178, in _resolve_mappings
    to_smooth_layers = get_layers(to_smooth, model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 166, in get_layers
    return match_layers_params(targets, module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/utils/pytorch/module.py", line 160, in match_layers_params
    raise ValueError(f"Could not find targets {missed} in module {module}")
ValueError: Could not find targets ['re:.*post_attention_layernorm'] in module CohereForCausalLM(
  (model): CohereModel(
    (embed_tokens): Embedding(256000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x CohereDecoderLayer(
        (self_attn): CohereSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): CohereRotaryEmbedding()
        )
        (mlp): CohereMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): CohereLayerNorm()
      )
    )
    (norm): CohereLayerNorm()
    (rotary_emb): CohereRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=256000, bias=False)
)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/autodl-tmp/quant.py", line 48, in <module>
    oneshot(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 76, in oneshot
    main(model_args, data_args, training_args)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/text_generation.py", line 363, in main
    stage_runner.one_shot()
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/runner.py", line 171, in one_shot
    self.trainer.one_shot(calibration_data=calib_data, stage=stage)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/transformers/finetune/session_mixin.py", line 420, in one_shot
    apply(
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session_functions.py", line 184, in apply
    return active_session().apply(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 210, in apply
    self.initialize(**kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/session.py", line 156, in initialize
    mod_data = self._lifecycle.initialize(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/core/lifecycle.py", line 126, in initialize
    data = mod.initialize(state=self.state, **extras)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/stage.py", line 124, in initialize
    modifier.initialize(state, **kwargs)
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/modifier.py", line 119, in initialize
    initialized = self.on_initialize(state=state, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/base.py", line 126, in on_initialize
    self.resolved_mappings_ = self._resolve_mappings(state.model)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/utils.py", line 75, in wrapper
    raise RuntimeError(
RuntimeError: Error resolving mappings for given architecture.Please refer to the README at /root/miniconda3/lib/python3.12/site-packages/llmcompressor_dev-0.2.0.dev0-py3.12.egg/llmcompressor/modifiers/smoothquant/README.md for more information.

HelloCard · 2024-10-26T05:19:40Z

I tried to manually locate post_attention_layernorm as described in llmcompressor/modifiers/smoothquant/README.md, but I couldn't. There is no post_attention_layernorm or anything with similar name in model.safetensors.index.json, and I can't find any comments related to "post" in transformers\models\cohere\modeling_cohere.py.

robertgshaw2-neuralmagic added the enhancement New feature or request label Aug 8, 2024

robertgshaw2-neuralmagic pinned this issue Aug 8, 2024

robertgshaw2-neuralmagic mentioned this issue Aug 11, 2024

Q3 ROADMAP #30

Open

21 tasks

MODEL REQUESTS #69

MODEL REQUESTS #69

Comments

robertgshaw2-neuralmagic commented Aug 8, 2024

BlackSamorez commented Aug 8, 2024

robertgshaw2-neuralmagic commented Aug 8, 2024 • edited Loading

BlackSamorez commented Aug 8, 2024

robertgshaw2-neuralmagic commented Aug 8, 2024

BlackSamorez commented Aug 8, 2024

robertgshaw2-neuralmagic commented Aug 8, 2024

edgan8 commented Aug 8, 2024

Syst3m1cAn0maly commented Aug 10, 2024

robertgshaw2-neuralmagic commented Aug 11, 2024

robertgshaw2-neuralmagic commented Aug 11, 2024

robertgshaw2-neuralmagic commented Aug 12, 2024 • edited Loading

Lin-K76 commented Aug 12, 2024

robertgshaw2-neuralmagic commented Aug 12, 2024

yzlnew commented Aug 12, 2024 • edited Loading

robertgshaw2-neuralmagic commented Aug 12, 2024

supa-thibaud commented Aug 20, 2024

halexan commented Aug 22, 2024

robertgshaw2-neuralmagic commented Aug 22, 2024

sigjhl commented Aug 23, 2024

fengyang95 commented Aug 24, 2024

fengyang95 commented Aug 24, 2024 • edited Loading

robertgshaw2-neuralmagic commented Aug 24, 2024

halexan commented Aug 25, 2024 • edited Loading

fengyang95 commented Aug 25, 2024

robertgshaw2-neuralmagic commented Aug 25, 2024

halexan commented Aug 26, 2024

fengyang95 commented Aug 26, 2024

robertgshaw2-neuralmagic commented Aug 28, 2024

fengyang95 commented Aug 29, 2024 • edited Loading

fengyang95 commented Aug 30, 2024

robertgshaw2-neuralmagic commented Aug 30, 2024

fengyang95 commented Aug 30, 2024

fengyang95 commented Sep 10, 2024 • edited Loading

robertgshaw2-neuralmagic commented Sep 10, 2024

dsikka commented Sep 10, 2024

dsikka commented Sep 12, 2024

fengyang95 commented Sep 12, 2024

fengyang95 commented Sep 12, 2024

fengyang95 commented Sep 13, 2024 • edited Loading

dsikka commented Sep 13, 2024

fengyang95 commented Sep 13, 2024

TheAhmadOsman commented Sep 17, 2024

dsikka commented Sep 17, 2024

TheAhmadOsman commented Sep 18, 2024

TheAhmadOsman commented Sep 18, 2024

TheAhmadOsman commented Sep 24, 2024

dsikka commented Sep 27, 2024

mphilippnv commented Oct 8, 2024

Syst3m1cAn0maly commented Oct 9, 2024

HelloCard commented Oct 13, 2024

samos123 commented Oct 13, 2024 • edited Loading

mgoin commented Oct 14, 2024

samos123 commented Oct 14, 2024

robertgshaw2-neuralmagic commented Oct 15, 2024

robertgshaw2-neuralmagic commented Oct 15, 2024

HelloCard commented Oct 26, 2024

HelloCard commented Oct 26, 2024

robertgshaw2-neuralmagic commented Aug 8, 2024 •

edited

Loading

robertgshaw2-neuralmagic commented Aug 12, 2024 •

edited

Loading

yzlnew commented Aug 12, 2024 •

edited

Loading

fengyang95 commented Aug 24, 2024 •

edited

Loading

halexan commented Aug 25, 2024 •

edited

Loading

fengyang95 commented Aug 29, 2024 •

edited

Loading

fengyang95 commented Sep 10, 2024 •

edited

Loading

fengyang95 commented Sep 13, 2024 •

edited

Loading

samos123 commented Oct 13, 2024 •

edited

Loading