
Quantization with lora weights #467

Open
xinyual opened this issue Dec 6, 2023 · 5 comments

Comments

@xinyual commented Dec 6, 2023

I have a Mistral model with LoRA weights. Is there any way I can quantize the whole model together with the LoRA weights?
I tried the following steps but ran into problems:

model = MistralGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
    adapter_name="dsl1",
)
print("start quantize")
model.quantize(examples)

When I then load the quantized model and use do_sample to generate, like:

with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        temperature=0.01,
    )

it raises: RuntimeError: probability tensor contains either inf, nan or element < 0
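For reference, one alternative flow that is sometimes used, and which is not attempted anywhere in this thread, is to merge the LoRA adapter into the base model with PEFT's merge_and_unload() before quantizing. A minimal sketch, assuming base_model and lora_weights as defined above; "mistral-merged-fp16" is a placeholder output path:

# Assumption, not from this thread: merge the LoRA adapter into the base
# weights first, then quantize the merged checkpoint instead of quantizing
# through the PeftModel wrapper.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, lora_weights).merge_and_unload()
merged.save_pretrained("mistral-merged-fp16")  # placeholder output directory

# The merged checkpoint could then be quantized with auto-gptq as in the snippet above:
# model = MistralGPTQForCausalLM.from_pretrained("mistral-merged-fp16", quantize_config)
# model.quantize(examples)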

@fxmarty (Collaborator) commented Dec 7, 2023

Hi, likely related: #295 & huggingface/transformers#27179

@fxmarty (Collaborator) commented Dec 7, 2023

Could you provide a reproduction?

@xinyual (Author) commented Dec 13, 2023

Sorry for the late reply.

# Imports assumed from the libraries used below (not shown in the original snippet):
from transformers import AutoTokenizer
from auto_gptq import BaseQuantizeConfig
from auto_gptq.modeling import MistralGPTQForCausalLM
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)
examples = [
    tokenizer(prompt)
]
quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the model to 4-bit
    group_size=128,  # 128 is the recommended value
    desc_act=False,  # False significantly speeds up inference, at a small cost in perplexity
)
model = MistralGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model = PeftModel.from_pretrained(
    model,
    lora_weights,
)
model.quantize(examples)
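The quantized model is presumably saved between these two snippets so it can be reloaded with from_quantized below; the assumed intermediate step (not shown in the issue) would look like:

# Assumed step, not shown in the issue: persist the quantized weights so that
# from_quantized(quantized_model_dir) can load them in the next snippet.
model.save_quantized(quantized_model_dir)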

Then:

model = MistralGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:1")
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        top_k=top_k,
        top_p=top_p,
        max_length=2500 + 100,
        temperature=0.01,
    )
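As a side note, the RuntimeError reported above is raised by torch.multinomial when the sampling probabilities contain NaN or inf, so one hypothetical sanity check (not part of the issue) is to inspect the raw logits of the quantized model before any sampling, assuming the GPTQ wrapper forwards calls to the underlying model as usual:

# Hypothetical diagnostic: check whether the quantized model already produces
# NaN/inf logits before sampling; model and input_ids are the same as above.
import torch

with torch.no_grad():
    logits = model(input_ids).logits
print("any nan:", torch.isnan(logits).any().item())
print("any inf:", torch.isinf(logits).any().item())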

@fxmarty (Collaborator) commented Dec 13, 2023

Thank you! What is your base_model? Is there an already quantized model on the HF Hub that we could use to reproduce this?

@xinyual (Author) commented Dec 14, 2023

It's mistralai/Mistral-7B-Instruct-v0.1 from the Hugging Face Hub.
