How to save and load ipex optimized model? #686
Comments
Hi @benja-matic, thank you for taking the time to try IPEX and for reporting the issue.
Hi @ZhaoqiongZ, thanks for the fast reply. I tested the code you provided, and it runs without any errors. However, it doesn't solve the underlying issue of weight sharing across concurrent instances of a model. Weight sharing was the motivation for figuring out how to save and load IPEX models. I've been thrilled with the latency numbers I'm getting when using individual IPEX-optimized models, so I'd love to be able to use IPEX for concurrent models. My use case involves two things: 1) a Python script similar to the one above to load the model and process requests, and 2) a load balancer that runs parallel versions of the Python script. The way I do weight sharing for non-IPEX-optimized models is `torch.load(path, mmap=True)`.
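(For reference, an illustration only of that `torch.load(path, mmap=True)` pattern, using a toy module as a stand-in for a real checkpoint; the path and layer sizes are placeholders.)

```python
import torch
import torch.nn as nn

# Toy stand-in module; the same pattern applies to a full LLM state_dict.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# One-time export of the weights (done once, not per worker).
torch.save(model.state_dict(), "shared_weights.pt")  # placeholder path

# In each worker process: mmap=True maps the file instead of copying it into
# process memory, so concurrent workers loading the same file share the
# read-only weight pages.
state_dict = torch.load("shared_weights.pt", mmap=True, map_location="cpu")

# assign=True attaches the mapped tensors directly instead of copying them
# into freshly allocated parameter storage.
model.load_state_dict(state_dict, assign=True)
model.eval()
```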
Hi @benja-matic, hope the following method meets your requirement: you can try to load the model in the main process and run the parallel versions in multiple threads within that main process. Refer to the following code.
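(As an illustration only, not the snippet the comment refers to: a load-once, multi-threaded setup might look like the following; the model ID, dtype, and prompts are placeholders.)

```python
import torch
import intel_extension_for_pytorch as ipex
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder

# Load and optimize the model once, in the main process.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

def serve(prompt: str) -> str:
    # Every thread reuses the same optimized model object, so the weights
    # exist only once in the process.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output[0], skip_special_tokens=True)

prompts = ["Hello!", "What does IPEX do?", "Summarize mmap in one line."]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for reply in pool.map(serve, prompts):
        print(reply)
```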
I'm a bit confused there; maybe I didn't understand it clearly. Just to load a quantized model, I need to run the following code:

```python
m3 = torch.jit.load("model_trace_graph")
m3 = torch.jit.freeze(m3.eval())
ipex._set_optimized_model_for_generation(model, optimized_model=m3)
```

But what is `model` in the last line?
Hi @jianan-gu, please help on this issue.
I am also confused. I perform the quantization as below. It is not clear how to load these models, and why we are loading the original model and not just the quantized model.

```bash
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m mistralai/Mistral-7B-Instruct-v0.2 --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results/INT8"

OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m mistralai/Mistral-7B-Instruct-v0.2 --ipex-weight-only-quantization --weight-dtype INT4 --gptq --quant-with-amp --output-dir "saved_results/INT4"
```

There are two models getting created after the above steps.
Hi @jianan-gu, please help on this issue.
Hi @ZhaoqiongZ, in my use case I don't have control over forking processes. Is there any way to share weights across multiple models if another process is responsible for forking? My application is based on the NVIDIA Triton Inference Server, and I typically have multiple models running concurrently. This means that the Triton server, not my code, is the one that calls the code that loads the model. Does that make sense? Please let me know if I can further clarify. Thanks again for the support.
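(For context, an illustrative sketch of a Triton Python-backend `model.py`: `initialize` runs once per model instance inside a process Triton manages, which is why each instance otherwise ends up with its own copy of the weights. Class and method names follow the standard Python backend API; the model ID is a placeholder and generation details are omitted.)

```python
# model.py for a Triton Python-backend model (illustrative sketch).
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Triton calls this once per model instance, in a process it manages,
        # so this code has no control over how or when instances are created.
        model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
        self.model = ipex.llm.optimize(model, dtype=torch.bfloat16)

    def execute(self, requests):
        # Decode each request, call self.model.generate(...), and wrap the
        # results in pb_utils.InferenceResponse objects (omitted here).
        raise NotImplementedError
```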
For deployment or benchmark only (when quantization is already done), we still need an object of the original model. More details:
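(Putting the snippets from this thread together, an illustrative sketch only; the model ID and traced-graph path are placeholders, and the exact preparation of the original model object should follow the project's run.py example.)

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"        # placeholder
TRACED_GRAPH = "saved_results/INT8/model_trace_graph"  # placeholder path to the saved TorchScript graph

# The original model object is still needed: it supplies the config and the
# generate() loop, while the quantized TorchScript graph replaces its
# forward computation.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Prepare the original model for generation (mirroring what run.py does);
# exact arguments may differ by IPEX version.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, deployment_mode=False)

# Attach the saved quantized graph to the original model object.
m3 = torch.jit.load(TRACED_GRAPH)
m3 = torch.jit.freeze(m3.eval())
ipex._set_optimized_model_for_generation(model, optimized_model=m3)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```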
Hi @azhuvath |
Hi @benja-matic
For 3) to work as expected, here we are optimizing the model for generation with […]. And torch.jit.load has […]. Besides, with flag […]
@jianan-gu I am not using any command-line tools for inference; the inference is custom code written in Python. I am seeing worse performance after quantization, which is not the case with other frameworks like OpenVINO. I am not sure what the correct Python code is for running inference with a quantized model.
Describe the issue
Hi IPEX team,
I have an application where I want to serve multiple models concurrently, and I want to share weights across concurrent instances. I normally do this with `torch.load(path, mmap=True)`. However, calling `ipex.llm.optimize` will interfere with weight sharing because IPEX manipulates the weights in memory (it does a deep copy, from what I understand). I would like to instead save the IPEX-optimized model and load it (something like `torch.load(ipex_model, mmap=True)`). However, I can't figure out how to do this, and was hoping you could provide an example.

How to reproduce:
My miniconda env.yml file is listed below. `pip install -r requirements.txt` may not work here, but you can create this env easily with `conda create -n ipex_issue python=3.10 && conda activate ipex_issue`, followed by the install instructions here and `pip install transformers==4.38.1`, as sketched below.
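(As shell commands, for convenience; the IPEX install line is a placeholder for the linked instructions.)

```bash
conda create -n ipex_issue python=3.10 && conda activate ipex_issue
# install PyTorch and intel-extension-for-pytorch following the linked instructions
pip install transformers==4.38.1
```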
I am using Python 3.10 on an AWS c7i.2xlarge instance.

Here are the things I have tried:
As a side note, I understand you normally use subprocess to deploy multiple concurrent models, but this is not an option for my case because the logic that decides how and when to fork processes is separated from the part of the code that loads the model.
At some point I think I was able to get option 0) above to work, but the loaded model would be a vanilla transformer without IPEX optimizations, and I also can't seem to reproduce that behavior, at least in this env.
Any help would be much appreciated.