How to save and load ipex optimized model? #686
Comments
Hi @benja-matic, thank you for taking the time to try IPEX and for reporting the issue.
Hi @ZhaoqiongZ, thanks for the fast reply. I tested the code you provided, and it runs without any errors. However, it doesn't solve the underlying issue of weight sharing across concurrent instances of a model. Weight sharing was the motivation for figuring out how to save and load IPEX models. I've been thrilled with the latency numbers I'm getting when using individual IPEX-optimized models, so I'd love to be able to use IPEX for concurrent models. My use case involves two things: 1) a Python script similar to the one above to load the model and process requests, and 2) a load balancer that runs parallel versions of the Python script. The way I do weight sharing for non-IPEX-optimized models is `torch.load(path, mmap=True)`.
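(For reference, an illustration only of that `torch.load(path, mmap=True)` pattern, using a toy module as a stand-in for a real checkpoint; the path and layer sizes are placeholders.)

```python
import torch
import torch.nn as nn

# Toy stand-in module; the same pattern applies to a full LLM state_dict.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))

# One-time export of the weights (done once, not per worker).
torch.save(model.state_dict(), "shared_weights.pt")  # placeholder path

# In each worker process: mmap=True maps the file instead of copying it into
# process memory, so concurrent workers loading the same file share the
# read-only weight pages.
state_dict = torch.load("shared_weights.pt", mmap=True, map_location="cpu")

# assign=True attaches the mapped tensors directly instead of copying them
# into freshly allocated parameter storage.
model.load_state_dict(state_dict, assign=True)
model.eval()
```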
Hi @benja-matic, hope the following method meets your requirement: you can try to load the model in the main process and run the parallel versions in multiple threads within that main process. Refer to the following code.
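(As an illustration only, not the snippet the comment refers to: a load-once, multi-threaded setup might look like the following; the model ID, dtype, and prompts are placeholders.)

```python
import torch
import intel_extension_for_pytorch as ipex
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder

# Load and optimize the model once, in the main process.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

def serve(prompt: str) -> str:
    # Every thread reuses the same optimized model object, so the weights
    # exist only once in the process.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output[0], skip_special_tokens=True)

prompts = ["Hello!", "What does IPEX do?", "Summarize mmap in one line."]
with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for reply in pool.map(serve, prompts):
        print(reply)
```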
I'm a bit confused there; maybe I didn't understand it clearly. Just to load a quantized model, I need to run the following code:

```python
m3 = torch.jit.load("model_trace_graph")
m3 = torch.jit.freeze(m3.eval())
ipex._set_optimized_model_for_generation(model, optimized_model=m3)
```

But what is `model` in the last line?
Hi @jianan-gu, please help on this issue.
I am also confused. I perform the quantization as below. It is not clear how to load these models, and why we are loading the original model and not just the quantized model.

```bash
OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m mistralai/Mistral-7B-Instruct-v0.2 --ipex-weight-only-quantization --weight-dtype INT8 --quant-with-amp --output-dir "saved_results/INT8"

OMP_NUM_THREADS=56 numactl -m 0 -C 0-55 python run.py --benchmark -m mistralai/Mistral-7B-Instruct-v0.2 --ipex-weight-only-quantization --weight-dtype INT4 --gptq --quant-with-amp --output-dir "saved_results/INT4"
```

There are two models getting created after the above steps.
Hi @jianan-gu, please help on this issue.
Hi @ZhaoqiongZ, in my use case I don't have control over forking processes. Is there any way to share weights across multiple models if another process is responsible for forking? My application is based on the NVIDIA Triton Inference Server, and I typically have multiple models running concurrently. This means that the Triton server, not my code, is the one that calls the code that loads the model. Does that make sense? Please let me know if I can further clarify. Thanks again for the support.
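(For context, an illustrative sketch of a Triton Python-backend `model.py`: `initialize` runs once per model instance inside a process Triton manages, which is why each instance otherwise ends up with its own copy of the weights. Class and method names follow the standard Python backend API; the model ID is a placeholder and generation details are omitted.)

```python
# model.py for a Triton Python-backend model (illustrative sketch).
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Triton calls this once per model instance, in a process it manages,
        # so this code has no control over how or when instances are created.
        model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
        self.model = ipex.llm.optimize(model, dtype=torch.bfloat16)

    def execute(self, requests):
        # Decode each request, call self.model.generate(...), and wrap the
        # results in pb_utils.InferenceResponse objects (omitted here).
        raise NotImplementedError
```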
For deployment or benchmark only (when quantization is already done), we still need an object of the original model. More details:
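(Putting the snippets from this thread together, an illustrative sketch only; the model ID and traced-graph path are placeholders, and the exact preparation of the original model object should follow the project's run.py example.)

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"        # placeholder
TRACED_GRAPH = "saved_results/INT8/model_trace_graph"  # placeholder path to the saved TorchScript graph

# The original model object is still needed: it supplies the config and the
# generate() loop, while the quantized TorchScript graph replaces its
# forward computation.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Prepare the original model for generation (mirroring what run.py does);
# exact arguments may differ by IPEX version.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, deployment_mode=False)

# Attach the saved quantized graph to the original model object.
m3 = torch.jit.load(TRACED_GRAPH)
m3 = torch.jit.freeze(m3.eval())
ipex._set_optimized_model_for_generation(model, optimized_model=m3)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```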
Hi @azhuvath |
Hi @benja-matic
For 3) to work as expected, here we are optimizing the model for generation with […]. And torch.jit.load has […]. Besides, with flag […]
@jianan-gu I am not using any command-line tools for inference; the inference is custom code written in Python. I am seeing worse performance after quantization, which is not the case with other frameworks like OpenVINO. I am not sure what the correct Python code is for running inference with a quantized model.
Describe the issue
Hi IPEX team,
I have an application where I want to serve multiple models concurrently, and I want to share weights across concurrent instances. I normally do this with `torch.load(path, mmap=True)`. However, calling `ipex.llm.optimize` will interfere with weight sharing because IPEX manipulates the weights in memory (it does a deep copy, from what I understand). I would like to instead save the IPEX-optimized model and load it (something like `torch.load(ipex_model, mmap=True)`). However, I can't figure out how to do this, and was hoping you could provide an example.

How to reproduce:
My miniconda env.yml file is listed below. `pip install -r requirements.txt` may not work here, but you can create this env easily with `conda create -n ipex_issue python=3.10 && conda activate ipex_issue`, followed by the install instructions here and `pip install transformers==4.38.1`, as sketched below.
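(As shell commands, for convenience; the IPEX install line is a placeholder for the linked instructions.)

```bash
conda create -n ipex_issue python=3.10 && conda activate ipex_issue
# install PyTorch and intel-extension-for-pytorch following the linked instructions
pip install transformers==4.38.1
```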
I am using Python 3.10 on an AWS c7i.2xlarge instance.

Here are the things I have tried:
As a side note, I understand you normally use subprocess to deploy multiple concurrent models, but this is not an option for my case because the logic that decides how and when to fork processes is separated from the part of the code that loads the model.
At some point I think I was able to get option 0) above to work, but the loaded model would be a vanilla transformer without IPEX optimizations, and I also can't seem to reproduce that behavior, at least in this env.
Any help would be much appreciated.