imatrix : offload to GPU support #4957
Conversation
Cool!
Tested on LLaMA-v1-7B. With 100 chunks, the calculation finishes in 52 seconds on the GPU (vs 9 minutes on the CPU)! Using the GPU imatrix with IQ2_XXS I get PPL = 8.5372. With the CPU imatrix I had PPL = 8.5435. So, it looks like it is working.
ROCm also seems to work on Linux.
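For anyone reproducing this comparison, the quantize and perplexity steps would look roughly like the following sketch (file names and the `-ngl` value are placeholders, not the exact commands used above):

```sh
# Quantize using the GPU-computed importance matrix (paths are illustrative)
./quantize --imatrix imatrix-gpu.dat llama-v1-7b-f16.gguf llama-v1-7b-iq2_xxs.gguf IQ2_XXS

# Measure perplexity of the result, offloading layers to the GPU
./perplexity -m llama-v1-7b-iq2_xxs.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
```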
I've just made my first GGUF repo that uses the new imatrix method, here: https://huggingface.co/TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF I used this PR to speed up the imatrix creation. On a 34B model, with a 5000-line (76,859 words - I didn't count the tokens) dataset, it took 21 minutes. I used this command:
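(The exact command isn't reproduced above; an invocation along these lines, with a hypothetical model file name and `-ngl` value, matches the description - the calibration file name is taken from a later comment in this thread:)

```sh
# Illustrative reconstruction: compute the importance matrix with GPU offload
./imatrix -m yi-34b-200k-dare-megamerge-v8.fp16.gguf \
    -f open-instruct-5K.txt \
    -o yi-34b.imatrix \
    -ngl 99
```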
Hope I did it right! The model is coherent at least! :)
@TheBloke My experience is that it is better to use a context of 512 when computing the imatrix. When running inference with a quantized model where the imatrix calculation used a context of 512, I get a lower perplexity, even for a context of 4096. Not sure why this is the case.
Ah interesting, thanks for the info. I've not done any PPL testing on the result yet. I'll try 512 next time then, thanks.
My theory is that having more unique contexts in total is beneficial: it makes the diversity of "starting contexts" significantly larger, and therefore you get more unique activation data.
@ggerganov No errors were generated, however the file size was one tenth of the CPU-only run. Then, when I tried to quantize using the new imatrix, it couldn't find a boatload of layers in the matrix file. It worked fine for CPU only. Let me know if you require any further details.
Yes, there is an issue with Mixtral - will look into fixing it.
If I try Mixtral-8x7B with partial offload with
Mixtral should be fixed now - the `ggml_mul_mat_id` handling in `imatrix` has been fixed.
Did a quick test with 20 chunks and built a quant with it. Working now. Thanks for that.
This is great. I computed an imatrix with 1000 chunks in around 10 minutes for a 13B/14B model. This allows us to do some extensive experiments on large calibration datasets.
* backend : add eval callback (ggml-ci)
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation (ggml-ci)
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id handling (ggml-ci)
* ci : add imatrix test (ggml-ci)
* ci : rearrange output (ggml-ci)
Thanks for the exact command, but what is the content of "open-instruct-5K.txt"? Can I use part of the dataset I fine-tuned with for the imatrix? What if I fine-tune in chat format, so the data is not free text (e.g. wiki) - how should I format the txt file I use for the imatrix? I think @TheBloke is on vacation these days, but if anyone else has some hints/clarifications, they would be much appreciated.
How do I use ./perplexity to measure the model after applying the imatrix? I get the Mistral-q4-imatrix.gguf model and run this command. I also tried `mv Mistral-q4-imatrix.gguf Mistral-q4-imatrix.imatrix` and get the same error. I want to ask whether this supports perplexity measurement.
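For what it's worth, the quantized GGUF should not need to be renamed - the imatrix file is only consumed by `quantize`, and `./perplexity` runs on the quantized model directly. A typical invocation (dataset path and `-ngl` value are placeholders) would be roughly:

```sh
# Run perplexity directly on the quantized model produced by quantize; no renaming needed
./perplexity -m Mistral-q4-imatrix.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
```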
close #4931
Make use of the new backend scheduler eval callback introduced in #4935 to grab activations from GPU memory.
Usage:
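(The original usage snippet is not reproduced here; a command along these lines, with placeholder paths and assuming a CUDA build via the Makefile's `LLAMA_CUBLAS` flag as of this PR, shows the idea - the only change from the CPU workflow is adding `-ngl` to offload layers to the GPU:)

```sh
# Build with CUDA support and compute the importance matrix on the GPU (paths are illustrative)
LLAMA_CUBLAS=1 make -j
./imatrix -m models/llama-7b/ggml-model-f16.gguf -f wikitext-2-raw/wiki.train.raw -ngl 99
```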
This should be significantly faster. I haven't confirmed the correctness of the results yet, so please give it a try and let me know if the numbers are as expected.