imatrix : offload to GPU support #4957
Conversation
Cool!
Tested on LLaMA-v1-7B. With 100 chunks, the calculation finishes in 52 seconds on the GPU (vs 9 minutes on the CPU)! Using the GPU imatrix with IQ2_XXS I get PPL = 8.5372. With the CPU imatrix I had PPL = 8.5435. So, it looks like it is working.
ROCm also seems to work on Linux.
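For anyone reproducing this comparison, the quantize and perplexity steps would look roughly like the following sketch (file names and the `-ngl` value are placeholders, not the exact commands used above):

```sh
# Quantize using the GPU-computed importance matrix (paths are illustrative)
./quantize --imatrix imatrix-gpu.dat llama-v1-7b-f16.gguf llama-v1-7b-iq2_xxs.gguf IQ2_XXS

# Measure perplexity of the result, offloading layers to the GPU
./perplexity -m llama-v1-7b-iq2_xxs.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
```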
I've just made my first GGUF repo that uses the new imatrix method, here: https://huggingface.co/TheBloke/Yi-34B-200K-DARE-megamerge-v8-GGUF I used this PR to speed up the imatrix creation. On a 34B model, with a 5000-line (76,859 words - I didn't count the tokens) dataset, it took 21 minutes. I used this command:
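(The exact command isn't reproduced above; an invocation along these lines, with a hypothetical model file name and `-ngl` value, matches the description - the calibration file name is taken from a later comment in this thread:)

```sh
# Illustrative reconstruction: compute the importance matrix with GPU offload
./imatrix -m yi-34b-200k-dare-megamerge-v8.fp16.gguf \
    -f open-instruct-5K.txt \
    -o yi-34b.imatrix \
    -ngl 99
```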
Hope I did it right! The model is coherent at least! :)
@TheBloke My experience is that it is better to use a context of 512 when computing the imatrix. When running inference with a quantized model where the imatrix calculation used a context of 512, I get a lower perplexity, even for a context of 4096. Not sure why this is the case.
Ah interesting, thanks for the info. I've not done any PPL testing on the result yet. I'll try 512 next time then, thanks.
My theory is that having more unique contexts in total is beneficial: it makes the diversity of "starting contexts" significantly larger, and therefore you get more unique activation data.
@ggerganov No errors were generated, however the file size was one tenth of the CPU-only run. Then, when I tried to quantize using the new imatrix, it couldn't find a boatload of layers in the matrix file. It worked fine for CPU only. Let me know if you require any further details.
Yes, there is an issue with Mixtral - will look into fixing it.
If I try Mixtral-8x7B with partial offload with
Mixtral should be fixed now - the `ggml_mul_mat_id` handling in `imatrix` has been fixed.
Did a quick test with 20 chunks and built a quant with it. Working now. Thanks for that.
This is great. I computed an imatrix with 1000 chunks in around 10 minutes for a 13B/14B model. This allows us to do some extensive experiments on large calibration datasets.
* backend : add eval callback (ggml-ci)
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation (ggml-ci)
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id handling (ggml-ci)
* ci : add imatrix test (ggml-ci)
* ci : rearrange output (ggml-ci)
Thanks for the exact command, but what is the content of "open-instruct-5K.txt"? Can I use part of the dataset I fine-tuned with for the imatrix? What if I fine-tune in chat format, so the data is not free text (e.g. wiki) - how should I format the txt file I use for the imatrix? I think @TheBloke is on vacation these days, but if anyone else has some hints/clarifications, they would be much appreciated.
How do I use ./perplexity to measure the model after applying the imatrix? I get the Mistral-q4-imatrix.gguf model and run this command. I also tried `mv Mistral-q4-imatrix.gguf Mistral-q4-imatrix.imatrix` and get the same error. I want to ask whether this supports perplexity measurement.
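For what it's worth, the quantized GGUF should not need to be renamed - the imatrix file is only consumed by `quantize`, and `./perplexity` runs on the quantized model directly. A typical invocation (dataset path and `-ngl` value are placeholders) would be roughly:

```sh
# Run perplexity directly on the quantized model produced by quantize; no renaming needed
./perplexity -m Mistral-q4-imatrix.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
```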
close #4931
Make use of the new backend scheduler eval callback introduced in #4935 to grab activations from GPU memory.
Usage:
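(The original usage snippet is not reproduced here; a command along these lines, with placeholder paths and assuming a CUDA build via the Makefile's `LLAMA_CUBLAS` flag as of this PR, shows the idea - the only change from the CPU workflow is adding `-ngl` to offload layers to the GPU:)

```sh
# Build with CUDA support and compute the importance matrix on the GPU (paths are illustrative)
LLAMA_CUBLAS=1 make -j
./imatrix -m models/llama-7b/ggml-model-f16.gguf -f wikitext-2-raw/wiki.train.raw -ngl 99
```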
This should be significantly faster. I haven't confirmed the correctness of the results yet, so please give it a try and let me know if the numbers are as expected.