
Bug: Cannot load DeepSeek-Coder-V2-Instruct #8174

Closed
MarsBlessed opened this issue Jun 27, 2024 · 12 comments
Labels
bug-unconfirmed critical severity Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss)

Comments

@MarsBlessed

MarsBlessed commented Jun 27, 2024

What happened?

I am trying to use a quantized (q2_k) version of DeepSeek-Coder-V2-Instruct and it fails to load the model completely: every time I run it, the process is killed after some time.

Name and Version

./llama-cli --version
version: 3253 (ab36791)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0

What operating system are you seeing the problem on?

Mac

Relevant log output

./llama-cli -m ./models/DeepSeek-Coder-V2-Instruct_q2_K.gguf --color -i --multiline-input --log-enable -p "just say hallo"
Log start
main: build = 3253 (ab367911)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.5.0
main: seed  = 1719507332
llama_model_loader: loaded meta data with 39 key-value pairs and 959 tensors from ./models/DeepSeek-Coder-V2-Instruct_q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.name str              = DeepSeek-Coder-V2-Instruct
llama_model_loader: - kv   2:                      deepseek2.block_count u32              = 60
llama_model_loader: - kv   3:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   4:                 deepseek2.embedding_length u32              = 5120
llama_model_loader: - kv   5:              deepseek2.feed_forward_length u32              = 12288
llama_model_loader: - kv   6:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   7:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv   8:                   deepseek2.rope.freq_base f32              = 10000,000000
llama_model_loader: - kv   9: deepseek2.attention.layer_norm_rms_epsilon f32              = 0,000001
llama_model_loader: - kv  10:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  11:                          general.file_type u32              = 10
llama_model_loader: - kv  12:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  13:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  14:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  15:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  16:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  17:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  18:       deepseek2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  19:                     deepseek2.expert_count u32              = 160
llama_model_loader: - kv  20:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  21:             deepseek2.expert_weights_scale f32              = 16,000000
llama_model_loader: - kv  22:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  23:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  24:              deepseek2.rope.scaling.factor f32              = 40,000000
llama_model_loader: - kv  25: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  26: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0,100000
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  34:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  35:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  300 tensors
llama_model_loader: - type q2_K:  479 tensors
llama_model_loader: - type q3_K:  179 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 2400
llm_load_vocab: token to piece cache size = 0,6661 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 102400
llm_load_print_meta: n_merges         = 99757
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_layer          = 60
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-06
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 12288
llm_load_print_meta: n_expert         = 160
llm_load_print_meta: n_expert_used    = 6
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000,0
llm_load_print_meta: freq_scale_train = 0,025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 236B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 235,74 B
llm_load_print_meta: model size       = 80,04 GiB (2,92 BPW)
llm_load_print_meta: general.name     = DeepSeek-Coder-V2-Instruct
llm_load_print_meta: BOS token        = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 126 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 1
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 1536
llm_load_print_meta: n_expert_shared      = 2
llm_load_print_meta: expert_weights_scale = 16,0
llm_load_print_meta: rope_yarn_log_mul    = 0,1000
llm_load_tensors: ggml ctx size =    0,80 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 73728,00 MiB, (73728,08 / 98304,00)

ggml_backend_metal_log_allocated_size: allocated buffer, size =  8748,94 MiB, (82477,02 / 98304,00)
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors:      Metal buffer size = 81961,29 MiB
llm_load_tensors:        CPU buffer size =   164,06 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 163840
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 0,025
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 103079,22 MB
zsh: killed     ./llama-cli -m ./models/DeepSeek-Coder-V2-Instruct_q2_K.gguf --color -i   -p
@MarsBlessed MarsBlessed added bug-unconfirmed critical severity Used to report critical severity bugs in llama.cpp (e.g. Crashing, Corrupted, Dataloss) labels Jun 27, 2024
@MarsBlessed
Author

There is no problem running other models such as qwen2_q8_0 or mixtral-8x-7b, and I have already tried other quantized variants of the same model with the same result.

@slaren
Collaborator

slaren commented Jun 27, 2024

Try a lower context size with -c, you are probably running out of memory.

@MarsBlessed
Author

MarsBlessed commented Jun 27, 2024

> Try a lower context size with -c, you are probably running out of memory.

Yes, that let me get a bit further, but not very far. Here is the extra log I get even at -c 512:

llama_kv_cache_init:      Metal KV buffer size =  2400,00 MiB
llama_new_context_with_model: KV self size  = 2400,00 MiB, K (f16): 1440,00 MiB, V (f16):  960,00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0,39 MiB
llama_new_context_with_model:      Metal compute buffer size =   271,00 MiB
llama_new_context_with_model:        CPU compute buffer size =    11,01 MiB
llama_new_context_with_model: graph nodes  = 4480
llama_new_context_with_model: graph splits = 2
libc++abi: terminating due to uncaught exception of type std::length_error: vector
zsh: abort      ./llama-cli -m ./models/DeepSeek-Coder-V2-Instruct_q2_K.gguf --color -i   -p

@slaren
Collaborator

slaren commented Jun 27, 2024

Can you link where you downloaded this model?

@MarsBlessed
Author

MarsBlessed commented Jun 27, 2024

> Can you link where you downloaded this model?

This particular variant was made by me using llama-quantize on the original model, but there are other ready-to-download quantized variants on Hugging Face, e.g. here (which I also tried).

@slaren
Collaborator

slaren commented Jun 27, 2024

Can you run it with a debugger and see where the exception is being thrown? With a build with debug symbols.

@MarsBlessed
Author

MarsBlessed commented Jun 27, 2024

> Can you run it with a debugger and see where the exception is being thrown? With a build with debug symbols.

Could you help me with the command line to produce a build with debug symbols?

@slaren
Collaborator

slaren commented Jun 27, 2024

make clean; LLAMA_DEBUG=1 make llama-cli should do it.

@MarsBlessed
Author

> make clean; LLAMA_DEBUG=1 make llama-cli should do it.

I assume this should be enough:

libc++abi: terminating due to uncaught exception of type std::length_error: vector
Process 4371 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x000000018879ea60 libsystem_kernel.dylib`__pthread_kill + 8
libsystem_kernel.dylib`:
->  0x18879ea60 <+8>:  b.lo   0x18879ea80               ; <+40>
    0x18879ea64 <+12>: pacibsp
    0x18879ea68 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x18879ea6c <+20>: mov    x29, sp
Target 0: (llama-cli) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x000000018879ea60 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001887d6c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001886e3a30 libsystem_c.dylib`abort + 180
    frame #3: 0x000000018878dd08 libc++abi.dylib`abort_message + 132
    frame #4: 0x000000018877dfa4 libc++abi.dylib`demangling_terminate_handler() + 320
    frame #5: 0x000000018841c1e0 libobjc.A.dylib`_objc_terminate() + 160
    frame #6: 0x000000018878d0cc libc++abi.dylib`std::__terminate(void (*)()) + 16
    frame #7: 0x0000000188790348 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88
    frame #8: 0x000000018879028c libc++abi.dylib`__cxa_throw + 308
    frame #9: 0x0000000100005e74 llama-cli`std::__1::__throw_length_error[abi:ue170006](__msg="vector") at stdexcept:261:5
    frame #10: 0x000000010013a510 llama-cli`std::__1::vector<char, std::__1::allocator<char>>::__throw_length_error[abi:ue170006](this=0x000000016fdfa4e0 size=93) const at vector:963:7
    frame #11: 0x00000001002ea350 llama-cli`llama_chat_apply_template(llama_model const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<llama_chat_msg, std::__1::allocator<llama_chat_msg>> const&, bool) + 1204
    frame #12: 0x00000001002ea87c llama-cli`llama_chat_format_example(llama_model const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 428
    frame #13: 0x0000000100340f48 llama-cli`main(argc=11, argv=0x000000016fdff158) at main.cpp:227:5
    frame #14: 0x000000018844e0e0 dyld`start + 2360

@slaren
Collaborator

slaren commented Jun 27, 2024

This should have been fixed in #8160, try updating to master.

@MarsBlessed
Author

MarsBlessed commented Jun 27, 2024

Fixed by #8160 and a context-size adjustment.

@llmlover

llmlover commented Jul 1, 2024

Can someone please explain why this implementation runs significantly slower than a dense model with the same active parameter count?
