Some more observations regarding QX_0 quantizations with LLaMAv2 7B #2421
Replies: 2 comments
-
Interesting indeed. Maybe we could figure out a slower quantization mode which somehow measures the weight distribution of each tensor and selects a good enough quantization method for it. This would give a good compromise between quality, speed and model size. Simply calculating the variance of the tensor weights may work, and then experimentally determining which quantization method to use for different variance ranges.
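A minimal sketch of what such a heuristic could look like, assuming the tensor weights are available as dequantized floats. The enum, function names and variance thresholds below are made-up placeholders, not values from any experiment:

```cpp
#include <vector>

// Hypothetical stand-ins for the actual ggml quantization types.
enum class quant_choice { Q4_0, Q5_0, Q6_K };

// Variance of a tensor's weights.
static double weight_variance(const std::vector<float> & w) {
    double mean = 0.0;
    for (float x : w) mean += x;
    mean /= (double) w.size();

    double var = 0.0;
    for (float x : w) var += ((double) x - mean) * ((double) x - mean);
    return var / (double) w.size();
}

// Pick a quantization type from the variance.
// The thresholds are placeholders - they would have to be tuned experimentally.
static quant_choice pick_quant_for_tensor(const std::vector<float> & w) {
    const double var = weight_variance(w);
    if (var > 1e-3) return quant_choice::Q6_K; // unusually spread-out weights -> spend more bits
    if (var > 1e-4) return quant_choice::Q5_0;
    return quant_choice::Q4_0;                 // typical tensor -> default low-bit type
}
```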
-
The
-
In #2276 I reported an unusual behavior of LLaMAv2 7B when using `Q4_0` and `Q5_0` quantizations. In short, the observation is that with short prompts such as `"I believe the meaning of life is"`, after the end of the first sentence the generation switches into some weird mode - for example, often generating text in some other language, starting with a non-capital letter, etc. This only happens if the sentence ends with `.` - i.e. it doesn't occur when it ends with `!` or `?`. I find this weird and it has been bugging me ever since.

Today I noticed in the `quantize` tool output that the tensors in layers 0 and 1 have significantly different weight distributions compared to the other layers. This discrepancy is also observed to a similar extent for LLaMAv1.
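For reference, a rough sketch of how such per-tensor statistics could be computed outside of the `quantize` tool, assuming the weights have already been dequantized to F32 (loading and iterating over the tensors is omitted; the function name is made up):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Print simple distribution statistics (RMS and max |w|) for one tensor,
// e.g. with name = "layers.1.feed_forward.w2".
static void print_tensor_stats(const char * name, const std::vector<float> & w) {
    double sum_sq  = 0.0;
    double max_abs = 0.0;
    for (float x : w) {
        sum_sq  += (double) x * (double) x;
        max_abs  = std::max(max_abs, (double) std::fabs(x));
    }
    const double rms = std::sqrt(sum_sq / (double) w.size());
    printf("%-40s rms = %.6f  max|w| = %.6f\n", name, rms, max_abs);
}
```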
So I decided to increase the quantization accuracy of the tensors in layers `0` and `1` to see what happens. The weird behavior disappeared. I tried doing the same for other layers and it did not help.
Even stranger, after a few more experiments, the only tensor that I have to quantize more accurately (i.e. with either `Q6_K` or `Q8_0`) to fix the crazy texts is `layers.1.feed_forward.w2`.

For example, here are the top token probabilities for the first token of the second sentence using vanilla `Q4_0`:

You can see that these tokens don't make sense for the start of a new sentence.
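As a side note, such a list of top candidates can be produced from the raw logits roughly like this - a generic sketch, not the exact code used here, assuming the logits for the position of interest are already available:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Convert raw logits to probabilities and print the top-k candidate token ids.
static void print_top_candidates(const std::vector<float> & logits, int k) {
    // softmax with max-subtraction for numerical stability
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    std::vector<std::pair<double, int>> exps(logits.size());
    double sum = 0.0;
    for (size_t i = 0; i < logits.size(); ++i) {
        const double e = std::exp((double) logits[i] - max_logit);
        exps[i] = { e, (int) i };
        sum += e;
    }

    k = std::min(k, (int) exps.size());
    std::partial_sort(exps.begin(), exps.begin() + k, exps.end(),
                      [](const std::pair<double, int> & a, const std::pair<double, int> & b) {
                          return a.first > b.first;
                      });

    for (int i = 0; i < k; ++i) {
        printf("token %6d : p = %.4f\n", exps[i].second, exps[i].first / sum);
    }
}
```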
Here are the probabilities with `layers.1.feed_forward.w2` quantized with `Q6_K`, everything else being the same:

This looks much better and is more consistent with the full F16-precision token candidates:
I thought this was an interesting observation worth sharing.
Here is a diff patch if anyone wants to play with this further:
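The patch itself is not reproduced above. As a rough illustration of the idea only - not the actual diff - a per-tensor override during quantization could look something like this, with the enum and function name made up for the sketch:

```cpp
#include <string>

// Hypothetical stand-ins for the actual ggml quantization types.
enum class quant_type { Q4_0, Q6_K, Q8_0 };

// Return the quantization type for a given tensor, bumping the precision of
// the one tensor that was found to matter in these experiments.
static quant_type quant_type_for_tensor(const std::string & tensor_name, quant_type default_type) {
    if (tensor_name == "layers.1.feed_forward.w2") {
        return quant_type::Q6_K; // Q8_0 also fixes the issue, per the observations above
    }
    return default_type;
}
```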