Very low and incorrect Hellaswag score #4980

Closed
JianbangZ opened this issue Jan 16, 2024 · 3 comments

JianbangZ commented Jan 16, 2024

I tried to run 0-shot HellaSwag with various models, and they all produce a very low score (quantized or non-quantized), specifically Mistral-7B-v0.1.

Here is my script:

CUDA_VISIBLE_DEVICES=0 ./perplexity --hellaswag -ngl 99 -m /data1/models/mistral/Mistral-7B-v0.1/gguf/Mistral-7B-v0.1.f16.gguf -f /data1/datasets/hellaswag_val_full.txt
CUDA_VISIBLE_DEVICES=0 ./perplexity --hellaswag --hellaswag-tasks 2000 -ngl 99 -m /data1/models/mistral/Mistral-7B-v0.1/gguf/Mistral-7B-v0.1.f16.gguf -f /data1/datasets/hellaswag_val_full.txt

When using 400 randomized tasks, the score is 50.5.
When using 2000 tasks, the score is 50.55.

I also tested a Q8_0 model I quantized myself: the score is 50.5.
I also tested the Q8_0 model I downloaded from TheBloke on HF: the score is also 50.5. This means it's not a problem with my models.
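
For context, llama.cpp's --hellaswag scoring works roughly like this: for each task it scores every candidate ending by its length-normalised log-likelihood under the model and counts the task as correct when the best-scoring ending matches the label; the acc_norm column in the log below is the running percentage of correct tasks. A minimal sketch of that scoring, assuming a hypothetical log_probs(context, ending) helper that returns per-token log-probabilities (the names here are illustrative, not llama.cpp's actual API):

def pick_ending(log_probs, context, endings):
    # Length-normalised log-likelihood of each candidate ending; return the
    # index of the highest-scoring one.
    scores = [sum(lps) / len(lps) for lps in (log_probs(context, e) for e in endings)]
    return max(range(len(endings)), key=lambda i: scores[i])

def hellaswag_acc_norm(tasks, log_probs):
    # tasks: iterable of (context, endings, gold_index) triples.
    # Returns the percentage reported in the acc_norm column of the log below.
    tasks = list(tasks)
    correct = sum(pick_ending(log_probs, ctx, ends) == gold for ctx, ends, gold in tasks)
    return 100.0 * correct / len(tasks)

Random guessing among four endings gives about 25%, and publicly reported HellaSwag scores for Mistral-7B-v0.1 are typically in the low 80s, so a score that plateaus around 50 points to a bug in the scoring or tokenization path rather than in the model files.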

Here are the logs:
main: build = 1878 (4483396)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1705416778
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /data1/models/mistral/Mistral-7B-v0.1/gguf/Mistral-7B-v0.1.f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = mistral
llama_model_loader: - kv 2: llama.context_length u32 = 32768
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,58980] = ["▁ t", "i n", "e r", "▁ a", "h e...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.24 B
llm_load_print_meta: model size = 13.49 GiB (16.00 BPW)
llm_load_print_meta: general.name = mistral
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 250.00 MiB
llm_load_tensors: CUDA0 buffer size = 13563.02 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: graph splits (measure): 3
llama_new_context_with_model: CUDA0 compute buffer size = 73.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 9.00 MiB

system_info: n_threads = 32 / 64 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
hellaswag_score : loaded 10042 tasks from prompt.
================================= is_spm = 1
hellaswag_score : selecting 400 randomized tasks.
hellaswag_score : calculating hellaswag score over selected tasks.

task acc_norm
1 0.00000000
2 0.00000000
3 0.00000000
4 25.00000000
5 20.00000000
6 33.33333333
7 28.57142857
8 37.50000000
9 33.33333333
10 30.00000000
11 27.27272727
12 25.00000000
13 23.07692308
14 28.57142857
15 26.66666667
16 25.00000000
17 29.41176471
18 33.33333333
19 36.84210526
20 35.00000000
21 38.09523810
22 40.90909091
23 39.13043478
24 37.50000000
25 40.00000000
26 38.46153846
27 40.74074074
28 39.28571429
29 37.93103448
30 40.00000000
31 38.70967742
32 40.62500000
33 39.39393939
34 38.23529412
35 37.14285714
36 38.88888889
37 37.83783784
38 39.47368421
39 38.46153846
40 40.00000000
41 39.02439024
42 40.47619048
43 39.53488372
44 38.63636364
45 37.77777778
46 36.95652174
47 38.29787234
48 37.50000000
49 36.73469388
50 38.00000000
51 37.25490196
52 36.53846154
53 37.73584906
54 37.03703704
55 36.36363636
56 35.71428571
57 35.08771930
58 36.20689655
59 37.28813559
60 36.66666667
61 37.70491803
62 37.09677419
63 38.09523810
64 37.50000000
65 38.46153846
66 37.87878788
67 37.31343284
68 36.76470588
69 36.23188406
70 35.71428571
71 36.61971831
72 37.50000000
73 38.35616438
74 37.83783784
75 37.33333333
76 36.84210526
77 36.36363636
78 35.89743590
79 36.70886076
80 36.25000000
81 37.03703704
82 36.58536585
83 36.14457831
84 36.90476190
85 37.64705882
86 37.20930233
87 37.93103448
88 38.63636364
89 39.32584270
90 40.00000000
91 39.56043956
92 39.13043478
93 38.70967742
94 39.36170213
95 40.00000000
96 40.62500000
97 41.23711340
98 41.83673469
99 41.41414141
100 42.00000000
101 42.57425743
102 42.15686275
103 41.74757282
104 42.30769231
105 41.90476190
106 42.45283019
107 42.05607477
108 42.59259259
109 42.20183486
110 42.72727273
111 43.24324324
112 43.75000000
113 44.24778761
114 43.85964912
115 44.34782609
116 44.82758621
117 45.29914530
118 44.91525424
119 44.53781513
120 45.00000000
121 45.45454545
122 45.90163934
123 45.52845528
124 45.16129032
125 44.80000000
126 44.44444444
127 44.88188976
128 44.53125000
129 44.18604651
130 44.61538462
131 45.03816794
132 44.69696970
133 44.36090226
134 44.02985075
135 43.70370370
136 44.11764706
137 44.52554745
138 44.92753623
139 44.60431655
140 44.28571429
141 43.97163121
142 43.66197183
143 43.35664336
144 43.75000000
145 44.13793103
146 43.83561644
147 43.53741497
148 43.24324324
149 42.95302013
150 43.33333333
151 43.04635762
152 42.76315789
153 42.48366013
154 42.85714286
155 42.58064516
156 42.94871795
157 42.67515924
158 42.40506329
159 42.13836478
160 41.87500000
161 41.61490683
162 41.97530864
163 41.71779141
164 42.07317073
165 42.42424242
166 42.16867470
167 42.51497006
168 42.26190476
169 42.60355030
170 42.35294118
171 42.69005848
172 42.44186047
173 42.77456647
174 43.10344828
175 43.42857143
176 43.75000000
177 44.06779661
178 44.38202247
179 44.13407821
180 44.44444444
181 44.75138122
182 45.05494505
183 44.80874317
184 45.10869565
185 45.40540541
186 45.69892473
187 45.98930481
188 45.74468085
189 46.03174603
190 45.78947368
191 46.07329843
192 45.83333333
193 46.11398964
194 45.87628866
195 46.15384615
196 46.42857143
197 46.19289340
198 45.95959596
199 45.72864322
200 45.50000000
201 45.27363184
202 45.54455446
203 45.81280788
204 46.07843137
205 45.85365854
206 46.11650485
207 46.37681159
208 46.63461538
209 46.88995215
210 47.14285714
211 46.91943128
212 46.69811321
213 46.47887324
214 46.72897196
215 46.51162791
216 46.75925926
217 47.00460829
218 47.24770642
219 47.48858447
220 47.72727273
221 47.51131222
222 47.29729730
223 47.53363229
224 47.32142857
225 47.11111111
226 47.34513274
227 47.13656388
228 47.36842105
229 47.59825328
230 47.82608696
231 47.61904762
232 47.84482759
233 47.63948498
234 47.86324786
235 48.08510638
236 47.88135593
237 47.67932489
238 47.89915966
239 48.11715481
240 47.91666667
241 47.71784232
242 47.93388430
243 47.73662551
244 47.95081967
245 48.16326531
246 47.96747967
247 48.17813765
248 47.98387097
249 48.19277108
250 48.40000000
251 48.60557769
252 48.80952381
253 49.01185771
254 48.81889764
255 48.62745098
256 48.43750000
257 48.63813230
258 48.44961240
259 48.26254826
260 48.07692308
261 48.27586207
262 48.47328244
263 48.28897338
264 48.10606061
265 48.30188679
266 48.49624060
267 48.68913858
268 48.50746269
269 48.32713755
270 48.51851852
271 48.70848708
272 48.89705882
273 49.08424908
274 49.27007299
275 49.45454545
276 49.27536232
277 49.45848375
278 49.64028777
279 49.82078853
280 49.64285714
281 49.82206406
282 49.64539007
283 49.46996466
284 49.29577465
285 49.47368421
286 49.30069930
287 49.47735192
288 49.65277778
289 49.82698962
290 50.00000000
291 50.17182131
292 50.00000000
293 50.17064846
294 50.34013605
295 50.16949153
296 50.33783784
297 50.16835017
298 50.33557047
299 50.50167224
300 50.66666667
301 50.49833887
302 50.66225166
303 50.49504950
304 50.65789474
305 50.49180328
306 50.65359477
307 50.81433225
308 50.64935065
309 50.80906149
310 50.64516129
311 50.48231511
312 50.64102564
313 50.79872204
314 50.63694268
315 50.79365079
316 50.94936709
317 50.78864353
318 50.62893082
319 50.78369906
320 50.62500000
321 50.46728972
322 50.62111801
323 50.77399381
324 50.92592593
325 51.07692308
326 50.92024540
327 51.07033639
328 51.21951220
329 51.36778116
330 51.51515152
331 51.35951662
332 51.20481928
333 51.05105105
334 51.19760479
335 51.04477612
336 51.19047619
337 51.03857567
338 51.18343195
339 51.03244838
340 51.17647059
341 51.02639296
342 50.87719298
343 51.02040816
344 51.16279070
345 51.01449275
346 50.86705202
347 50.72046110
348 50.57471264
349 50.71633238
350 50.57142857
351 50.42735043
352 50.28409091
353 50.14164306
354 50.00000000
355 50.14084507
356 50.28089888
357 50.14005602
358 50.00000000
359 49.86072423
360 50.00000000
361 50.13850416
362 50.00000000
363 49.86225895
364 49.72527473
365 49.86301370
366 50.00000000
367 50.13623978
368 50.27173913
369 50.40650407
370 50.54054054
371 50.67385445
372 50.53763441
373 50.67024129
374 50.80213904
375 50.66666667
376 50.79787234
377 50.66312997
378 50.52910053
379 50.39577836
380 50.52631579
381 50.39370079
382 50.26178010
383 50.39164491
384 50.26041667
385 50.12987013
386 50.25906736
387 50.12919897
388 50.00000000
389 50.12853470
390 50.25641026
391 50.38363171
392 50.51020408
393 50.63613232
394 50.50761421
395 50.63291139
396 50.50505051
397 50.37783375
398 50.50251256
399 50.62656642
400 50.50000000

llama_print_timings: load time = 27306.32 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 34976.97 ms / 70708 tokens ( 0.49 ms per token, 2021.56 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 41006.48 ms / 70709 tokens
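
As a rough sanity check that this is not just sampling noise over 400 tasks (a sketch; the ~0.81 expected score is my assumption based on commonly reported HellaSwag results for Mistral-7B-v0.1):

import math

# Standard error of a proportion p estimated over n tasks is sqrt(p*(1-p)/n).
n, observed, expected = 400, 0.505, 0.81   # expected ~0.81 is an assumption
se = math.sqrt(expected * (1 - expected) / n)
print((observed - expected) / se)          # about -15 standard errors below expectation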

ikawrakow (Contributor) commented:

I haven't used this in a while, but running it now on the latest master and comparing against some logs I have kept from past runs, it is completely broken.

MarcusDunn (Contributor) commented:

I believe this was fixed by #4981 and can be closed.

ikawrakow (Contributor) commented:

Closed via #4981
