
RMSE-optimized quants for all quantization types #1106

Closed · wants to merge 3 commits

Conversation

@ikawrakow (Contributor) commented Apr 21, 2023

The PR adds a new build option (LLAMA_NO_RMSE), which is off by default. When it is off (i.e., by default), all current quantization types (Q4_0, Q4_1, Q4_2, Q4_3) are performed with RMSE minimization (on master, RMSE minimization is enabled only for Q4_2 and cannot easily be disabled).

This makes generation of quantized models take quite a bit longer, but it is still in the same ballpark as it used to take before quantization was multi-threaded in PR #1075.

With this option enabled, Q4_3 gives a perplexity of 6.0344 for the 7B model, so 0.0273 lower than simple Q4_3 quantization as reported by @ggerganov in #406. If I also enable his trick of not quantizing output tensors, perplexity becomes 6.0085.

Perplexity result for Q4_3 without quantization of output tensors for the 13B model is 5.3117.

Details for these perplexity runs can be found here (issue #406).

As far as I can tell, we are now on par with the best known GPTQ result for 7B, and better for 13B by about 0.05.
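
A minimal sketch of the per-block idea, assuming a 32-value block and a quantized range of [-8, 7] (this is not the PR's actual kquantize_* code; the candidate grid and rounding are illustrative assumptions):

#include <math.h>

#define QK 32  /* assumed block size */

/* squared reconstruction error of quantizing x[0..QK-1] to [-8, 7] with this scale */
static float sq_error_for_scale(const float * x, float scale) {
    float err = 0.0f;
    for (int i = 0; i < QK; ++i) {
        int q = (int)roundf(x[i] / scale);
        if (q < -8) q = -8;
        if (q >  7) q =  7;
        const float d = x[i] - scale * (float)q;
        err += d * d;
    }
    return err;
}

/* pick the scale that minimizes the reconstruction error over a candidate grid */
static float rmse_optimal_scale(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    if (amax == 0.0f) return 0.0f;        /* all-zero block: any scale works */

    float best_scale = amax / 7.0f;       /* the "simple" Q4 choice */
    float best_err   = sq_error_for_scale(x, best_scale);
    for (int k = 1; k <= 64; ++k) {       /* candidate divisors from 7 to 11 */
        const float s = amax / (7.0f + 0.0625f * (float)k);
        const float e = sq_error_for_scale(x, s);
        if (e < best_err) {
            best_err   = e;
            best_scale = s;
        }
    }
    return best_scale;
}

A smaller scale reduces the error on the bulk of the values at the cost of clipping the largest ones; minimizing RMSE picks whichever balance is best per block.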

@Green-Sky (Collaborator) commented:

Sounds like a good idea. For me personally, I/O is the bottleneck, since I store them on a NAS.

@sw (Contributor) commented Apr 21, 2023

It might be a good idea to get #953 merged first, which implements unit tests for the quantization. But that requires an improvement to the test samples.

@sw (Contributor) left a review comment:

Please update SHA256SUMS, at the very least remove the files which are now different.

float scale = sumlx/suml2;
return scale;
}
static float kquantize_q4_with_bound_plus(int n, int nmax, const float * restrict X, int nCandidates,
@sw commented on Apr 21, 2023:

What does _plus mean?

Couldn't you re-use kquantize_q4_with_bounds with nmin=0?

@ggerganov added the labels "high priority" (Very important issue) and "generation quality" (Quality of model output) on Apr 22, 2023
@sw (Contributor) commented Apr 22, 2023

I'm still a bit skeptical whether chasing after RMSE is the right thing to do.

Let me explain what I mean: originally, the Q4 methods calculate max(abs()) and divide that by 7. #729 intends to calculate the signed max, then divide by 8. This PR tries to find the divisor that gives the minimum RMS error. But maybe the princess is in another castle?

What if it actually helps perplexity if we clip the largest values somewhat, even if that comes at a higher RMS error?

   ^
p  |
e  |
r  | *
p  |     orig                        *
l  |     *      #729                  
e  |            *              *
x  | - - - - - - - - - - - - - - - - < RMSE optimum #1106
i  |                           
t  |                   *             < perplexity optimum?
y  |
   +-----|------|------|-------------> 
         7      8      ?
             scale factor

So the approach to find that would be: use #729, choose a divisor in the interesting range of maybe [7, 11], quantize the model, do a perplexity run, lather, rinse, repeat.
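
A rough sketch of the per-block step in that experiment, assuming a #729-style signed-max scale with the divisor D as the knob to sweep (illustrative only, not the #729 code; block size and names are assumptions). Each choice of D would then be followed by a full perplexity run:

#include <math.h>
#include <stdint.h>

#define QK 32  /* assumed block size */

/* Quantize one block so that the value with the largest magnitude maps to -D.
   For D > 8 that value (and anything close to it) clips to the edge of [-8, 7]. */
static float quantize_block_clipped(const float * x, float D, int8_t * q) {
    float smax = 0.0f;  /* signed value with the largest magnitude */
    for (int i = 0; i < QK; ++i) {
        if (fabsf(x[i]) > fabsf(smax)) smax = x[i];
    }
    const float scale = (smax != 0.0f) ? -smax / D : 1.0f;
    for (int i = 0; i < QK; ++i) {
        int v = (int)roundf(x[i] / scale);
        if (v < -8) v = -8;   /* clipping happens here when D > 8 */
        if (v >  7) v =  7;
        q[i] = (int8_t)v;
    }
    return scale;  /* stored per block, as usual */
}

Sweeping D and measuring perplexity (rather than RMSE) for each resulting model is the experiment proposed above.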

@ggerganov (Owner) commented Apr 22, 2023

@ikawrakow

Just made a full cuBLAS run on 13B using Q4_3, without RMSE optimization and output in F16 precision and got: 5.3075

main: seed = 1682170268
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 6 (mostly Q4_3)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required  = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.93 seconds per pass - ETA 32 minutes
[1]3.7052,[2]4.1553,[3]4.9530,[4]5.3817,[5]5.5598,[6]5.4938,[7]5.6338,[8]5.7492,[9]6.0136,[10]6.2525,[11]6.4388,[12]6.4983,[13]6.4590,[14]6.5567,[15]6.7657,[16]6.4420,[17]6.3526,[18]6.3318,[19]6.0375,[20]6.0170,[21]5.9417,[22]5.7639,[23]5.7352,[24]5.6400,[25]5.6548,[26]5.5023,[27]5.3302,[28]5.2330,[29]5.1565,[30]5.0200,[31]4.9747,[32]4.9854,[33]4.9409,[34]4.9796,[35]4.9984,[36]5.0189,[37]5.0113,[38]5.0078,[39]5.0349,[40]5.0774,[41]5.0999,[42]5.1325,[43]5.0970,[44]5.1402,[45]5.1450,[46]5.1202,[47]5.1464,[48]5.1286,[49]5.1304,[50]5.0999,[51]5.1075,[52]5.1012,[53]5.1478,[54]5.1379,[55]5.1200,[56]5.1404,[57]5.1594,[58]5.1818,[59]5.2003,[60]5.2387,[61]5.2315,[62]5.2862,[63]5.3117,[64]5.3227,[65]5.3586,[66]5.3594,[67]5.3771,[68]5.3901,[69]5.4182,[70]5.4484,[71]5.4717,[72]5.5064,[73]5.5534,[74]5.5610,[75]5.5703,[76]5.5838,[77]5.5960,[78]5.5827,[79]5.6087,[80]5.6043,[81]5.6133,[82]5.6107,[83]5.5655,[84]5.5553,[85]5.5483,[86]5.5331,[87]5.4686,[88]5.4265,[89]5.4044,[90]5.3939,[91]5.4152,[92]5.4128,[93]5.4153,[94]5.4153,[95]5.4412,[96]5.4383,[97]5.4336,[98]5.4300,[99]5.4225,[100]5.4204,[101]5.4440,[102]5.4397,[103]5.4550,[104]5.4598,[105]5.4610,[106]5.4753,[107]5.4745,[108]5.4894,[109]5.4882,[110]5.4833,[111]5.5022,[112]5.5191,[113]5.5182,[114]5.5175,[115]5.5215,[116]5.5093,[117]5.5097,[118]5.5330,[119]5.5514,[120]5.5800,[121]5.5945,[122]5.6158,[123]5.6525,[124]5.6684,[125]5.6634,[126]5.6990,[127]5.7300,[128]5.7574,[129]5.7454,[130]5.7539,[131]5.7490,[132]5.7446,[133]5.7318,[134]5.7402,[135]5.7392,[136]5.7311,[137]5.7266,[138]5.7136,[139]5.7058,[140]5.7050,[141]5.6776,[142]5.6734,[143]5.6487,[144]5.6326,[145]5.6238,[146]5.6132,[147]5.6179,[148]5.6202,[149]5.6169,[150]5.6165,[151]5.6212,[152]5.6153,[153]5.6064,[154]5.6005,[155]5.6066,[156]5.6042,[157]5.6202,[158]5.6226,[159]5.6232,[160]5.6268,[161]5.6384,[162]5.6133,[163]5.6034,[164]5.5826,[165]5.5576,[166]5.5342,[167]5.5020,[168]5.4757,[169]5.4622,[170]5.4531,[171]5.4325,[172]5.4202,[173]5.4072,[174]5.3805,[175]5.3599,[176]5.3462,[177]5.3294,[178]5.3096,[179]5.2962,[180]5.2892,[181]5.2729,[182]5.2565,[183]5.2445,[184]5.2435,[185]5.2367,[186]5.2377,[187]5.2436,[188]5.2419,[189]5.2583,[190]5.2585,[191]5.2758,[192]5.2892,[193]5.3032,[194]5.3145,[195]5.3332,[196]5.3447,[197]5.3635,[198]5.3770,[199]5.3788,[200]5.3797,[201]5.3730,[202]5.3862,[203]5.3922,[204]5.3871,[205]5.3960,[206]5.4014,[207]5.3972,[208]5.4033,[209]5.4065,[210]5.4120,[211]5.4227,[212]5.4292,[213]5.4386,[214]5.4415,[215]5.4445,[216]5.4570,[217]5.4734,[218]5.4867,[219]5.4863,[220]5.4836,[221]5.4789,[222]5.4792,[223]5.4732,[224]5.4665,[225]5.4628,[226]5.4829,[227]5.4883,[228]5.4956,[229]5.5025,[230]5.4989,[231]5.5143,[232]5.5036,[233]5.4888,[234]5.4747,[235]5.4525,[236]5.4473,[237]5.4386,[238]5.4417,[239]5.4306,[240]5.4218,[241]5.4251,[242]5.4265,[243]5.4257,[244]5.4163,[245]5.4128,[246]5.4028,[247]5.3930,[248]5.3868,[249]5.3837,[250]5.3874,[251]5.3792,[252]5.3743,[253]5.3653,[254]5.3607,[255]5.3515,[256]5.3350,[257]5.3249,[258]5.3183,[259]5.3173,[260]5.3090,[261]5.3038,[262]5.2997,[263]5.2947,[264]5.2711,[265]5.2707,[266]5.2679,[267]5.2618,[268]5.2684,[269]5.2676,[270]5.2685,[271]5.2749,[272]5.2778,[273]5.2794,[274]5.2802,[275]5.2861,[276]5.2918,[277]5.3039,[278]5.3125,[279]5.3207,[280]5.3244,[281]5.3339,[282]5.3395,[283]5.3517,[284]5.3602,[285]5.3681,[286]5.3805,[287]5.3778,[288]5.3831,[289]5.3770,[290]5.3628,[291]5.3498,[292]5.3364,[293]5.3246,[294]5.3254,[295]5.3256,[296]5.3304,[297]5.3295,[298]5.3317,[299]5.3295,[300]5.3208,[301]5.3211,[302]5.3147,[303]5.3065,[304]5.2992,[305]5.2967,[30
6]5.2864,[307]5.2893,[308]5.2904,[309]5.2772,[310]5.2743,[311]5.2698,[312]5.2711,[313]5.2657,[314]5.2642,[315]5.2510,[316]5.2470,[317]5.2344,[318]5.2184,[319]5.2289,[320]5.2399,[321]5.2447,[322]5.2418,[323]5.2358,[324]5.2339,[325]5.2436,[326]5.2452,[327]5.2460,[328]5.2495,[329]5.2540,[330]5.2561,[331]5.2663,[332]5.2627,[333]5.2701,[334]5.2656,[335]5.2605,[336]5.2629,[337]5.2619,[338]5.2615,[339]5.2571,[340]5.2539,[341]5.2602,[342]5.2634,[343]5.2674,[344]5.2677,[345]5.2692,[346]5.2676,[347]5.2712,[348]5.2750,[349]5.2773,[350]5.2754,[351]5.2767,[352]5.2769,[353]5.2716,[354]5.2725,[355]5.2774,[356]5.2802,[357]5.2774,[358]5.2854,[359]5.2874,[360]5.2843,[361]5.2843,[362]5.2913,[363]5.3020,[364]5.3072,[365]5.3110,[366]5.3126,[367]5.3213,[368]5.3190,[369]5.3204,[370]5.3224,[371]5.3185,[372]5.3231,[373]5.3270,[374]5.3251,[375]5.3248,[376]5.3306,[377]5.3271,[378]5.3296,[379]5.3330,[380]5.3264,[381]5.3235,[382]5.3196,[383]5.3176,[384]5.3176,[385]5.3166,[386]5.3152,[387]5.3152,[388]5.3126,[389]5.3088,[390]5.3036,[391]5.2979,[392]5.2944,[393]5.2939,[394]5.2970,[395]5.2963,[396]5.2909,[397]5.2973,[398]5.3014,[399]5.3083,[400]5.3077,[401]5.3085,[402]5.3097,[403]5.3119,[404]5.3173,[405]5.3023,[406]5.2982,[407]5.2970,[408]5.2980,[409]5.3090,[410]5.3178,[411]5.3271,[412]5.3412,[413]5.3513,[414]5.3571,[415]5.3630,[416]5.3702,[417]5.3798,[418]5.3822,[419]5.3871,[420]5.3947,[421]5.4045,[422]5.4077,[423]5.4134,[424]5.4224,[425]5.4301,[426]5.4360,[427]5.4401,[428]5.4473,[429]5.4509,[430]5.4572,[431]5.4696,[432]5.4727,[433]5.4721,[434]5.4688,[435]5.4701,[436]5.4730,[437]5.4812,[438]5.4887,[439]5.4856,[440]5.4850,[441]5.4808,[442]5.4796,[443]5.4807,[444]5.4824,[445]5.4815,[446]5.4835,[447]5.4859,[448]5.4892,[449]5.4876,[450]5.4888,[451]5.4862,[452]5.4707,[453]5.4614,[454]5.4560,[455]5.4563,[456]5.4601,[457]5.4612,[458]5.4594,[459]5.4592,[460]5.4665,[461]5.4622,[462]5.4588,[463]5.4568,[464]5.4564,[465]5.4542,[466]5.4466,[467]5.4453,[468]5.4435,[469]5.4444,[470]5.4433,[471]5.4383,[472]5.4386,[473]5.4341,[474]5.4329,[475]5.4263,[476]5.4239,[477]5.4154,[478]5.4128,[479]5.4132,[480]5.4156,[481]5.4156,[482]5.4110,[483]5.4068,[484]5.4078,[485]5.4011,[486]5.3950,[487]5.3939,[488]5.3917,[489]5.3865,[490]5.3832,[491]5.3798,[492]5.3734,[493]5.3707,[494]5.3689,[495]5.3670,[496]5.3630,[497]5.3569,[498]5.3544,[499]5.3510,[500]5.3431,[501]5.3361,[502]5.3351,[503]5.3342,[504]5.3265,[505]5.3262,[506]5.3268,[507]5.3214,[508]5.3177,[509]5.3182,[510]5.3203,[511]5.3246,[512]5.3286,[513]5.3311,[514]5.3362,[515]5.3320,[516]5.3310,[517]5.3310,[518]5.3311,[519]5.3332,[520]5.3344,[521]5.3356,[522]5.3370,[523]5.3378,[524]5.3431,[525]5.3457,[526]5.3462,[527]5.3477,[528]5.3425,[529]5.3434,[530]5.3398,[531]5.3392,[532]5.3440,[533]5.3467,[534]5.3451,[535]5.3473,[536]5.3432,[537]5.3414,[538]5.3465,[539]5.3473,[540]5.3487,[541]5.3486,[542]5.3500,[543]5.3521,[544]5.3534,[545]5.3525,[546]5.3526,[547]5.3494,[548]5.3452,[549]5.3454,[550]5.3434,[551]5.3409,[552]5.3389,[553]5.3360,[554]5.3338,[555]5.3318,[556]5.3310,[557]5.3328,[558]5.3294,[559]5.3299,[560]5.3285,[561]5.3285,[562]5.3261,[563]5.3258,[564]5.3299,[565]5.3309,[566]5.3316,[567]5.3295,[568]5.3307,[569]5.3292,[570]5.3318,[571]5.3331,[572]5.3339,[573]5.3342,[574]5.3312,[575]5.3295,[576]5.3288,[577]5.3272,[578]5.3254,[579]5.3252,[580]5.3200,[581]5.3171,[582]5.3170,[583]5.3178,[584]5.3183,[585]5.3126,[586]5.3071,[587]5.3076,[588]5.3120,[589]5.3169,[590]5.3199,[591]5.3216,[592]5.3205,[593]5.3165,[594]5.3180,[595]5.3166,[596]5.3204,[597]5.3183,[598]5.3151,[599]5.3178,[600]5.3169,[601]5.3157,[602]5
.3157,[603]5.3185,[604]5.3191,[605]5.3218,[606]5.3231,[607]5.3217,[608]5.3188,[609]5.3197,[610]5.3238,[611]5.3227,[612]5.3249,[613]5.3220,[614]5.3179,[615]5.3120,[616]5.3148,[617]5.3099,[618]5.3056,[619]5.3012,[620]5.2903,[621]5.2852,[622]5.2833,[623]5.2846,[624]5.2852,[625]5.2859,[626]5.2856,[627]5.2882,[628]5.2890,[629]5.2894,[630]5.2925,[631]5.2970,[632]5.3017,[633]5.3007,[634]5.3036,[635]5.3033,[636]5.2997,[637]5.2961,[638]5.2979,[639]5.2949,[640]5.2957,[641]5.2960,[642]5.3010,[643]5.3026,[644]5.3044,[645]5.3029,[646]5.3063,[647]5.3014,[648]5.3024,[649]5.3027,[650]5.3055,[651]5.3097,[652]5.3100,[653]5.3137,[654]5.3084,[655]5.3075,
llama_print_timings:        load time =  6119.84 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 1858813.21 ms / 335360 tokens (    5.54 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 1889707.90 ms

Will make another run, this time using RMSE optimization (i.e. same as the one in OP) and double-check the reported 5.3117 result. But if it is confirmed, it would indicate that the RMSE optimization in this case is actually making the result worse for some reason.

@ggerganov (Owner) commented:

My result for 13B, using Q4_3 with RMSE opt. + F16 output is: 5.2962

This result, I think, makes more sense, since it is in line with the expectation I described here: #406 (reply in thread)

main: seed = 1682172642
llama.cpp: loading model from ../models/13B/ggml-model-q4_3-output-f16-rmse.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 6 (mostly Q4_3)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 9734493.73 KB
llama_model_load_internal: mem required  = 11554.34 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity : calculating perplexity over 655 chunks, batch_size=512
2.94 seconds per pass - ETA 32 minutes
[1]3.7362,[2]4.1744,[3]4.9576,[4]5.3621,[5]5.5410,[6]5.4788,[7]5.6392,[8]5.7500,[9]6.0088,[10]6.2366,[11]6.4228,[12]6.4859,[13]6.4491,[14]6.5428,[15]6.7439,[16]6.4225,[17]6.3396,[18]6.3169,[19]6.0233,[20]6.0024,[21]5.9256,[22]5.7530,[23]5.7201,[24]5.6258,[25]5.6327,[26]5.4845,[27]5.3094,[28]5.2083,[29]5.1320,[30]4.9981,[31]4.9567,[32]4.9675,[33]4.9237,[34]4.9636,[35]4.9806,[36]5.0033,[37]4.9960,[38]4.9915,[39]5.0202,[40]5.0616,[41]5.0862,[42]5.1202,[43]5.0861,[44]5.1307,[45]5.1348,[46]5.1096,[47]5.1370,[48]5.1183,[49]5.1225,[50]5.0927,[51]5.0998,[52]5.0920,[53]5.1385,[54]5.1290,[55]5.1113,[56]5.1311,[57]5.1489,[58]5.1710,[59]5.1904,[60]5.2260,[61]5.2188,[62]5.2735,[63]5.2982,[64]5.3100,[65]5.3463,[66]5.3455,[67]5.3634,[68]5.3761,[69]5.4045,[70]5.4349,[71]5.4582,[72]5.4919,[73]5.5385,[74]5.5451,[75]5.5550,[76]5.5687,[77]5.5802,[78]5.5664,[79]5.5933,[80]5.5871,[81]5.5951,[82]5.5919,[83]5.5466,[84]5.5365,[85]5.5301,[86]5.5156,[87]5.4509,[88]5.4070,[89]5.3858,[90]5.3750,[91]5.3960,[92]5.3922,[93]5.3940,[94]5.3927,[95]5.4193,[96]5.4162,[97]5.4128,[98]5.4089,[99]5.4020,[100]5.3994,[101]5.4223,[102]5.4177,[103]5.4330,[104]5.4377,[105]5.4390,[106]5.4531,[107]5.4517,[108]5.4666,[109]5.4659,[110]5.4606,[111]5.4784,[112]5.4950,[113]5.4943,[114]5.4930,[115]5.4972,[116]5.4852,[117]5.4847,[118]5.5081,[119]5.5259,[120]5.5549,[121]5.5701,[122]5.5912,[123]5.6276,[124]5.6452,[125]5.6402,[126]5.6758,[127]5.7086,[128]5.7369,[129]5.7256,[130]5.7341,[131]5.7301,[132]5.7257,[133]5.7132,[134]5.7222,[135]5.7222,[136]5.7139,[137]5.7100,[138]5.6974,[139]5.6896,[140]5.6884,[141]5.6614,[142]5.6575,[143]5.6327,[144]5.6168,[145]5.6083,[146]5.5972,[147]5.6019,[148]5.6050,[149]5.6019,[150]5.6011,[151]5.6057,[152]5.5999,[153]5.5903,[154]5.5846,[155]5.5908,[156]5.5892,[157]5.6045,[158]5.6062,[159]5.6072,[160]5.6110,[161]5.6225,[162]5.5972,[163]5.5878,[164]5.5676,[165]5.5427,[166]5.5196,[167]5.4880,[168]5.4613,[169]5.4483,[170]5.4390,[171]5.4185,[172]5.4062,[173]5.3930,[174]5.3661,[175]5.3457,[176]5.3327,[177]5.3162,[178]5.2963,[179]5.2832,[180]5.2757,[181]5.2597,[182]5.2438,[183]5.2320,[184]5.2312,[185]5.2241,[186]5.2253,[187]5.2309,[188]5.2284,[189]5.2448,[190]5.2451,[191]5.2620,[192]5.2756,[193]5.2900,[194]5.3014,[195]5.3208,[196]5.3325,[197]5.3513,[198]5.3647,[199]5.3667,[200]5.3676,[201]5.3610,[202]5.3735,[203]5.3792,[204]5.3744,[205]5.3834,[206]5.3888,[207]5.3851,[208]5.3906,[209]5.3943,[210]5.3998,[211]5.4100,[212]5.4163,[213]5.4254,[214]5.4288,[215]5.4319,[216]5.4438,[217]5.4603,[218]5.4738,[219]5.4735,[220]5.4706,[221]5.4657,[222]5.4658,[223]5.4597,[224]5.4532,[225]5.4496,[226]5.4696,[227]5.4756,[228]5.4828,[229]5.4899,[230]5.4862,[231]5.5013,[232]5.4910,[233]5.4762,[234]5.4620,[235]5.4403,[236]5.4352,[237]5.4269,[238]5.4304,[239]5.4193,[240]5.4102,[241]5.4136,[242]5.4153,[243]5.4147,[244]5.4049,[245]5.4014,[246]5.3912,[247]5.3816,[248]5.3755,[249]5.3723,[250]5.3757,[251]5.3675,[252]5.3627,[253]5.3537,[254]5.3493,[255]5.3401,[256]5.3237,[257]5.3136,[258]5.3070,[259]5.3062,[260]5.2981,[261]5.2931,[262]5.2891,[263]5.2843,[264]5.2605,[265]5.2605,[266]5.2575,[267]5.2515,[268]5.2580,[269]5.2572,[270]5.2580,[271]5.2640,[272]5.2668,[273]5.2678,[274]5.2685,[275]5.2745,[276]5.2801,[277]5.2921,[278]5.3005,[279]5.3085,[280]5.3122,[281]5.3216,[282]5.3269,[283]5.3391,[284]5.3474,[285]5.3553,[286]5.3679,[287]5.3645,[288]5.3696,[289]5.3634,[290]5.3495,[291]5.3367,[292]5.3234,[293]5.3117,[294]5.3125,[295]5.3125,[296]5.3172,[297]5.3161,[298]5.3181,[299]5.3160,[300]5.3074,[301]5.3077,[302]5.3015,[303]5.2931,[304]5.2860,[305]5.2835,[30
6]5.2733,[307]5.2761,[308]5.2769,[309]5.2637,[310]5.2612,[311]5.2570,[312]5.2585,[313]5.2533,[314]5.2514,[315]5.2387,[316]5.2343,[317]5.2222,[318]5.2060,[319]5.2165,[320]5.2273,[321]5.2322,[322]5.2293,[323]5.2237,[324]5.2220,[325]5.2315,[326]5.2329,[327]5.2335,[328]5.2373,[329]5.2422,[330]5.2444,[331]5.2547,[332]5.2512,[333]5.2586,[334]5.2541,[335]5.2490,[336]5.2513,[337]5.2502,[338]5.2501,[339]5.2458,[340]5.2431,[341]5.2495,[342]5.2528,[343]5.2568,[344]5.2571,[345]5.2586,[346]5.2569,[347]5.2604,[348]5.2641,[349]5.2661,[350]5.2642,[351]5.2655,[352]5.2658,[353]5.2604,[354]5.2612,[355]5.2661,[356]5.2691,[357]5.2663,[358]5.2743,[359]5.2762,[360]5.2728,[361]5.2725,[362]5.2792,[363]5.2900,[364]5.2951,[365]5.2990,[366]5.3007,[367]5.3094,[368]5.3074,[369]5.3089,[370]5.3109,[371]5.3069,[372]5.3116,[373]5.3154,[374]5.3134,[375]5.3129,[376]5.3187,[377]5.3151,[378]5.3176,[379]5.3211,[380]5.3144,[381]5.3114,[382]5.3077,[383]5.3059,[384]5.3061,[385]5.3048,[386]5.3036,[387]5.3034,[388]5.3007,[389]5.2969,[390]5.2918,[391]5.2859,[392]5.2825,[393]5.2821,[394]5.2854,[395]5.2847,[396]5.2796,[397]5.2859,[398]5.2901,[399]5.2971,[400]5.2966,[401]5.2974,[402]5.2986,[403]5.3011,[404]5.3066,[405]5.2918,[406]5.2877,[407]5.2867,[408]5.2875,[409]5.2985,[410]5.3076,[411]5.3169,[412]5.3308,[413]5.3409,[414]5.3470,[415]5.3528,[416]5.3598,[417]5.3696,[418]5.3721,[419]5.3769,[420]5.3844,[421]5.3942,[422]5.3975,[423]5.4033,[424]5.4122,[425]5.4199,[426]5.4259,[427]5.4301,[428]5.4373,[429]5.4410,[430]5.4472,[431]5.4596,[432]5.4627,[433]5.4620,[434]5.4587,[435]5.4601,[436]5.4629,[437]5.4710,[438]5.4782,[439]5.4755,[440]5.4748,[441]5.4704,[442]5.4692,[443]5.4702,[444]5.4721,[445]5.4712,[446]5.4733,[447]5.4756,[448]5.4788,[449]5.4773,[450]5.4784,[451]5.4755,[452]5.4599,[453]5.4503,[454]5.4450,[455]5.4453,[456]5.4495,[457]5.4508,[458]5.4491,[459]5.4490,[460]5.4563,[461]5.4523,[462]5.4489,[463]5.4468,[464]5.4465,[465]5.4443,[466]5.4369,[467]5.4360,[468]5.4340,[469]5.4352,[470]5.4341,[471]5.4292,[472]5.4299,[473]5.4251,[474]5.4240,[475]5.4171,[476]5.4148,[477]5.4065,[478]5.4035,[479]5.4036,[480]5.4061,[481]5.4062,[482]5.4015,[483]5.3973,[484]5.3980,[485]5.3913,[486]5.3849,[487]5.3837,[488]5.3815,[489]5.3761,[490]5.3730,[491]5.3698,[492]5.3630,[493]5.3604,[494]5.3585,[495]5.3561,[496]5.3521,[497]5.3457,[498]5.3431,[499]5.3395,[500]5.3314,[501]5.3245,[502]5.3236,[503]5.3225,[504]5.3149,[505]5.3145,[506]5.3150,[507]5.3097,[508]5.3060,[509]5.3066,[510]5.3088,[511]5.3130,[512]5.3170,[513]5.3194,[514]5.3248,[515]5.3208,[516]5.3198,[517]5.3197,[518]5.3198,[519]5.3219,[520]5.3233,[521]5.3245,[522]5.3258,[523]5.3265,[524]5.3319,[525]5.3347,[526]5.3353,[527]5.3369,[528]5.3314,[529]5.3323,[530]5.3287,[531]5.3282,[532]5.3329,[533]5.3356,[534]5.3337,[535]5.3357,[536]5.3316,[537]5.3298,[538]5.3347,[539]5.3355,[540]5.3371,[541]5.3369,[542]5.3382,[543]5.3404,[544]5.3416,[545]5.3406,[546]5.3408,[547]5.3375,[548]5.3334,[549]5.3334,[550]5.3313,[551]5.3286,[552]5.3266,[553]5.3238,[554]5.3216,[555]5.3197,[556]5.3189,[557]5.3208,[558]5.3175,[559]5.3178,[560]5.3164,[561]5.3166,[562]5.3141,[563]5.3140,[564]5.3182,[565]5.3194,[566]5.3201,[567]5.3182,[568]5.3192,[569]5.3177,[570]5.3204,[571]5.3216,[572]5.3224,[573]5.3228,[574]5.3200,[575]5.3184,[576]5.3177,[577]5.3163,[578]5.3144,[579]5.3144,[580]5.3091,[581]5.3061,[582]5.3061,[583]5.3069,[584]5.3074,[585]5.3016,[586]5.2962,[587]5.2965,[588]5.3008,[589]5.3058,[590]5.3088,[591]5.3105,[592]5.3094,[593]5.3054,[594]5.3068,[595]5.3052,[596]5.3091,[597]5.3071,[598]5.3039,[599]5.3065,[600]5.3056,[601]5.3045,[602]5
.3046,[603]5.3073,[604]5.3079,[605]5.3106,[606]5.3120,[607]5.3105,[608]5.3077,[609]5.3086,[610]5.3126,[611]5.3116,[612]5.3138,[613]5.3109,[614]5.3070,[615]5.3011,[616]5.3036,[617]5.2986,[618]5.2942,[619]5.2898,[620]5.2789,[621]5.2739,[622]5.2721,[623]5.2734,[624]5.2737,[625]5.2745,[626]5.2741,[627]5.2768,[628]5.2776,[629]5.2780,[630]5.2811,[631]5.2854,[632]5.2902,[633]5.2891,[634]5.2920,[635]5.2918,[636]5.2884,[637]5.2848,[638]5.2868,[639]5.2838,[640]5.2845,[641]5.2849,[642]5.2899,[643]5.2915,[644]5.2933,[645]5.2919,[646]5.2953,[647]5.2902,[648]5.2913,[649]5.2914,[650]5.2943,[651]5.2983,[652]5.2987,[653]5.3024,[654]5.2970,[655]5.2962,
llama_print_timings:        load time =  6171.79 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 1846949.17 ms / 335360 tokens (    5.51 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 1878169.61 ms

@ikawrakow (Contributor, Author) commented:

@ggerganov Are these results with or without the changes you made to Q4_3 after I opened this PR (and reported the results)?

Commit messages added to the PR:

By default this new option is ON. One can turn it off by setting LLAMA_NO_RMSE.

With this option enabled, the Q4_3 quantization results in a perplexity of 6.0344, so 0.0273 lower than simple Q4_3 quantization.

The test does not work with RMSE minimization enabled, so the test cases have to be put between ifdefs.
@ggerganov (Owner) commented:

> @ggerganov Are these results with or without the changes you made to Q4_3 after I opened this PR (and reported the results)?

It includes all changes from today related to Q4_3 quantization. Maybe this is the source of the difference, although it's still strange, since the Q4_3 changes should just improve the performance. Of course, we cannot expect exactly the same results, but the difference is rather big, so I'm not 100% sure. I can run one extra Q4_3 13B run with yesterday's build to make sure.

@ikawrakow (Contributor, Author) commented:

@ggerganov Rebased this branch on latest master, re-quantized, and re-ran the perplexity. Now I get the lower result as well with OpenBLAS (5.2961, so actually 0.0001 lower than cuBLAS). So, something else has happened that positively impacts results. Another observation is that the OpenBLAS and cuBLAS results are not identical, unlike for the fp16 model: they are very close, but not exactly the same. See details.

Is it possible this affects this comment you made in #729?

perplexity : calculating perplexity over 655 chunks, batch_size=512 30.59 seconds per pass - ETA 5 hours 33 minutes [1]3.7363,[2]4.1741,[3]4.9573,[4]5.3622,[5]5.5408,[6]5.4786,[7]5.6388,[8]5.7498,[9]6.0085,[10]6.2361,[11]6.4224,[12]6.4857,[13]6.4488,[14]6.5426,[15]6.7435,[16]6.4223,[17]6.3394,[18]6.3167,[19]6.0232,[20]6.0023,[21]5.9256,[22]5.7530,[23]5.7200,[24]5.6257,[25]5.6326,[26]5.4844,[27]5.3093,[28]5.2082,[29]5.1320,[30]4.9981,[31]4.9567,[32]4.9675,[33]4.9237,[34]4.9636,[35]4.9806,[36]5.0033,[37]4.9960,[38]4.9914,[39]5.0201,[40]5.0615,[41]5.0861,[42]5.1202,[43]5.0861,[44]5.1306,[45]5.1347,[46]5.1095,[47]5.1369,[48]5.1183,[49]5.1225,[50]5.0926,[51]5.0997,[52]5.0919,[53]5.1384,[54]5.1290,[55]5.1113,[56]5.1310,[57]5.1488,[58]5.1709,[59]5.1902,[60]5.2259,[61]5.2187,[62]5.2734,[63]5.2981,[64]5.3100,[65]5.3462,[66]5.3454,[67]5.3633,[68]5.3760,[69]5.4044,[70]5.4348,[71]5.4581,[72]5.4918,[73]5.5384,[74]5.5450,[75]5.5549,[76]5.5686,[77]5.5801,[78]5.5663,[79]5.5932,[80]5.5870,[81]5.5950,[82]5.5918,[83]5.5465,[84]5.5364,[85]5.5300,[86]5.5155,[87]5.4508,[88]5.4069,[89]5.3857,[90]5.3749,[91]5.3959,[92]5.3921,[93]5.3939,[94]5.3926,[95]5.4191,[96]5.4161,[97]5.4127,[98]5.4088,[99]5.4019,[100]5.3993,[101]5.4222,[102]5.4176,[103]5.4329,[104]5.4376,[105]5.4389,[106]5.4529,[107]5.4516,[108]5.4665,[109]5.4657,[110]5.4605,[111]5.4783,[112]5.4949,[113]5.4942,[114]5.4929,[115]5.4971,[116]5.4851,[117]5.4846,[118]5.5080,[119]5.5258,[120]5.5548,[121]5.5700,[122]5.5911,[123]5.6275,[124]5.6451,[125]5.6401,[126]5.6757,[127]5.7085,[128]5.7368,[129]5.7255,[130]5.7340,[131]5.7300,[132]5.7256,[133]5.7132,[134]5.7221,[135]5.7221,[136]5.7139,[137]5.7100,[138]5.6973,[139]5.6895,[140]5.6883,[141]5.6613,[142]5.6574,[143]5.6326,[144]5.6167,[145]5.6083,[146]5.5972,[147]5.6019,[148]5.6049,[149]5.6018,[150]5.6011,[151]5.6057,[152]5.5998,[153]5.5902,[154]5.5846,[155]5.5907,[156]5.5891,[157]5.6045,[158]5.6061,[159]5.6071,[160]5.6109,[161]5.6225,[162]5.5971,[163]5.5877,[164]5.5676,[165]5.5426,[166]5.5195,[167]5.4880,[168]5.4612,[169]5.4483,[170]5.4389,[171]5.4184,[172]5.4062,[173]5.3929,[174]5.3660,[175]5.3457,[176]5.3327,[177]5.3161,[178]5.2963,[179]5.2832,[180]5.2757,[181]5.2596,[182]5.2438,[183]5.2319,[184]5.2311,[185]5.2240,[186]5.2252,[187]5.2308,[188]5.2284,[189]5.2447,[190]5.2451,[191]5.2619,[192]5.2755,[193]5.2900,[194]5.3014,[195]5.3208,[196]5.3324,[197]5.3513,[198]5.3647,[199]5.3667,[200]5.3676,[201]5.3610,[202]5.3734,[203]5.3792,[204]5.3744,[205]5.3834,[206]5.3888,[207]5.3851,[208]5.3906,[209]5.3943,[210]5.3998,[211]5.4100,[212]5.4164,[213]5.4254,[214]5.4288,[215]5.4319,[216]5.4438,[217]5.4603,[218]5.4738,[219]5.4735,[220]5.4706,[221]5.4657,[222]5.4658,[223]5.4597,[224]5.4532,[225]5.4496,[226]5.4696,[227]5.4756,[228]5.4828,[229]5.4899,[230]5.4862,[231]5.5013,[232]5.4910,[233]5.4762,[234]5.4620,[235]5.4403,[236]5.4352,[237]5.4269,[238]5.4303,[239]5.4192,[240]5.4102,[241]5.4136,[242]5.4153,[243]5.4147,[244]5.4049,[245]5.4014,[246]5.3912,[247]5.3815,[248]5.3755,[249]5.3722,[250]5.3757,[251]5.3675,[252]5.3626,[253]5.3537,[254]5.3492,[255]5.3401,[256]5.3237,[257]5.3136,[258]5.3070,[259]5.3062,[260]5.2981,[261]5.2931,[262]5.2890,[263]5.2843,[264]5.2605,[265]5.2605,[266]5.2575,[267]5.2514,[268]5.2580,[269]5.2572,[270]5.2580,[271]5.2640,[272]5.2668,[273]5.2678,[274]5.2685,[275]5.2744,[276]5.2801,[277]5.2921,[278]5.3005,[279]5.3085,[280]5.3122,[281]5.3216,[282]5.3269,[283]5.3390,[284]5.3473,[285]5.3553,[286]5.3679,[287]5.3645,[288]5.3696,[289]5.3634,[290]5.3495,[291]5.3367,[292]5.3234,[293]5.3117,[294]5.3125,[295]5.3125,[296]5.
3172,[297]5.3161,[298]5.3181,[299]5.3160,[300]5.3074,[301]5.3077,[302]5.3015,[303]5.2931,[304]5.2860,[305]5.2835,[306]5.2733,[307]5.2761,[308]5.2769,[309]5.2637,[310]5.2612,[311]5.2570,[312]5.2585,[313]5.2533,[314]5.2515,[315]5.2387,[316]5.2343,[317]5.2222,[318]5.2060,[319]5.2165,[320]5.2273,[321]5.2322,[322]5.2293,[323]5.2238,[324]5.2220,[325]5.2316,[326]5.2329,[327]5.2335,[328]5.2373,[329]5.2422,[330]5.2445,[331]5.2547,[332]5.2512,[333]5.2586,[334]5.2541,[335]5.2490,[336]5.2514,[337]5.2502,[338]5.2501,[339]5.2458,[340]5.2431,[341]5.2495,[342]5.2528,[343]5.2568,[344]5.2571,[345]5.2586,[346]5.2569,[347]5.2604,[348]5.2641,[349]5.2661,[350]5.2642,[351]5.2656,[352]5.2658,[353]5.2604,[354]5.2612,[355]5.2661,[356]5.2691,[357]5.2663,[358]5.2744,[359]5.2762,[360]5.2728,[361]5.2725,[362]5.2792,[363]5.2900,[364]5.2951,[365]5.2990,[366]5.3008,[367]5.3094,[368]5.3074,[369]5.3089,[370]5.3109,[371]5.3069,[372]5.3116,[373]5.3154,[374]5.3134,[375]5.3129,[376]5.3187,[377]5.3152,[378]5.3176,[379]5.3211,[380]5.3144,[381]5.3114,[382]5.3077,[383]5.3059,[384]5.3061,[385]5.3048,[386]5.3036,[387]5.3034,[388]5.3007,[389]5.2969,[390]5.2918,[391]5.2859,[392]5.2826,[393]5.2821,[394]5.2854,[395]5.2847,[396]5.2796,[397]5.2859,[398]5.2901,[399]5.2971,[400]5.2966,[401]5.2974,[402]5.2986,[403]5.3011,[404]5.3066,[405]5.2918,[406]5.2876,[407]5.2867,[408]5.2875,[409]5.2985,[410]5.3076,[411]5.3169,[412]5.3308,[413]5.3409,[414]5.3470,[415]5.3528,[416]5.3598,[417]5.3696,[418]5.3721,[419]5.3769,[420]5.3844,[421]5.3942,[422]5.3975,[423]5.4033,[424]5.4122,[425]5.4199,[426]5.4259,[427]5.4301,[428]5.4373,[429]5.4410,[430]5.4472,[431]5.4596,[432]5.4627,[433]5.4620,[434]5.4587,[435]5.4601,[436]5.4629,[437]5.4710,[438]5.4782,[439]5.4755,[440]5.4748,[441]5.4704,[442]5.4692,[443]5.4702,[444]5.4721,[445]5.4712,[446]5.4733,[447]5.4756,[448]5.4788,[449]5.4773,[450]5.4784,[451]5.4755,[452]5.4599,[453]5.4503,[454]5.4450,[455]5.4453,[456]5.4495,[457]5.4508,[458]5.4491,[459]5.4490,[460]5.4563,[461]5.4523,[462]5.4489,[463]5.4468,[464]5.4465,[465]5.4443,[466]5.4369,[467]5.4360,[468]5.4340,[469]5.4352,[470]5.4341,[471]5.4292,[472]5.4299,[473]5.4251,[474]5.4239,[475]5.4171,[476]5.4147,[477]5.4064,[478]5.4035,[479]5.4036,[480]5.4060,[481]5.4062,[482]5.4015,[483]5.3973,[484]5.3980,[485]5.3913,[486]5.3848,[487]5.3836,[488]5.3814,[489]5.3761,[490]5.3730,[491]5.3697,[492]5.3630,[493]5.3603,[494]5.3584,[495]5.3561,[496]5.3521,[497]5.3457,[498]5.3430,[499]5.3394,[500]5.3313,[501]5.3245,[502]5.3235,[503]5.3225,[504]5.3148,[505]5.3145,[506]5.3150,[507]5.3097,[508]5.3060,[509]5.3065,[510]5.3088,[511]5.3130,[512]5.3169,[513]5.3194,[514]5.3247,[515]5.3207,[516]5.3197,[517]5.3197,[518]5.3197,[519]5.3219,[520]5.3233,[521]5.3244,[522]5.3258,[523]5.3265,[524]5.3319,[525]5.3347,[526]5.3352,[527]5.3368,[528]5.3313,[529]5.3323,[530]5.3286,[531]5.3281,[532]5.3329,[533]5.3356,[534]5.3337,[535]5.3356,[536]5.3315,[537]5.3298,[538]5.3346,[539]5.3354,[540]5.3370,[541]5.3368,[542]5.3382,[543]5.3403,[544]5.3415,[545]5.3405,[546]5.3407,[547]5.3375,[548]5.3334,[549]5.3334,[550]5.3312,[551]5.3286,[552]5.3266,[553]5.3238,[554]5.3216,[555]5.3196,[556]5.3189,[557]5.3207,[558]5.3174,[559]5.3177,[560]5.3164,[561]5.3166,[562]5.3141,[563]5.3139,[564]5.3182,[565]5.3194,[566]5.3201,[567]5.3182,[568]5.3192,[569]5.3177,[570]5.3203,[571]5.3216,[572]5.3224,[573]5.3228,[574]5.3200,[575]5.3184,[576]5.3177,[577]5.3162,[578]5.3144,[579]5.3144,[580]5.3090,[581]5.3061,[582]5.3061,[583]5.3068,[584]5.3073,[585]5.3016,[586]5.2962,[587]5.2965,[588]5.3007,[589]5.3057,[590]5.3087,[591]5.3104,[592]5.309
3,[593]5.3053,[594]5.3068,[595]5.3052,[596]5.3090,[597]5.3070,[598]5.3039,[599]5.3065,[600]5.3056,[601]5.3044,[602]5.3045,[603]5.3073,[604]5.3079,[605]5.3105,[606]5.3119,[607]5.3105,[608]5.3077,[609]5.3085,[610]5.3126,[611]5.3115,[612]5.3138,[613]5.3109,[614]5.3070,[615]5.3010,[616]5.3035,[617]5.2985,[618]5.2942,[619]5.2898,[620]5.2789,[621]5.2738,[622]5.2720,[623]5.2733,[624]5.2736,[625]5.2744,[626]5.2741,[627]5.2767,[628]5.2776,[629]5.2779,[630]5.2811,[631]5.2853,[632]5.2901,[633]5.2890,[634]5.2920,[635]5.2917,[636]5.2883,[637]5.2848,[638]5.2868,[639]5.2838,[640]5.2844,[641]5.2849,[642]5.2898,[643]5.2915,[644]5.2932,[645]5.2919,[646]5.2953,[647]5.2901,[648]5.2912,[649]5.2914,[650]5.2942,[651]5.2982,[652]5.2987,[653]5.3024,[654]5.2969,[655]5.2961,

llama_print_timings:        load time = 31077.61 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 9573081.81 ms / 335360 tokens (   28.55 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time = 9604412.75 ms

@ggerganov (Owner) commented Apr 22, 2023

I think we cannot expect cuBLAS and OpenBLAS to be exactly the same, because cuBLAS dequantizes x to F16, casts y to F16, and performs an F16 mat mul, while OpenBLAS dequantizes x to F32 and performs an F32 mat mul (if I'm not mistaken).

@slaren (Collaborator) commented Apr 22, 2023

> cuBLAS dequantizes x to F16 and casts y to F16 and performs F16 mat mul, while OpenBLAS dequantizes x to F32 and performs F32 mat mul (if I'm not mistaken)

That's not exactly the case: when multiplying q x f32, cuBLAS dequantizes to f32 and does an f32 x f32 mat mul. The only difference from OpenBLAS is when performing an f16 x f32 mat mul (ggml_compute_forward_mul_mat_f16_f32): in this case, src1 is converted to f16 instead of converting src0 to f32, and an f16 x f16 mat mul is done.
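
A toy illustration (not ggml code) of why the two paths need not match exactly: rounding operands to half precision before the multiply-accumulate perturbs each dot product slightly. It assumes a compiler and target that support the _Float16 type:

#include <stdio.h>

int main(void) {
    const float x[4] = { 0.1234567f, -2.7182818f, 3.1415927f, 0.0001234f };
    const float y[4] = { 1.5f,       -0.3333333f, 0.7071068f, 123.456f   };

    float dot_f32 = 0.0f;  /* operands kept in f32 (OpenBLAS-like path) */
    float dot_f16 = 0.0f;  /* operands rounded to f16 first (cuBLAS-like path) */
    for (int i = 0; i < 4; ++i) {
        dot_f32 += x[i] * y[i];
        dot_f16 += (float)(_Float16)x[i] * (float)(_Float16)y[i];
    }
    printf("f32 path: %.9f\nf16 path: %.9f\n", dot_f32, dot_f16);
    return 0;
}

Accumulated over 655 chunks of log-probabilities, such tiny per-dot-product differences can be enough to shift the reported perplexity in the last decimal places.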

@ikawrakow (Contributor, Author) commented:

@ggerganov I propose we close this PR. Although there is some benefit from RMSE minimization for QX_1 and QX_3 quantization of the 7B model, the benefit mostly goes away for 13B (and Q5_1 is actually worse with RMSE minimization than without at 13B).

@ivanstepanovftw (Collaborator) commented:

You are minimizing error, so why should it be worse? It may be worse for one case but better for another, no?

@ivanstepanovftw (Collaborator) commented May 3, 2023

By that I mean that perplexity on a wide range of other files (other than en-wikitext or whatever) may be better. And maybe not for one model, but for another...

Quantization itself is here to compress the data as much as possible without affecting the model's quality much.

@ggerganov closed this on May 3, 2023
@ikawrakow deleted the ik/rmse_quantization branch on June 11, 2023