RMSE-optimized quants for all quantization types #1106
Conversation
Sounds like a good idea. For me personally, I/O is the bottleneck, since I store the models on a NAS.
Force-pushed from 364c00a to 7ca90a8
It might be a good idea to get #953 merged first, which implements unit tests for the quantization. But that requires an improvement to the test samples.
Please update SHA256SUMS, at the very least remove the files which are now different.
float scale = sumlx/suml2;
return scale;
}
static float kquantize_q4_with_bound_plus(int n, int nmax, const float * restrict X, int nCandidates,
What does _plus mean? Couldn't you re-use kquantize_q4_with_bounds with nmin=0?
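A side note on the excerpt above, for readers wondering what the scale line is doing: scale = sumlx/suml2 is the closed-form least-squares scale for a fixed set of integer quants. Assuming sumlx = Σᵢ xᵢlᵢ and suml2 = Σᵢ lᵢ² (which is how the variable names read, though the surrounding code is not shown here), a quick derivation:

```latex
% Minimize the squared error over the scale s for fixed quants l_i:
E(s) = \sum_i \left(x_i - s\,l_i\right)^2
% Setting the derivative to zero:
\frac{dE}{ds} = -2\sum_i l_i\left(x_i - s\,l_i\right) = 0
\quad\Longrightarrow\quad
s^* = \frac{\sum_i x_i\,l_i}{\sum_i l_i^2} = \frac{\mathtt{sumlx}}{\mathtt{suml2}}
```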
I'm still a bit skeptical that chasing after RMSE is the right thing to do. Let me explain what I mean: originally the Q4 methods calculate max(abs()) and divide that by 7. What if it actually helps perplexity to clip the largest values somewhat, even if that comes at a higher RMS error?
So the approach to find that out would be to use #729, choose a value in the interesting range of maybe [7,11], quantize the model, do a perplexity run, lather, rinse, repeat.
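To make the clipping trade-off concrete, here is a minimal, self-contained sketch (not llama.cpp's quantization code; the function and test data are made up for illustration) that quantizes one block to 4-bit levels with scale = max|x|/div and reports the RMSE as div grows beyond 7, i.e. as the largest values start getting clipped:

```c
#include <math.h>
#include <stdio.h>

/* Quantize a block to integer levels in [-7, 7] with scale = max|x| / div,
 * clip whatever falls outside the range, and return the resulting RMSE.
 * div = 7 reproduces the original max/7 behaviour; div > 7 clips outliers. */
static float quantize_rmse(const float *x, int n, float div) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = fmaxf(amax, fabsf(x[i]));
    const float scale = amax / div;           /* larger div => smaller scale => clipping */
    float err2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        int l = (int)roundf(x[i] / scale);
        if (l >  7) l =  7;                   /* values beyond the range get clipped */
        if (l < -7) l = -7;
        const float d = x[i] - scale * (float)l;
        err2 += d * d;
    }
    return sqrtf(err2 / (float)n);
}

int main(void) {
    float x[32];
    for (int i = 0; i < 32; ++i) x[i] = sinf((float)i) * (i == 13 ? 4.0f : 1.0f); /* one outlier */
    for (float div = 7.0f; div <= 11.0f; div += 1.0f)
        printf("div = %4.1f  ->  RMSE = %.4f\n", div, quantize_rmse(x, 32, div));
    return 0;
}
```

RMSE goes up as div increases, so the question above is exactly whether the perplexity can nonetheless go down.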
Just made a full cuBLAS run on 13B using
Will make another run, this time using RMSE optimization (i.e. same as the one in OP) and double-check the reported
My result for 13B, using
This result I think makes more sense, since it is in line with my expectation that I described here: #406 (reply in thread)
@ggerganov Are these results with or without the changes you made to
By default this new option is ON. One can turn it off by setting LLAMA_NO_RMSE. With this option enabled, the Q4_3 quantization results in a perplexity of 6.0344, so 0.0273 lower than simple Q4_3 quantization.
Test does not work with RMSE-minimization enabled, so have to put the test cases between ifdefs.
It includes all changes from today related to
@ggerganov Rebased this branch on latest master, re-quantized, and re-ran the perplexity. Now I get the lower result as well with OpenBLAS. Is it possible this affects this comment you made in #729?
perplexity : calculating perplexity over 655 chunks, batch_size=512
30.59 seconds per pass - ETA 5 hours 33 minutes
[1]3.7363,[2]4.1741,[3]4.9573,[4]5.3622,[5]5.5408,[6]5.4786,[7]5.6388,[8]5.7498,[9]6.0085,[10]6.2361,[11]6.4224,[12]6.4857,[13]6.4488,[14]6.5426,[15]6.7435,[16]6.4223,[17]6.3394,[18]6.3167,[19]6.0232,[20]6.0023,[21]5.9256,[22]5.7530,[23]5.7200,[24]5.6257,[25]5.6326,[26]5.4844,[27]5.3093,[28]5.2082,[29]5.1320,[30]4.9981,[31]4.9567,[32]4.9675,[33]4.9237,[34]4.9636,[35]4.9806,[36]5.0033,[37]4.9960,[38]4.9914,[39]5.0201,[40]5.0615,[41]5.0861,[42]5.1202,[43]5.0861,[44]5.1306,[45]5.1347,[46]5.1095,[47]5.1369,[48]5.1183,[49]5.1225,[50]5.0926,[51]5.0997,[52]5.0919,[53]5.1384,[54]5.1290,[55]5.1113,[56]5.1310,[57]5.1488,[58]5.1709,[59]5.1902,[60]5.2259,[61]5.2187,[62]5.2734,[63]5.2981,[64]5.3100,[65]5.3462,[66]5.3454,[67]5.3633,[68]5.3760,[69]5.4044,[70]5.4348,[71]5.4581,[72]5.4918,[73]5.5384,[74]5.5450,[75]5.5549,[76]5.5686,[77]5.5801,[78]5.5663,[79]5.5932,[80]5.5870,[81]5.5950,[82]5.5918,[83]5.5465,[84]5.5364,[85]5.5300,[86]5.5155,[87]5.4508,[88]5.4069,[89]5.3857,[90]5.3749,[91]5.3959,[92]5.3921,[93]5.3939,[94]5.3926,[95]5.4191,[96]5.4161,[97]5.4127,[98]5.4088,[99]5.4019,[100]5.3993,[101]5.4222,[102]5.4176,[103]5.4329,[104]5.4376,[105]5.4389,[106]5.4529,[107]5.4516,[108]5.4665,[109]5.4657,[110]5.4605,[111]5.4783,[112]5.4949,[113]5.4942,[114]5.4929,[115]5.4971,[116]5.4851,[117]5.4846,[118]5.5080,[119]5.5258,[120]5.5548,[121]5.5700,[122]5.5911,[123]5.6275,[124]5.6451,[125]5.6401,[126]5.6757,[127]5.7085,[128]5.7368,[129]5.7255,[130]5.7340,[131]5.7300,[132]5.7256,[133]5.7132,[134]5.7221,[135]5.7221,[136]5.7139,[137]5.7100,[138]5.6973,[139]5.6895,[140]5.6883,[141]5.6613,[142]5.6574,[143]5.6326,[144]5.6167,[145]5.6083,[146]5.5972,[147]5.6019,[148]5.6049,[149]5.6018,[150]5.6011,[151]5.6057,[152]5.5998,[153]5.5902,[154]5.5846,[155]5.5907,[156]5.5891,[157]5.6045,[158]5.6061,[159]5.6071,[160]5.6109,[161]5.6225,[162]5.5971,[163]5.5877,[164]5.5676,[165]5.5426,[166]5.5195,[167]5.4880,[168]5.4612,[169]5.4483,[170]5.4389,[171]5.4184,[172]5.4062,[173]5.3929,[174]5.3660,[175]5.3457,[176]5.3327,[177]5.3161,[178]5.2963,[179]5.2832,[180]5.2757,[181]5.2596,[182]5.2438,[183]5.2319,[184]5.2311,[185]5.2240,[186]5.2252,[187]5.2308,[188]5.2284,[189]5.2447,[190]5.2451,[191]5.2619,[192]5.2755,[193]5.2900,[194]5.3014,[195]5.3208,[196]5.3324,[197]5.3513,[198]5.3647,[199]5.3667,[200]5.3676,[201]5.3610,[202]5.3734,[203]5.3792,[204]5.3744,[205]5.3834,[206]5.3888,[207]5.3851,[208]5.3906,[209]5.3943,[210]5.3998,[211]5.4100,[212]5.4164,[213]5.4254,[214]5.4288,[215]5.4319,[216]5.4438,[217]5.4603,[218]5.4738,[219]5.4735,[220]5.4706,[221]5.4657,[222]5.4658,[223]5.4597,[224]5.4532,[225]5.4496,[226]5.4696,[227]5.4756,[228]5.4828,[229]5.4899,[230]5.4862,[231]5.5013,[232]5.4910,[233]5.4762,[234]5.4620,[235]5.4403,[236]5.4352,[237]5.4269,[238]5.4303,[239]5.4192,[240]5.4102,[241]5.4136,[242]5.4153,[243]5.4147,[244]5.4049,[245]5.4014,[246]5.3912,[247]5.3815,[248]5.3755,[249]5.3722,[250]5.3757,[251]5.3675,[252]5.3626,[253]5.3537,[254]5.3492,[255]5.3401,[256]5.3237,[257]5.3136,[258]5.3070,[259]5.3062,[260]5.2981,[261]5.2931,[262]5.2890,[263]5.2843,[264]5.2605,[265]5.2605,[266]5.2575,[267]5.2514,[268]5.2580,[269]5.2572,[270]5.2580,[271]5.2640,[272]5.2668,[273]5.2678,[274]5.2685,[275]5.2744,[276]5.2801,[277]5.2921,[278]5.3005,[279]5.3085,[280]5.3122,[281]5.3216,[282]5.3269,[283]5.3390,[284]5.3473,[285]5.3553,[286]5.3679,[287]5.3645,[288]5.3696,[289]5.3634,[290]5.3495,[291]5.3367,[292]5.3234,[293]5.3117,[294]5.3125,[295]5.3125,[296]5.3172,[297]5.3161,[298]5.3181,[299]5.3160,[300]5.3074,[301]5.3077,[302]5.3015,[303]5.2931,[304]5.2860,[305]5.2835,[30
6]5.2733,[307]5.2761,[308]5.2769,[309]5.2637,[310]5.2612,[311]5.2570,[312]5.2585,[313]5.2533,[314]5.2515,[315]5.2387,[316]5.2343,[317]5.2222,[318]5.2060,[319]5.2165,[320]5.2273,[321]5.2322,[322]5.2293,[323]5.2238,[324]5.2220,[325]5.2316,[326]5.2329,[327]5.2335,[328]5.2373,[329]5.2422,[330]5.2445,[331]5.2547,[332]5.2512,[333]5.2586,[334]5.2541,[335]5.2490,[336]5.2514,[337]5.2502,[338]5.2501,[339]5.2458,[340]5.2431,[341]5.2495,[342]5.2528,[343]5.2568,[344]5.2571,[345]5.2586,[346]5.2569,[347]5.2604,[348]5.2641,[349]5.2661,[350]5.2642,[351]5.2656,[352]5.2658,[353]5.2604,[354]5.2612,[355]5.2661,[356]5.2691,[357]5.2663,[358]5.2744,[359]5.2762,[360]5.2728,[361]5.2725,[362]5.2792,[363]5.2900,[364]5.2951,[365]5.2990,[366]5.3008,[367]5.3094,[368]5.3074,[369]5.3089,[370]5.3109,[371]5.3069,[372]5.3116,[373]5.3154,[374]5.3134,[375]5.3129,[376]5.3187,[377]5.3152,[378]5.3176,[379]5.3211,[380]5.3144,[381]5.3114,[382]5.3077,[383]5.3059,[384]5.3061,[385]5.3048,[386]5.3036,[387]5.3034,[388]5.3007,[389]5.2969,[390]5.2918,[391]5.2859,[392]5.2826,[393]5.2821,[394]5.2854,[395]5.2847,[396]5.2796,[397]5.2859,[398]5.2901,[399]5.2971,[400]5.2966,[401]5.2974,[402]5.2986,[403]5.3011,[404]5.3066,[405]5.2918,[406]5.2876,[407]5.2867,[408]5.2875,[409]5.2985,[410]5.3076,[411]5.3169,[412]5.3308,[413]5.3409,[414]5.3470,[415]5.3528,[416]5.3598,[417]5.3696,[418]5.3721,[419]5.3769,[420]5.3844,[421]5.3942,[422]5.3975,[423]5.4033,[424]5.4122,[425]5.4199,[426]5.4259,[427]5.4301,[428]5.4373,[429]5.4410,[430]5.4472,[431]5.4596,[432]5.4627,[433]5.4620,[434]5.4587,[435]5.4601,[436]5.4629,[437]5.4710,[438]5.4782,[439]5.4755,[440]5.4748,[441]5.4704,[442]5.4692,[443]5.4702,[444]5.4721,[445]5.4712,[446]5.4733,[447]5.4756,[448]5.4788,[449]5.4773,[450]5.4784,[451]5.4755,[452]5.4599,[453]5.4503,[454]5.4450,[455]5.4453,[456]5.4495,[457]5.4508,[458]5.4491,[459]5.4490,[460]5.4563,[461]5.4523,[462]5.4489,[463]5.4468,[464]5.4465,[465]5.4443,[466]5.4369,[467]5.4360,[468]5.4340,[469]5.4352,[470]5.4341,[471]5.4292,[472]5.4299,[473]5.4251,[474]5.4239,[475]5.4171,[476]5.4147,[477]5.4064,[478]5.4035,[479]5.4036,[480]5.4060,[481]5.4062,[482]5.4015,[483]5.3973,[484]5.3980,[485]5.3913,[486]5.3848,[487]5.3836,[488]5.3814,[489]5.3761,[490]5.3730,[491]5.3697,[492]5.3630,[493]5.3603,[494]5.3584,[495]5.3561,[496]5.3521,[497]5.3457,[498]5.3430,[499]5.3394,[500]5.3313,[501]5.3245,[502]5.3235,[503]5.3225,[504]5.3148,[505]5.3145,[506]5.3150,[507]5.3097,[508]5.3060,[509]5.3065,[510]5.3088,[511]5.3130,[512]5.3169,[513]5.3194,[514]5.3247,[515]5.3207,[516]5.3197,[517]5.3197,[518]5.3197,[519]5.3219,[520]5.3233,[521]5.3244,[522]5.3258,[523]5.3265,[524]5.3319,[525]5.3347,[526]5.3352,[527]5.3368,[528]5.3313,[529]5.3323,[530]5.3286,[531]5.3281,[532]5.3329,[533]5.3356,[534]5.3337,[535]5.3356,[536]5.3315,[537]5.3298,[538]5.3346,[539]5.3354,[540]5.3370,[541]5.3368,[542]5.3382,[543]5.3403,[544]5.3415,[545]5.3405,[546]5.3407,[547]5.3375,[548]5.3334,[549]5.3334,[550]5.3312,[551]5.3286,[552]5.3266,[553]5.3238,[554]5.3216,[555]5.3196,[556]5.3189,[557]5.3207,[558]5.3174,[559]5.3177,[560]5.3164,[561]5.3166,[562]5.3141,[563]5.3139,[564]5.3182,[565]5.3194,[566]5.3201,[567]5.3182,[568]5.3192,[569]5.3177,[570]5.3203,[571]5.3216,[572]5.3224,[573]5.3228,[574]5.3200,[575]5.3184,[576]5.3177,[577]5.3162,[578]5.3144,[579]5.3144,[580]5.3090,[581]5.3061,[582]5.3061,[583]5.3068,[584]5.3073,[585]5.3016,[586]5.2962,[587]5.2965,[588]5.3007,[589]5.3057,[590]5.3087,[591]5.3104,[592]5.3093,[593]5.3053,[594]5.3068,[595]5.3052,[596]5.3090,[597]5.3070,[598]5.3039,[599]5.3065,[600]5.3056,[601]5.3044,[602]5
.3045,[603]5.3073,[604]5.3079,[605]5.3105,[606]5.3119,[607]5.3105,[608]5.3077,[609]5.3085,[610]5.3126,[611]5.3115,[612]5.3138,[613]5.3109,[614]5.3070,[615]5.3010,[616]5.3035,[617]5.2985,[618]5.2942,[619]5.2898,[620]5.2789,[621]5.2738,[622]5.2720,[623]5.2733,[624]5.2736,[625]5.2744,[626]5.2741,[627]5.2767,[628]5.2776,[629]5.2779,[630]5.2811,[631]5.2853,[632]5.2901,[633]5.2890,[634]5.2920,[635]5.2917,[636]5.2883,[637]5.2848,[638]5.2868,[639]5.2838,[640]5.2844,[641]5.2849,[642]5.2898,[643]5.2915,[644]5.2932,[645]5.2919,[646]5.2953,[647]5.2901,[648]5.2912,[649]5.2914,[650]5.2942,[651]5.2982,[652]5.2987,[653]5.3024,[654]5.2969,[655]5.2961,
llama_print_timings: load time = 31077.61 ms
I think we cannot expect cuBLAS and OpenBLAS to be exactly the same because cuBLAS dequantizes
That's not exactly the case: when multiplying q x f32, cuBLAS dequantizes to f32 and does an f32 x f32 mat mul. The only difference with OpenBLAS is when performing an f16 x f32 mat mul.
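For readers unfamiliar with the two paths, here is a rough, self-contained sketch of the "dequantize to f32, then plain f32 x f32 multiply" flow being described. The struct layout and names are illustrative, not llama.cpp's actual types; the point is that any backend computing the same f32 product should agree up to normal floating-point rounding, and only an f16 x f32 path could introduce a systematic difference.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative 4-bit block: one f32 scale plus 32 quants packed two per byte. */
typedef struct {
    float   d;        /* block scale */
    uint8_t qs[16];   /* 32 x 4-bit quants */
} block_q4_sketch;

/* Step 1: expand one 32-element block to f32. */
static void dequantize_block(const block_q4_sketch * b, float * y) {
    for (int i = 0; i < 16; ++i) {
        y[2*i + 0] = ((int)(b->qs[i] & 0x0F) - 8) * b->d;
        y[2*i + 1] = ((int)(b->qs[i] >>   4) - 8) * b->d;
    }
}

/* Step 2: a plain f32 x f32 product on the dequantized weights.
 * y = W x, where W is rows x 32 stored as one block per row. */
static void mat_vec_q4_f32(const block_q4_sketch * W, const float * x,
                           float * y, int rows) {
    float w[32];
    for (int r = 0; r < rows; ++r) {
        dequantize_block(&W[r], w);
        float acc = 0.0f;
        for (int c = 0; c < 32; ++c) acc += w[c] * x[c];
        y[r] = acc;
    }
}

int main(void) {
    block_q4_sketch W = { .d = 0.1f };
    for (int i = 0; i < 16; ++i) W.qs[i] = (uint8_t)(i | ((15 - i) << 4));
    float x[32];
    for (int i = 0; i < 32; ++i) x[i] = 1.0f;
    float y;
    mat_vec_q4_f32(&W, x, &y, 1);
    printf("y = %f\n", y);
    return 0;
}
```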
Force-pushed from 7ca90a8 to 6fd49ed
@ggerganov I propose we close this PR. Although there is some benefit from RMSE minimization for
You are minimizing error, so why should it be worse? It may be worse for one case but better for another, no?
By that I mean that perplexity for a wide range of other files (other than en-wikitext or whatever) may be better. And not for one model but for another... Quantization is here to compress the data as much as possible without affecting the model's quality much.
The PR adds a new build option (LLAMA_NO_RMSE), which is off by default. When it is off, all current quantization types (Q4_0, Q4_1, Q4_2, Q4_3) are performed with RMSE minimization (on master, RMSE minimization is enabled for Q4_2 only and cannot easily be disabled). This makes generation of quantized models quite a bit longer, but still in the same ballpark as it took before quantization was multi-threaded in PR #1075.

With RMSE minimization enabled (the default), Q4_3 gives a perplexity of 6.0344 for the 7B model, so 0.0273 lower than simple Q4_3 quantization as reported by @ggerganov in #406. If I also enable his trick of not quantizing the output tensors, perplexity becomes 6.0085. The Q4_3 perplexity without quantization of the output tensors for the 13B model is 5.3117. Details for these perplexity runs can be found in issue #406.

As far as I can tell, we are now on par with the best known GPTQ result for 7B, and better for 13B by about 0.05.