Is there any perplexity data for using 16bit vs 32bit memory? #1593
-
I'm talking about --memory-f32. It seems like the general consensus is that there's no noticeable difference for actual models. In fact, based on the quantization section in the README, there's virtually no difference between 16bit and 32bit.
-
I did some of my own testing. The conclusion looks to be that there is no effective difference; it doesn't seem like there's a case for ever using --memory-f32. Not sure if it makes a difference, but the test ran with cuBLAS enabled and some layers offloaded.

Running the perplexity calculation on LLaMA 7B Q4_0:

32bit memory
[1]4.4544,[2]4.9400,[3]5.8279,[4]6.4844,[5]6.5856,[6]6.5088,[7]6.6927,[8]6.8060,[9]7.1427,[10]7.3866
[...]
[207]6.1957,[208]6.2042,[209]6.2087,[210]6.2146,[211]6.2247,[212]6.2317,[213]6.2420,[214]6.2449,[215]6.2478,[216]6.2612

16bit memory
[1]4.4544,[2]4.9400,[3]5.8279,[4]6.4844,[5]6.5856,[6]6.5088,[7]6.6927,[8]6.8060,[9]7.1427,[10]7.3866
[...]
[207]6.1957,[208]6.2042,[209]6.2087,[210]6.2146,[211]6.2247,[212]6.2317,[213]6.2420,[214]6.2449,[215]6.2478,[216]6.2612

I also did a very short test with the Q8_0 version, just to check whether the difference was being lost in noise from the lower-quality quantization:

32bit memory
[1]4.2284,[2]4.7007,[3]5.5711,[4]6.1757,[5]6.2967,[6]6.2677,[7]6.4631,[8]6.5548,[9]6.8742,[10]7.1204,[11]7.3161,[12]7.3371,[13]7.2474,[14]7.2943,[15]7.5318,[16]7.1632,[17]7.0561,[18]7.0044,[19]6.6580,[20]6.6455

16bit memory
[1]4.2285,[2]4.7009,[3]5.5714,[4]6.1760,[5]6.2969,[6]6.2679,[7]6.4635,[8]6.5551,[9]6.8744,[10]7.1206,[11]7.3163,[12]7.3373,[13]7.2475,[14]7.2944,[15]7.5320,[16]7.1634,[17]7.0563,[18]7.0046,[19]6.6582,[20]6.6457
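To quantify "no effective difference": here is a small Python sketch (not part of the original test run, it just re-reads the Q8_0 chunk values posted above) showing that the two runs never differ by more than 0.0004 perplexity per chunk.

```python
# Compare the two Q8_0 runs quoted above; values copied verbatim from the post.
f32 = [4.2284, 4.7007, 5.5711, 6.1757, 6.2967, 6.2677, 6.4631, 6.5548, 6.8742, 7.1204,
       7.3161, 7.3371, 7.2474, 7.2943, 7.5318, 7.1632, 7.0561, 7.0044, 6.6580, 6.6455]
f16 = [4.2285, 4.7009, 5.5714, 6.1760, 6.2969, 6.2679, 6.4635, 6.5551, 6.8744, 7.1206,
       7.3163, 7.3373, 7.2475, 7.2944, 7.5320, 7.1634, 7.0563, 7.0046, 6.6582, 6.6457]

deltas = [abs(a - b) for a, b in zip(f32, f16)]
print(f"max delta:  {max(deltas):.4f}")                # 0.0004
print(f"mean delta: {sum(deltas) / len(deltas):.5f}")  # 0.00021
```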
-
I guess the next question would be: is there any reason to keep the --memory-f32 option at all? There should probably at least be an indication for the user that it doesn't increase quality in any measurable way but uses twice as much memory.
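To put a rough number on "twice as much memory": the extra cost is in the KV cache. A back-of-the-envelope sketch, assuming the original LLaMA 7B shape (32 layers, 4096 embedding width, no grouped-query attention) and the full 2048-token context:

```python
# KV-cache size estimate. Assumptions: LLaMA 7B (n_layer = 32, n_embd = 4096),
# full 2048-token context; K and V each hold n_layer * n_ctx * n_embd elements.
n_layer, n_embd, n_ctx = 32, 4096, 2048
elements = 2 * n_layer * n_ctx * n_embd  # factor 2 for K and V

for name, bytes_per_elem in (("f16", 2), ("f32", 4)):
    print(f"{name}: {elements * bytes_per_elem / 2**30:.1f} GiB")
# f16: 1.0 GiB
# f32: 2.0 GiB
```

So on 7B at full context the flag costs roughly an extra gigabyte for no measurable perplexity gain, and the gap grows with context length and model size.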
-
I have been using --memory-f32 and the output has seemed subjectively better to me.
-
The fact that there's no measurable difference in perplexity makes me inclined to say this is most likely confirmation bias/placebo effect. If there's a real difference, it should be measurable.