Quantizing manually #118

Draft
wants to merge 9 commits into llama_fp8

Conversation

@rohan-tan-bhowmik (Contributor) commented Jul 25, 2024

The output when I print dataset.root_theta._tree for llama-3b:

{'token_embd': {'weight': PrimitiveTensor(token_embd.weight, [32000, 3200], torch.float16),
                'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
 'blk': {
   '0': {'attn_q':      {'weight': PrimitiveTensor(blk.0.attn_q.weight,      [3200, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'attn_k':      {'weight': PrimitiveTensor(blk.0.attn_k.weight,      [3200, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'attn_v':      {'weight': PrimitiveTensor(blk.0.attn_v.weight,      [3200, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'attn_output': {'weight': PrimitiveTensor(blk.0.attn_output.weight, [3200, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'ffn_gate':    {'weight': PrimitiveTensor(blk.0.ffn_gate.weight,    [8640, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'ffn_down':    {'weight': PrimitiveTensor(blk.0.ffn_down.weight,    [3200, 8640], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'ffn_up':      {'weight': PrimitiveTensor(blk.0.ffn_up.weight,      [8640, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'attn_norm':   {'weight': PrimitiveTensor(blk.0.attn_norm.weight,   [3200], torch.float32), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'ffn_norm':    {'weight': PrimitiveTensor(blk.0.ffn_norm.weight,    [3200], torch.float32), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
         'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
   '1' … '25': same structure as block '0' (identical shapes, dtypes, and per-tensor q_input entries; only the block index changes),
   'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
 'output_norm': {'weight': PrimitiveTensor(output_norm.weight, [3200], torch.float32), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
 'output': {'weight': PrimitiveTensor(output.weight, [32000, 3200], torch.float16), 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)},
 'q_input': DynamicScaledQuantizer(q_input) -> dtype=torch.float8_e4m3fn)}
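
For reference, a minimal sketch of how a dump like this can be produced, assuming sharktank's Dataset API; the file path is a placeholder:

from sharktank.types import Dataset

# Placeholder path for the quantized llama-3b dataset produced by this branch.
dataset = Dataset.load("/tmp/llama-3b-fp8.irpa")
# root_theta._tree is the nested dict of PrimitiveTensors and quantizers shown above.
print(dataset.root_theta._tree)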

@dan-garvey (Member) left a comment

After seeing this and thinking about it, how I would approach your task generally is:

1. Look at how llama is generated from weights under models.
2. Make a small case where you only have attn weights and generate just the attention part.
3. Play with the quantization like you're doing here.
4. Finish a small test case (see the sketch below).
5. Actually try incorporating it into a model.

@rsuderman do you think that makes sense?
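
A hypothetical, torch-only illustration of steps 2 through 4 (requires torch >= 2.1 for float8 support; the shapes, names, and per-tensor dynamic-scale scheme are illustrative, not this PR's implementation):

import torch

# Stand-ins for one attention projection from the dump above.
w = torch.randn(3200, 3200, dtype=torch.float16)
x = torch.randn(4, 3200, dtype=torch.float16)

# Dynamic per-tensor scale: map the largest |weight| onto the fp8 e4m3fn max (448.0).
scale = w.abs().max().float() / 448.0
w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)

# Eager mode has no fp8 matmul, so dequantize for the reference comparison.
ref = x.float() @ w.float().T
out = x.float() @ (w_fp8.float() * scale).T
print((ref - out).abs().max())  # expect a small quantization error, not garbage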

parser.add_argument(
    "--use-fp8-quantization",
    help="DType to use for activations in the model",
    default="false",

Check out argparse.BooleanOptionalAction or something like that; then you don't have to do all the post-process parsing like line 241.
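
A minimal sketch of that suggestion, assuming Python 3.9+ (where argparse.BooleanOptionalAction was added); the flag name mirrors the diff above:

import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction generates --use-fp8-quantization / --no-use-fp8-quantization
# and stores a real bool, so no string post-processing is needed.
parser.add_argument(
    "--use-fp8-quantization",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Quantize model activations to fp8",
)
args = parser.parse_args(["--use-fp8-quantization"])
assert args.use_fp8_quantization is True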

@@ -201,6 +201,20 @@ def pad_block_ids(self) -> torch.Tensor:
        return torch.tensor(rows, device=self.parent.model.device)


def quantize_theta(theta):
    if isinstance(theta, Theta) or isinstance(theta, dict):
        if "q_input" not in (theta._tree if isinstance(theta, Theta) else theta):

You can probably do all this isinstance stuff once at the beginning by assigning the result to a variable; much more readable.
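
A hypothetical version of that refactor (the body is elided because the diff above is truncated):

from sharktank.types import Theta  # as used elsewhere in this PR

def quantize_theta(theta):
    # Resolve the underlying dict once, instead of branching on isinstance everywhere.
    tree = theta._tree if isinstance(theta, Theta) else theta
    if not isinstance(tree, dict):
        return theta  # leaf node; nothing to recurse into
    if "q_input" not in tree:
        ...  # rest of the original logic, now operating on `tree`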
