Improve primitives for FP6 quant #248

Merged · 90 commits · May 25, 2024

Conversation

@gau-nernst (Collaborator) commented May 16, 2024

Address #208

TODO:

  • FP32/FP16/BF16 -> FP6
  • FP6 -> FP32/FP16/BF16
  • Add tests

On a (8192, 8192) tensor. Ryzen 5600 and 4070 Ti SUPER.

| device | dtype | op | time (ms) |
|--------|-----------|---------------------------|-----------|
| CPU | FP16->FP6 | original | 1140.27 |
| CPU | FP16->FP6 | ours | 384.479 |
| CPU | FP16->FP6 | original (num_threads=4) | 977.523 |
| CPU | FP16->FP6 | ours (num_threads=4) | 98.3557 |
| CPU | FP32->FP6 | original | 1033.14 |
| CPU | FP32->FP6 | ours | 374.142 |
| CPU | FP32->FP6 | original (num_threads=4) | 934.211 |
| CPU | FP32->FP6 | ours (num_threads=4) | 95.7996 |
| CUDA | FP16->FP6 | ours | 0.325222 |
| CUDA | FP32->FP6 | ours | 0.639134 |

NOTE:

  • original is torchao.ops.fp16_to_fp6_original() (from the original FP6-LLM repo plus qtorch quantization logic). It does not support CUDA.
  • On CPU there is a faster algorithm using only bit shifts, but it cannot be implemented efficiently with PyTorch + torch.compile (a rough sketch of the general conversion logic is shown below).
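
For orientation, here is a minimal, unoptimized pure-PyTorch sketch of what an FP32 -> FP6 E3M2 conversion involves (1 sign, 3 exponent, 2 mantissa bits, exponent bias 3, no inf/NaN). This is not the kernel added in this PR; the function name, the clamp-to-28.0 saturation behaviour, and the log2-based rounding are illustrative assumptions.

```python
import torch
from torch import Tensor

def f32_to_f6_e3m2_unpacked_sketch(x: Tensor) -> Tensor:
    """Hypothetical reference: one FP6 E3M2 code per uint8 (bits 5..0), no bit packing.
    Out-of-range magnitudes are clamped to 28.0 (the largest representable value)."""
    assert x.dtype == torch.float32
    EBITS, MBITS, BIAS = 3, 2, 3
    sign = torch.signbit(x).to(torch.uint8) << (EBITS + MBITS)
    mag = x.abs().clamp(max=28.0)

    # Snap the magnitude to the FP6 grid with round-to-nearest-even.
    # ULP is 2**(exp - 2) for normals and 2**-4 in the subnormal range (exp clamped to -2).
    exp = torch.floor(torch.log2(mag.clamp(min=2**-6))).clamp(min=-2)
    ulp = torch.exp2(exp - MBITS)
    q = torch.round(mag / ulp) * ulp  # quantized magnitude, still float32

    # Re-derive exponent/mantissa bits from the quantized magnitude
    # (rounding may have pushed the value up into the next binade).
    exp = torch.floor(torch.log2(q.clamp(min=2**-6))).clamp(min=-2)
    e_bits = torch.where(q < 2**-2, torch.zeros_like(exp), exp + BIAS).to(torch.uint8)
    m_bits = torch.round(q / torch.exp2(exp - MBITS)).to(torch.uint8) & 0b11
    return sign | (e_bits << MBITS) | m_bits
```

The second log2 pass is there because rounding can carry the mantissa into the next binade (e.g. 0.49 rounds up to 0.5); the faster bit-shift approach mentioned above avoids the float math entirely by working directly on the FP16 bit pattern.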

(8192, 8192) FP6 input. Ryzen 5600 and 4070 Ti SUPER.

| device | dtype | op | time (ms) |
|--------|-----------|---------------------------|-----------|
| CPU | FP6->FP32 | original | 372.076 |
| CPU | FP6->FP32 | ours | 127.714 |
| CPU | FP6->FP32 | original (num_threads=4) | 375.183 |
| CPU | FP6->FP32 | ours (num_threads=4) | 44.1857 |
| CUDA | FP6->FP32 | ours | 0.572355 |

NOTE:

  • fp6_weight_dequant() (the original implementation) is slow, probably because the author uses the CUDA intrinsics __float2half() and __half2float() on CPU, where they have to be emulated via bit manipulation (a pure-PyTorch sketch of the dequant logic is shown below).
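
For completeness, a minimal pure-PyTorch sketch of the reverse direction (unpacked FP6 E3M2 code held in a uint8 -> float32), ignoring bit packing and the per-row scale. Again, this is not the PR's C++/CUDA kernel; the function name and the unpacked-uint8 layout are assumptions for illustration.

```python
import torch
from torch import Tensor

def f6_e3m2_unpacked_to_f32_sketch(codes: Tensor) -> Tensor:
    """Hypothetical reference: one FP6 E3M2 code per uint8 (bits 5..0) -> float32."""
    assert codes.dtype == torch.uint8
    s = ((codes >> 5) & 0b1).to(torch.float32)
    e = ((codes >> 2) & 0b111).to(torch.float32)
    m = (codes & 0b11).to(torch.float32)
    sign = 1.0 - 2.0 * s                              # +1 or -1
    normal = torch.exp2(e - 3.0) * (1.0 + m / 4.0)    # e > 0: 2**(e - bias) * 1.m
    subnormal = m / 16.0                              # e == 0: 2**(1 - bias) * (m / 4)
    return sign * torch.where(e > 0, normal, subnormal)
```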

pytorch-bot commented May 16, 2024

🔗 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/248

✅ No failures as of commit 78e79ac with merge base a7bc592.

return torch.stack([bits0, bits1, bits2], dim=-1).flatten(-2)


def to_fp6(tensor: Tensor, no_bit_packing: bool = False) -> Tensor:
Contributor:

nit: thoughts about naming your dtype float6_e3m2 instead of fp6? This is to be consistent with naming for other PyTorch low precision dtypes such as float8_e4m3|e5m2 from PyTorch core as well as the upcoming MX dtypes, which include float6_e3m2 and float6_e2m3.

Collaborator Author (gau-nernst):

I was thinking the same thing too! Will update the name.

Collaborator Author (gau-nernst):

Where can I read more about MX dtypes? The particular FP6 format used by the FP6-LLM paper does not represent +/-inf and NaN, so I'm not sure if we should signal that in the name somehow too (like float8_e4m3fn)?

Contributor:

You can check out https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf; page 12 describes the supported float6 flavors. I plan to add the MX code in torchao soon.

For the fn suffix, I'm planning to follow the OCP spec naming, which does not include qualifiers for special value handling, and replace fp with float to be consistent with other PyTorch dtype names. I think the fn suffix made sense for float8, where different flavors had different special value handling, but none of these sub-8-bit dtypes support special values.

Collaborator Author (gau-nernst):

Cool! It seems like the FP6 I used here is exactly the same as MX FP6 E3M2 (without the scale: the FP6-LLM authors use one scale per row). Perhaps in the future the MX dtype can replace this.
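
As an illustrative aside (not the FP6-LLM implementation itself), per-row scaling on top of the E3M2 conversion could look roughly like this. The function names, and the choice to map each row's max magnitude to the FP6 maximum of 28.0, are assumptions; it reuses the conversion sketches above.

```python
import torch

FP6_E3M2_MAX = 28.0  # largest representable FP6 E3M2 magnitude

def fp6_quantize_rowwise_sketch(weight: torch.Tensor):
    # One scale per row, chosen so the row's largest magnitude maps to FP6_E3M2_MAX.
    scale = weight.float().abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP6_E3M2_MAX
    codes = f32_to_f6_e3m2_unpacked_sketch(weight.float() / scale)  # sketch from above
    return codes, scale.squeeze(-1).half()

def fp6_dequantize_rowwise_sketch(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Inverse: decode the FP6 codes, then re-apply the per-row scale.
    return f6_e3m2_unpacked_to_f32_sketch(codes) * scale.float().unsqueeze(-1)
```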

@@ -14,6 +8,13 @@
from . import _C
Collaborator Author (gau-nernst):

Need to import _C first since to/from_float6_e3m2() (from dtypes) calls the C++ extension for CPU.
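
Roughly, the constraint described above looks like this (a sketch, not the exact torchao/__init__.py contents; the re-exported names are taken from the comment above and are assumptions):

```python
# torchao/__init__.py (sketch of the import-order constraint)
from . import _C  # load the C++ extension first so its CPU ops get registered

# these call into the ops registered by _C when running on CPU
from .dtypes import to_float6_e3m2, from_float6_e3m2
```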

@@ -120,49 +119,14 @@ void weight_prepacking_fp16_to_fp6(uint16_t* weight_16bit,
}
}

void DeQuantMatrix_FP6_To_FP16(half* A_16bit_h, unsigned char* A_6bit_h, size_t M, size_t K, half* scale) {
Collaborator Author (gau-nernst):

Replaced with from_float6_e3m2()

@msaroufim merged commit 4ca3985 into pytorch:main on May 25, 2024
13 checks passed
@gau-nernst deleted the fp6_quant branch on May 25, 2024
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request on Jul 31, 2024