Refactor custom FPx cast #363

Merged on Jun 17, 2024 (11 commits into pytorch:main)
Conversation

@gau-nernst (Collaborator) commented Jun 14, 2024

Closes #354

TODO:

  • Check torch.compile
  • Benchmark before and after: python torchao/prototype/mx_formats/benchmarks/bench_qdq.py

8841094 (main)

| elem_dtype | use_fp4_custom_triton_dequant_kernel | q_time_us | q_mem_bw_tb_s | dq_time_us | dq_mem_bw_tb_s |
| --- | --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | False | 532.20 | 0.26 | 554.40 | 0.25 |
| torch.float8_e5m2 | False | 532.83 | 0.26 | 551.01 | 0.25 |
| fp6_e2m3 | False | 574.66 | 0.24 | 258.41 | 0.53 |
| fp6_e3m2 | False | 577.37 | 0.24 | 258.86 | 0.53 |
| fp4_e2m1 | False | 682.13 | 0.17 | 254.72 | 0.45 |
| fp4_e2m1 | True | 12251.34 | 0.01 | 190.39 | 0.60 |

2690b92 (this PR)

| elem_dtype | use_fp4_custom_triton_dequant_kernel | q_time_us | q_mem_bw_tb_s | dq_time_us | dq_mem_bw_tb_s |
| --- | --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | False | 531.62 | 0.26 | 552.91 | 0.25 |
| torch.float8_e5m2 | False | 530.52 | 0.26 | 550.33 | 0.25 |
| fp6_e2m3 | False | 572.89 | 0.24 | 551.21 | 0.25 |
| fp6_e3m2 | False | 576.62 | 0.24 | 551.92 | 0.25 |
| fp4_e2m1 | False | 680.27 | 0.17 | 255.10 | 0.45 |
| fp4_e2m1 | True | 12248.68 | 0.01 | 191.09 | 0.60 |

Dequant is 2x slower because I replaced the LUT-based denormal handling with more generic logic. @vkuzo Should I add back the LUT-based logic (special-casing E2M3, E3M2, and E2M1)? If performance matters, perhaps we can instead generate a LUT for all bit patterns and cache it.
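
For illustration, here is a minimal sketch of that idea: precompute the float32 value of every possible FPx bit pattern once, cache it per format and device, and turn dequant into a single gather. The import path and helper names below are assumptions, not code from this PR.

```python
import torch

# Assumed location of the generic cast; the actual module path may differ.
from torchao.prototype.custom_fp_utils import _fpx_unpacked_to_f32

_DEQUANT_LUTS = {}

def _get_dequant_lut(ebits: int, mbits: int, device: torch.device) -> torch.Tensor:
    """Build (once per format and device) a table mapping every FPx bit pattern to float32."""
    key = (ebits, mbits, device)
    if key not in _DEQUANT_LUTS:
        nbits = 1 + ebits + mbits  # sign + exponent + mantissa bits
        codes = torch.arange(2 ** nbits, dtype=torch.uint8, device=device)
        _DEQUANT_LUTS[key] = _fpx_unpacked_to_f32(codes, ebits, mbits)
    return _DEQUANT_LUTS[key]

def dequant_with_lut(x_unpacked: torch.Tensor, ebits: int, mbits: int) -> torch.Tensor:
    """Dequantize unpacked FPx codes with one gather from the cached LUT."""
    lut = _get_dequant_lut(ebits, mbits, x_unpacked.device)
    return lut[x_unpacked.long()]
```

Whether the gather actually beats the bit-shifting path would still need to be measured with bench_qdq.py above.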

UPDATE

95f4582 (this PR v2)

| elem_dtype | use_fp4_custom_triton_dequant_kernel | q_time_us | q_mem_bw_tb_s | dq_time_us | dq_mem_bw_tb_s |
| --- | --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | False | 531.54 | 0.26 | 554.03 | 0.25 |
| torch.float8_e5m2 | False | 532.41 | 0.26 | 551.63 | 0.25 |
| fp6_e2m3 | False | 574.66 | 0.24 | 258.28 | 0.53 |
| fp6_e3m2 | False | 576.53 | 0.24 | 258.76 | 0.53 |
| fp4_e2m1 | False | 682.26 | 0.17 | 517.65 | 0.22 |
| fp4_e2m1 | True | 12247.99 | 0.01 | 190.48 | 0.60 |

Now FP4_E2M1 dequant is the slow one. I feel this should be bandwidth-limited, but it might be register-limited as well. I will do some profiling and make sure torch.compile runs optimally. It is also interesting that native PyTorch float8 dequant is slower.
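
As a reference for the profiling step mentioned above, a minimal sketch using standard PyTorch tooling (not code from this PR; `dequant_fn` and `x` are placeholders):

```python
import torch

def profile_dequant(dequant_fn, x: torch.Tensor) -> None:
    # fullgraph=True turns any graph break into an error instead of a silent eager fallback
    fn = torch.compile(dequant_fn, fullgraph=True)
    fn(x)  # warm-up call so compilation happens outside the profiled region
    with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
        for _ in range(10):
            fn(x)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

Running the script with TORCH_LOGS="output_code" also dumps the Triton code Inductor generates, which helps confirm the cast fuses into a single kernel.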

UPDATE 2

dcd5a05 (this PR v3)

| elem_dtype | use_fp4_custom_triton_dequant_kernel | q_time_us | q_mem_bw_tb_s | dq_time_us | dq_mem_bw_tb_s |
| --- | --- | --- | --- | --- | --- |
| torch.float8_e4m3fn | False | 532.65 | 0.26 | 554.13 | 0.25 |
| torch.float8_e5m2 | False | 532.07 | 0.26 | 551.30 | 0.25 |
| fp6_e2m3 | False | 574.76 | 0.24 | 258.48 | 0.53 |
| fp6_e3m2 | False | 576.85 | 0.24 | 258.36 | 0.53 |
| fp4_e2m1 | False | 681.22 | 0.17 | 254.14 | 0.45 |
| fp4_e2m1 | True | 12249.79 | 0.01 | 190.49 | 0.60 |

Speed recovered 😊

pytorch-bot commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/363

✅ No Failures

As of commit bd64efc with merge base 664f073:
💚 Looks good so far! There are no failures yet. 💚

facebook-github-bot added the "CLA Signed" label on Jun 14, 2024
gau-nernst requested a review from vkuzo on June 15, 2024 at 00:58
@vkuzo (Contributor) commented Jun 15, 2024

> Dequant is 2x slower because I replaced the LUT-based denormal handling with more generic logic.

2x is a sizeable regression. How about keeping the LUT for the formats we already have it for, and using a generic fallback for the other formats? People can then optimize format by format individually if they want.
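
For illustration, one possible shape of that suggestion as a hedged sketch; all names below are hypothetical except `_fpx_unpacked_to_f32`, whose import path is also an assumption:

```python
import torch

# Assumed generic bit-shifting path; exact module path may differ.
from torchao.prototype.custom_fp_utils import _fpx_unpacked_to_f32

# The 16 values of fp4_e2m1 (1 sign, 2 exponent, 1 mantissa bit), indexed by bit pattern.
_E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

# Formats that already have a hand-tuned fast path keep it; everything else falls back.
_FAST_PATHS = {
    (2, 1): lambda x: _E2M1_LUT.to(x.device)[x.long()],  # fp4_e2m1
}

def dequant(x_unpacked: torch.Tensor, ebits: int, mbits: int) -> torch.Tensor:
    fast = _FAST_PATHS.get((ebits, mbits))
    if fast is not None:
        return fast(x_unpacked)
    return _fpx_unpacked_to_f32(x_unpacked, ebits, mbits)  # generic fallback
```

As the next comment shows, the generic path ended up fast enough that this split was not needed.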

@gau-nernst (Collaborator, Author) commented Jun 15, 2024

@vkuzo I have updated the dequant denormal implementation. There is no speed regression anymore (I updated the results in the first post). I didn't need to use the hard-coded LUT from your implementation: if the torch compiler does constant folding and loop unrolling properly, my implementation should match your previous one exactly.
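
To illustrate the constant-folding point (a simplified sketch, not the actual code in this PR): because `ebits` and `mbits` are plain Python ints, a loop over them is unrolled at trace time and every shift and scale becomes a compile-time constant, so the compiled generic path can end up equivalent to a hand-written per-format version.

```python
import torch

def _denormal_to_f32_sketch(mantissa: torch.Tensor, ebits: int, mbits: int) -> torch.Tensor:
    """Convert the mantissa field of denormal FPx values to float32.

    The loop bound is a Python int, so torch.compile unrolls it and folds each
    2.0 ** (...) factor into a constant.
    """
    exp_bias = (1 << (ebits - 1)) - 1
    result = torch.zeros_like(mantissa, dtype=torch.float32)
    for i in range(mbits):  # unrolled at trace time
        bit = ((mantissa >> i) & 1).float()
        # mantissa bit i contributes 2^(i - mbits) * 2^(1 - exp_bias)
        result = result + bit * 2.0 ** (i - mbits + 1 - exp_bias)
    return result
```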

If possible, could you benchmark on your GPUs to make 100% sure there is no regression?

gau-nernst marked this pull request as ready for review on June 15, 2024 at 07:35
@vkuzo (Contributor) commented Jun 17, 2024

Here are the results on an H100: https://gist.github.com/vkuzo/324256b8defd0231852a23cbb34f49a6. I see no meaningful change in performance, awesome stuff.


def _fpx_unpacked_to_f32(x: Tensor, ebits: int, mbits: int) -> Tensor:
"""
TODO(future): check if LUT for everything is faster than bit shifting,
A reviewer (Contributor) commented:
is this comment still relevant?

maybe add a docblock?

@gau-nernst (Collaborator, Author) replied Jun 17, 2024:

Using a LUT for everything in dequant might be faster, like the current NF4 implementation. I haven't benchmarked it, so I'm not sure.
I didn't add a docblock here since this is kind of an internal function, but a simple doc won't hurt. I will add some documentation for this function and the quant function above. I already added a short description for these two functions at the top of the file.
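
For reference, one possible shape of such a docblock (illustrative only; the wording that actually landed in the PR may differ):

```python
from torch import Tensor

def _fpx_unpacked_to_f32(x: Tensor, ebits: int, mbits: int) -> Tensor:
    """Convert unpacked FPx values to float32.

    Args:
        x: uint8 tensor holding one FPx value per element in its low
           1 + ebits + mbits bits (sign, exponent, mantissa).
        ebits: number of exponent bits of the FPx format.
        mbits: number of mantissa bits of the FPx format.

    Returns:
        A float32 tensor with the same shape as ``x``.
    """
    ...
```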

F32_EXP_BIAS = _n_ones(EBITS_F32 - 1)


def _f32_to_fpx_unpacked(x: Tensor, ebits: int, mbits: int) -> Tensor:
A reviewer (Contributor) commented:

should we have a docblock?

msaroufim merged commit eb1511e into pytorch:main on Jun 17, 2024
13 checks passed
gau-nernst deleted the custom_fpx branch on June 17, 2024 at 15:21
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* refactor custom fp cast

* add dequant

* small formating

* compile with fullgraph=True

* add fullgraph=true

* undo

* add another version

* fast path for mbits=1

* add back docstring
Labels: CLA Signed

Successfully merging this pull request may close these issues: Make custom FPx dtype conversion easier to use (#354)

4 participants