
Improve performance of tt.load and tt.store for FP8 when converting block ptr to regular ptrs #2374

Open
etiotto opened this issue Sep 27, 2024 · 1 comment · May be fixed by #2502

etiotto commented Sep 27, 2024

We would like to remove the RewriteTensorPointer pass, which rewrites block pointers into regular pointers (except when it determines that load/store operations on block pointers can be converted to 2D block reads/writes). The idea is to avoid losing semantic information too early, and instead to handle a block pointer that cannot be used to generate 2D block reads/writes while lowering the load/store operation itself.
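
To make the tradeoff concrete, here is a minimal sketch (illustrative kernels, not code from this repository) of the same 2D tile copy written with a block pointer and in the regular-pointer form that RewriteTensorPointer effectively produces:

```python
import triton
import triton.language as tl


@triton.jit
def copy_tile_block_ptr(src_ptr, dst_ptr, M, N, stride_m, stride_n,
                        BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Block pointers keep shape/stride/offset information in the IR, so the
    # backend can lower tt.load/tt.store to 2D block reads/writes when possible.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    src = tl.make_block_ptr(base=src_ptr, shape=(M, N),
                            strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    dst = tl.make_block_ptr(base=dst_ptr, shape=(M, N),
                            strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(src, boundary_check=(0, 1))
    tl.store(dst, tile, boundary_check=(0, 1))


@triton.jit
def copy_tile_regular_ptr(src_ptr, dst_ptr, M, N, stride_m, stride_n,
                          BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # The rewritten form: explicit offsets and masks. The structured (blocked)
    # view of the access is gone, which is the semantic loss described above.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offsets = offs_m[:, None] * stride_m + offs_n[None, :] * stride_n
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tile = tl.load(src_ptr + offsets, mask=mask, other=0.0)
    tl.store(dst_ptr + offsets, tile, mask=mask)
```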

For this scheme to work, we first need to improve the lowering code for tt.load and tt.store operations that use a block pointer whose element type is not (currently) supported by the 2D block read instructions available on the target GPU (e.g. FP8).
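
As a concrete example of the case in question, the block-pointer kernel sketched above already hits this path once the tensors are FP8. The device string and the torch.float8_e4m3fn dtype below are assumptions for illustration; the point is only that the element type of the block pointer is FP8, so this particular load/store pair cannot become 2D block reads/writes and must fall back to regular accesses during lowering:

```python
import torch
import triton

M, N = 1024, 1024
BLOCK_M, BLOCK_N = 64, 64

# FP8 tensors (converted from FP16); device and dtype assumed for illustration.
src = torch.randn(M, N, device="xpu", dtype=torch.float16).to(torch.float8_e4m3fn)
dst = torch.empty_like(src)

grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
copy_tile_block_ptr[grid](src, dst, M, N, src.stride(0), src.stride(1),
                          BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
```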

See #2359 (comment) for more context.


etiotto commented Oct 9, 2024

The first step is to improve axis analysis and add support for blocked pointers to it (#2451).
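
As a purely conceptual sketch of what that involves (hypothetical Python stand-ins, not the backend's actual axis-analysis code), these are the per-dimension facts the analysis tracks, and why a block pointer can supply part of them directly instead of having them re-derived from flat pointer arithmetic:

```python
from dataclasses import dataclass


@dataclass
class AxisInfo:
    contiguity: tuple[int, ...]    # longest contiguous run along each dimension
    divisibility: tuple[int, ...]  # largest power-of-two divisor of the offsets
    constancy: tuple[int, ...]     # longest run of identical values per dimension


def block_ptr_contiguity(block_shape, strides):
    # A block pointer records shape and strides explicitly, so the unit-stride
    # dimension is known to be contiguous across the whole block; the other
    # dimensions contribute runs of length 1.
    return tuple(dim if stride == 1 else 1
                 for dim, stride in zip(block_shape, strides))


# Example: a 32x64 tile of a row-major matrix -> inner dimension fully contiguous.
print(block_ptr_contiguity((32, 64), (64, 1)))  # (1, 64)
```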

etiotto linked a pull request Oct 16, 2024 that will close this issue