[AMD] Always swap operands of mfma and use mfma.transposed layout #4767

zhanglx13 · 2024-09-20T04:06:00Z

This PR

- Fixed the issue with getOrder for mfma layout
- Fixed the issue with reduceOp when dealing with mfma.transposed layout

In general, getOrder and getThreadOrder can return different values, and this is the case for mfma.transposed layout. Therefore, we shouldn't assume order and threadOrder are always the same.

antiagainst

Could you explain in the commit message how the transpose is done (I think via the logic in the dot-to-LLVM conversion at the register level?) and how this is expected to improve global stores? It would help others understand why this change is being made.

include/triton/Dialect/TritonGPU/IR/Dialect.h

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

With the original mfmaLayout, all elements held by each thread lie along the M dim. Therefore, when storing results to global memory, a thread cannot do vectorized global stores, since the result tensor is always N-minor. This PR swaps the operands of mfma instructions so that, in the result tensor, all elements held by each thread lie along the N dim. Threads can now do vectorized global stores, which reduces the time spent in the epilogue. We had already enabled swapping mfma operands for flash attention kernels so that the results of the first dot can be kept in registers and used as the first operand of the second dot. For more details about swapping operands and how it works, please check the presentation about the AMD backend at last year's Triton conference, "Bringing Triton to AMD GPUs": https://www.youtube.com/watch?v=8o7Jhbv8xek&t=1s
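
As a rough illustration of the argument above (not the actual lowering code), the sketch below checks two things in plain Python: feeding the operands in swapped, transposed form yields the transpose of the original result, and four elements owned along N are contiguous in an N-minor (row-major) buffer while four elements owned along M are not. The 4x4 tile, the "one thread owns 4 elements" ownership pattern, and all helper names are simplifying assumptions made for this example; the real mfma register layouts are more involved.

```python
def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def flat_offsets(coords, n_cols):
    """Offsets of (row, col) coordinates in a row-major (N-minor) buffer."""
    return [r * n_cols + c for r, c in coords]


def is_contiguous(offsets):
    return all(b - a == 1 for a, b in zip(offsets, offsets[1:]))


M, N, K = 4, 4, 2
A = [[i * K + k + 1 for k in range(K)] for i in range(M)]  # M x K
B = [[k * N + j + 1 for j in range(N)] for k in range(K)]  # K x N

C = matmul(A, B)                                # M x N
C_swapped = matmul(transpose(B), transpose(A))  # N x M, equals C^T
swap_gives_transpose = C_swapped == transpose(C)

# Toy ownership model (an assumption): one thread owns 4 result elements.
owned_along_m = [(m, 0) for m in range(4)]  # original layout: along M
owned_along_n = [(0, n) for n in range(4)]  # after the swap: along N

m_offsets = flat_offsets(owned_along_m, N)  # strided by N in memory
n_offsets = flat_offsets(owned_along_n, N)  # adjacent in memory
```

Because the elements owned along N sit at adjacent offsets, a single wider store can cover them, which is the effect the PR is after in the epilogue.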

In general, we should use getThreadOrder in most places where getOrder is called. Note that order and threadOrder can be different, and this is the case for mfma.transposed layout.
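
To make the distinction concrete, here is a toy model (an assumption for illustration, not Triton's `getOrder` or `getThreadOrder` implementation) of a layout in which the fastest-varying dimension across a thread's registers differs from the fastest-varying dimension across consecutive threads. The `toy_layout` mapping and the `fastest_dim` helper are hypothetical names introduced only for this sketch.

```python
def toy_layout(thread, reg):
    """Hypothetical mapping from (thread id, register index) to (row, col).

    Thread t owns row t; its registers walk along the columns of that row.
    """
    return (thread, reg)


def fastest_dim(coords):
    """Return the dimension (0 = row, 1 = col) that changes between the
    first two coordinates in the sequence."""
    (r0, _c0), (r1, _c1) = coords[0], coords[1]
    return 0 if r1 != r0 else 1


# Fastest-varying dim as we step through one thread's registers.
reg_fastest = fastest_dim([toy_layout(0, r) for r in range(4)])

# Fastest-varying dim as we step through consecutive threads.
thread_fastest = fastest_dim([toy_layout(t, 0) for t in range(4)])
```

In this toy layout the registers advance along dim 1 while the threads advance along dim 0, so code that derives thread placement from the element order alone would index the wrong dimension.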

antiagainst

Cool, thanks for adding the comments! Much clearer now.

include/triton/Dialect/TritonGPU/IR/Dialect.h

antiagainst · 2024-09-30T04:05:58Z

The macOS failure seems to be related to infra issues. I also verified locally that compilation on macOS is fine, so merging.

antiagainst · 2024-09-30T05:20:39Z

macOS build fix at #4827

…iton-lang#4767)

This helps to improve writeout to use `global_store_dwordx2`. Along the way this PR

- Fixed the issue with getOrder for mfma layout
- Fixed the issue with reduceOp when dealing with mfma.transposed layout

In general, getOrder and getThreadOrder can return different values, and this is the case for mfma.transposed layout. Therefore, we shouldn't assume order and threadOrder are always the same.

zhanglx13 force-pushed the fix_order_mfma_transposed branch from 2065595 to c525ee0, September 20, 2024 19:24

antiagainst requested changes Sep 20, 2024

include/triton/Dialect/TritonGPU/IR/Dialect.h

third_party/amd/lib/TritonAMDGPUTransforms/AccelerateAMDMatmul.cpp

zhanglx13 force-pushed the fix_order_mfma_transposed branch from c525ee0 to b3a9fcd, September 26, 2024 22:21

zhanglx13 added 3 commits September 27, 2024 17:16

Fix mfmaT unit tests

8a087ef

fix order related issue with reduceOp

7ce8775

In general, we should use getThreadOrder in most places where getOrder is called. Note that order and threadOrder can be different, and this is the case for mfma.transposed layout.

zhanglx13 force-pushed the fix_order_mfma_transposed branch from b3a9fcd to 7ce8775, September 27, 2024 22:39

Add comments

07c890d

antiagainst approved these changes Sep 28, 2024

antiagainst marked this pull request as ready for review September 28, 2024 18:02

antiagainst requested review from Jokeren and ptillet as code owners September 28, 2024 18:02

zhanglx13 and others added 2 commits September 29, 2024 14:37

Add more comments

ae91ca7

Merge branch 'main' into fix_order_mfma_transposed

668545c

antiagainst merged commit 755077c into triton-lang:main Sep 30, 2024
6 of 7 checks passed

antiagainst mentioned this pull request Sep 30, 2024

[AMD] Implement dotOperandMfma to linear layout conversion #4817

Merged
