
[AMD] Always swap operands of mfma and use mfma.transposed layout #4767

Merged: 6 commits, Sep 30, 2024

Conversation

@zhanglx13 (Collaborator) commented Sep 20, 2024

This PR

  • Fixes the issue with getOrder for the mfma layout
  • Fixes the issue with reduceOp when dealing with the mfma.transposed layout

In general, getOrder and getThreadOrder can return different values, and this is the case for mfma.transposed layout. Therefore, we shouldn't assume order and threadOrder are always the same.
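To make the distinction concrete, here is a minimal Python sketch (hypothetical names and illustrative order values, not Triton's actual C++ API): a layout carries both a register-level dimension order and a per-thread dimension order, and code that queries only one of them silently assumes they match.

```python
# Hypothetical sketch, not Triton's real API. A layout describes both how
# one thread's registers are ordered across dims (order) and how threads
# are arranged across the tile (thread_order). The orders are lists of
# dim indices, fastest-varying first; the values below are illustrative.

class Layout:
    def __init__(self, order, thread_order=None):
        self.order = order
        # If no separate thread order is given, fall back to `order` --
        # exactly the assumption that breaks for mfma.transposed.
        self.thread_order = thread_order if thread_order is not None else order

mfma = Layout(order=[1, 0])                         # orders agree
mfma_transposed = Layout(order=[0, 1], thread_order=[1, 0])  # orders differ

assert mfma.order == mfma.thread_order
assert mfma_transposed.order != mfma_transposed.thread_order
```

A consumer such as a reduction lowering must therefore query `thread_order` when it cares about how threads traverse the tile, rather than reusing `order`.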

@antiagainst (Collaborator) left a comment:

Could you explain in the commit message more about how the transpose is done (I think via the logic in the dot-to-LLVM conversion at register level?) and how it is expected to improve global stores? It would be good for others to understand why this change is needed.

For the original mfmaLayout, all elements for each thread are along the M
dim. Therefore, when storing results to global memory, each thread
cannot do vectorized global stores since the result tensor is always
N-minor.

This PR swaps the operands of mfma instructions; as a result, all
elements for each thread in the result tensor are along the N dim, so
threads can now do vectorized global stores. This reduces the time
spent in the epilogue.

We already enabled swapping mfma operands for flash attention kernels
so that the result of the first dot can be kept in registers and used
as the first operand of the second dot.

For more details about swapping operands and how it works, please
see the presentation about the AMD backend at last year's Triton
conference:

Bringing Triton to AMD GPUs: https://www.youtube.com/watch?v=8o7Jhbv8xek&t=1s

In general, we should use getThreadOrder in most places where getOrder
is called. Note that order and threadOrder can be different, and this
is the case for mfma.transposed layout.
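The memory-contiguity argument above can be sketched in a few lines of Python (assumed tile sizes and element counts, for illustration only): in a row-major (N-minor) M x N result tensor, a thread's elements along N are adjacent in memory and can be stored with one wide instruction, while elements along M are strided by N.

```python
# Illustrative sketch of why the swap enables vectorized stores.
# Shapes and ELEMS_PER_THREAD are assumed values, not the real mfma tiling.

M, N, ELEMS_PER_THREAD = 32, 32, 4

def flat_offsets(owned):
    # Linear memory offsets of a thread's (m, n) elements in a
    # row-major (N-minor) M x N tensor.
    return [m * N + n for (m, n) in owned]

def is_contiguous(offs):
    return all(b - a == 1 for a, b in zip(offs, offs[1:]))

# Original mfma layout: one thread's elements run along M.
along_m = [(m, 0) for m in range(ELEMS_PER_THREAD)]
# After swapping operands (mfma.transposed): elements run along N.
along_n = [(0, n) for n in range(ELEMS_PER_THREAD)]

assert not is_contiguous(flat_offsets(along_m))  # strided by N: scalar stores
assert is_contiguous(flat_offsets(along_n))      # contiguous: one wide store
```

With contiguous offsets, the backend can merge several scalar stores into a single wide store such as `global_store_dwordx2`.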
@antiagainst (Collaborator) left a comment:

Cool, thanks for adding the comments! Much clearer now.

(Review threads on include/triton/Dialect/TritonGPU/IR/Dialect.h resolved)
@antiagainst antiagainst marked this pull request as ready for review September 28, 2024 18:02
@antiagainst (Collaborator) commented:

The macOS failure seems to be related to infra issues. Also, I verified locally that compilation on macOS is fine. So merging.

@antiagainst antiagainst merged commit 755077c into triton-lang:main Sep 30, 2024
6 of 7 checks passed
@antiagainst (Collaborator) commented:

macOS build fix at #4827

Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024

[AMD] Always swap operands of mfma and use mfma.transposed layout (triton-lang#4767)

This helps to improve writeout to use `global_store_dwordx2`.

Along the way this PR
- Fixed the issue with getOrder for mfma layout
- Fixed the issue with reduceOp when dealing with mfma.transposed layout

In general, getOrder and getThreadOrder can return different values, and
this is the case for mfma.transposed layout. Therefore, we shouldn't
assume order and threadOrder are always the same.