-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMD] Always swap operands of mfma and use mfma.transposed layout #4767
[AMD] Always swap operands of mfma and use mfma.transposed layout #4767
Conversation
2065595
to
c525ee0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain more how the transpose is done (I think via the logic in dot to llvm conversion at register level?) and how this is expected to improve global store in the commit message? good for others to understand why this change.
c525ee0
to
b3a9fcd
Compare
For original mfmaLayout, all elements for each thread are along the M dim. Threfore, when storing results to global memory, each thread cannot do vectorized global store since the result tensor is always N-minor. This PR swaps the operands of mfma instructions, and its effect is that in the result tensor, all elements for each thread are along the N dim. Now threads can do vectorized global store. And this can reduce the time of the epilogue. We already enabled swapping mfma operands for flash attention kernels so that the results of the first dot can be kept in register and used as the 1st operand of the second dot. For more details about swapping operands and how it works, please check the presentation about AMD backend at last year's triton conference: Bringing Triton to AMD GPUs: https://www.youtube.com/watch?v=8o7Jhbv8xek&t=1s
In general, we should use getThreadOrder in most places where getOrder is called. Note that order and threadOrder can be different, and this is the case for mfma.transposed layout.
b3a9fcd
to
7ce8775
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for adding the comments! Much clear now.
The macos failure seems to be relating to infra issues. Also I verified locally compilation on macos is fine. So merging. |
macOS build fix at #4827 |
…iton-lang#4767) This helps to improve writeout to use `global_store_dwordx2`. Along the way this PR - Fixed the issue with getOrder for mfma layout - Fixed the issue with reduceOp when dealing with mfma.transposed layout In general, getOrder and getThreadOrder can return different values, and this is the case for mfma.transposed layout. Therefore, we shouldn't assume order and threadOrder are always the same.
This PR
In general, getOrder and getThreadOrder can return different values, and this is the case for mfma.transposed layout. Therefore, we shouldn't assume order and threadOrder are always the same.