[AMD] Always swap operands of mfma and use mfma.transposed layout #4767

Merged · 6 commits · Sep 30, 2024

Commits on Sep 27, 2024

  1. Always swap operands of mfma and use mfma.transposed layout

    For the original mfmaLayout, all elements owned by each thread lie
    along the M dim. Therefore, when storing results to global memory,
    a thread cannot issue vectorized global stores, since the result
    tensor is always N-minor.
    
    This PR swaps the operands of mfma instructions so that, in the
    result tensor, all elements owned by each thread lie along the N
    dim. Threads can then issue vectorized global stores, which reduces
    epilogue time. (A NumPy sketch of the underlying transpose identity
    follows this commit list.)
    
    We had already enabled swapping mfma operands for flash attention
    kernels, so that the result of the first dot can be kept in
    registers and used as the first operand of the second dot.
    
    For more details about swapping operands and how it works, see the
    presentation on the AMD backend from last year's Triton
    conference:
    
    Bringing Triton to AMD GPUs: https://www.youtube.com/watch?v=8o7Jhbv8xek&t=1s
    zhanglx13 committed Sep 27, 2024 · e7a8b3d
  2. Fix mfmaT unit tests

    zhanglx13 committed Sep 27, 2024 · 8a087ef
  3. Fix order-related issue with reduceOp

    In general, we should use getThreadOrder in most places where
    getOrder is called. Note that order and threadOrder can differ,
    and this is the case for the mfma.transposed layout. (A toy sketch
    of this distinction also follows this commit list.)
    zhanglx13 committed Sep 27, 2024 · 7ce8775
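A minimal NumPy sketch (illustrative only, not the backend implementation) of the transpose identity the operand swap relies on: A @ B == (B.T @ A.T).T. Computing the mfma with swapped operands and reading the accumulator through the mfma.transposed layout yields the same result tensor, only distributed so that each thread's elements run along the N dim:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 16)).astype(np.float32)
B = rng.standard_normal((16, 32)).astype(np.float32)

direct  = A @ B            # original operand order
swapped = (B.T @ A.T).T    # swapped operands, result read back transposed

# Both orderings produce the same tensor; only the layout of the
# accumulator across threads changes, which is what enables the
# vectorized global stores described in the commit message.
assert np.allclose(direct, swapped)
```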

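The order/threadOrder distinction behind the reduceOp fix can be modeled in plain Python. The geometry below (a 4x4 tile with one thread per row, and the names owner/reg) is an assumption for illustration, not Triton's real MFMA layout or its C++ helpers:

```python
import numpy as np

# Toy mfma.transposed-style tile (assumed geometry): each thread's
# registers run along N (dim 1), while consecutive threads run along
# M (dim 0).
M = N = 4
owner = np.empty((M, N), dtype=int)   # which thread owns each element
reg   = np.empty((M, N), dtype=int)   # which register slot holds it
for m in range(M):
    for n in range(N):
        owner[m, n] = m               # threads step along M
        reg[m, n]   = n               # registers step along N

# order       ~ [1, 0]: dim 1 varies fastest across a thread's registers
# threadOrder ~ [0, 1]: dim 0 varies fastest across threads
# A lowering (e.g. for reduceOp) that conflates the two picks the wrong
# traversal whenever they disagree, hence the fix to consult
# getThreadOrder where getOrder used to be called.
print(owner)
print(reg)
```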
Commits on Sep 28, 2024

  1. Add comments

    zhanglx13 committed Sep 28, 2024 · 07c890d

Commits on Sep 29, 2024

  1. Add more comments

    zhanglx13 committed Sep 29, 2024 · ae91ca7

Commits on Sep 30, 2024

  1. 668545c