
[RFC] Goal for triton.ops.flash_attention #2267

Open
EPronovost opened this issue Sep 8, 2023 · 3 comments

@EPronovost
Contributor

Hi! The flash attention implementation is really helpful as a reference. I noticed that the code currently makes some assumptions (e.g. about shapes and strides) and can silently produce incorrect results when those assumptions are violated.
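
To make the concern concrete, here is a minimal sketch of the kind of pre-flight validation that would turn silent wrong answers into loud errors. The helper name and the exact invariants (4D layout, matching sequence lengths, contiguous head dimension) are assumptions for illustration, not the kernel's actual contract:

```python
import torch

def check_flash_attention_inputs(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> None:
    # Hypothetical guard: fail loudly instead of silently returning garbage.
    # The invariants below illustrate the kind of undocumented assumptions
    # the kernel makes; they are not its documented contract.
    for name, t in (("q", q), ("k", k), ("v", v)):
        if t.dim() != 4:
            raise ValueError(f"{name}: expected (batch, heads, seq, head_dim), got {t.dim()}D")
        if t.stride(-1) != 1:
            raise ValueError(f"{name}: head_dim must be contiguous (stride 1)")
    if not (q.shape[:2] == k.shape[:2] == v.shape[:2]):
        raise ValueError("q, k, v must share batch and head dimensions")
    if q.shape[-1] != k.shape[-1]:
        raise ValueError("q and k must have the same head_dim")
    if k.shape != v.shape or q.shape[2] != k.shape[2]:
        # e.g. a different number of queries and keys is not covered
        raise ValueError("mismatched sequence lengths are not supported")
```

With a guard like this at the public entry point, unsupported cases would at least fail fast instead of producing incorrect outputs.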

I've seen some related issues and PRs (e.g. #2033, #2029, #2046, #2086) and am not clear on the intended goals of this code. Two possibilities I can imagine:

  1. [For developers] This code serves as a complex use case that helps to catch bugs. Covering more use cases (e.g. a different number of queries and keys) or making it "user-friendly" is not a priority.
  2. [For users] This code is meant for users as an alternative to other flash attention implementations. Improving the user experience (e.g. no silently incorrect results) and expanding coverage of use cases are positives.

How do the core developers think about triton.ops.flash_attention? Would you welcome contributions to improve the user experience of this code (e.g. following up on #2033)? On the one hand, I think having a flash attention implementation on par with other libraries could help get folks interested in Triton; on the other hand, I imagine the core devs aren't looking to maintain more code without a good reason. I'd be happy to help add features to this code if that aligns with what the core devs want.

@jon-chuang
Contributor

jon-chuang commented Sep 8, 2023

I myself have been hoping for a collection or library of more advanced Triton applications.

However, it seems that the core concern of this repo is compiler correctness and performance.

Application code with more production concerns would ideally live somewhere else, perhaps in a repo called triton-extra or poseidon. I would be keen to contribute there without getting in the way of compiler development.

For instance, experimentation with more advanced kernels like #2243 and #2259 could take place there.

@Jokeren
Contributor

Jokeren commented Sep 9, 2023

You probably want to have a discussion with @daemyung.

@daemyung
Contributor

daemyung commented Sep 9, 2023

@Jokeren Thanks for the mention.

My opinion is that Triton is a language, akin to CUDA and SYCL, rather than a library. Consequently, supporting various operations (e.g., flash attention) falls outside Triton's scope. Consider CUDA for comparison: CUDA itself doesn't offer implementations of specific operations; instead, libraries built on CUDA (like CUTLASS, cuBLAS, and CUB) provide them.

For these reasons, I have initiated Trident. Trident is a performance library designed for machine learning applications, with a focus on accelerating both training and inference. It comprises highly optimized kernels, functions, and modules tailored for machine learning and is built upon Triton.

Therefore, I believe the purpose of triton.ops.flash_attention is geared towards developers rather than users.

I want to share the reasoning behind the name 'Trident'. Triton is a Greek god, and his signature weapon is the trident. The library was named 'Trident' because, much like the god Triton exhibits his full potential when wielding his trident, our Triton reaches its peak performance when paired with Trident.

[Trident logo]
