[RFC] Goal for triton.ops.flash_attention #2267
I myself have been hoping for a collection or library of more advanced applications of Triton. However, it seems that the core concern in this repo is compiler correctness and performance. Application code with more production-oriented concerns would ideally live somewhere else, perhaps in a separate repository. For instance, experimentation with more advanced kernels like #2243, #2259 could take place there.
You probably want to have a discussion with @daemyung
@Jokeren Thanks for the mention. My opinion is that Triton is a language, akin to CUDA and SYCL, rather than a library. Consequently, supporting various operations (e.g., flash attention) falls outside Triton's scope. Consider CUDA for comparison: CUDA itself doesn't offer implementations for specific operations. Instead, libraries based on CUDA (like CUTLASS, cuBLAS, and cub) provide those implementations.

For these reasons, I have initiated Trident. Trident is a performance library designed for machine learning applications, with a focus on accelerating both training and inference. It comprises highly optimized kernels, functions, and modules tailored for machine learning and is built upon Triton. Therefore, I believe this kind of operation support is better served by a library such as Trident than by Triton itself.

I also want to share the reasoning behind the name 'Trident'. Triton is a Greek god, and his signature weapon is the trident. The library was named 'Trident' because, much like the god Triton exhibits his full potential when wielding his trident, our Triton reaches its peak performance when paired with Trident.
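To make the language-versus-library distinction concrete, here is a minimal sketch of what a kernel written in the Triton language looks like; a library such as Trident would package many such kernels behind a higher-level API. The elementwise-add kernel below is a generic illustration (it is not taken from Trident or triton.ops), and the block size and launch-grid style follow the Triton tutorials, which may differ slightly across versions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # A library-style Python wrapper around the raw kernel launch.
    out = torch.empty_like(x)
    n_elements = out.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

The language gives you the kernel; everything around it (dispatch, shape handling, autograd integration) is what a library layer provides.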
Hi! The flash attention implementation is really helpful as a reference. I noticed that the code currently makes some assumptions (e.g. about shapes and strides) and can silently produce incorrect results if used incorrectly.
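As one illustration of how that kind of silent misuse could be caught early, here is a minimal sketch of a defensive wrapper that validates shape, dtype, and contiguity assumptions before dispatching to an attention kernel. The (batch, heads, seq_len, head_dim) layout, the supported head sizes, and the `attention_kernel` callable are assumptions for illustration only, not the actual contract of triton.ops.flash_attention.

```python
import torch


def checked_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                      attention_kernel, *args, **kwargs) -> torch.Tensor:
    """Validate common assumptions before calling an attention kernel.

    `attention_kernel` is a placeholder for whichever entry point you use
    (e.g. a Triton-based flash attention); the checks below encode one
    plausible contract, not the official one.
    """
    # Assumed layout: (batch, heads, seq_len, head_dim) for all three inputs.
    for name, t in (("q", q), ("k", k), ("v", v)):
        if t.dim() != 4:
            raise ValueError(f"{name} must be 4-D (batch, heads, seq_len, head_dim), got {t.dim()}-D")
        if not t.is_contiguous():
            # Kernels that index with raw strides can silently read the wrong
            # elements from non-contiguous inputs instead of raising an error.
            raise ValueError(f"{name} must be contiguous; call .contiguous() first")
    if q.shape != k.shape or q.shape != v.shape:
        raise ValueError(f"q/k/v shapes must match, got {q.shape}, {k.shape}, {v.shape}")
    head_dim = q.shape[-1]
    if head_dim not in (16, 32, 64, 128):
        # Flash-attention-style kernels typically support only a few head
        # sizes; the exact set here is an assumption for illustration.
        raise ValueError(f"unsupported head_dim {head_dim}")
    if q.dtype not in (torch.float16, torch.bfloat16):
        raise ValueError(f"expected fp16/bf16 inputs, got {q.dtype}")
    return attention_kernel(q, k, v, *args, **kwargs)
```

Checks like these (or documented preconditions) are the kind of user-experience improvement I have in mind below.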
I've seen some related issues and PRs (e.g. #2033, #2029, #2046, #2086) and am not clear what the intended goals of this code are. Two possibilities I can imagine:
How do the core developers think about `triton.ops.flash_attention`? Would you welcome contributions to improve the user experience of this code (e.g. following up on #2033)? On the one hand, I think having a flash attention implementation on par with other libraries could help get folks interested in Triton; on the other hand, I imagine the core devs aren't looking to maintain more code without a good reason. I'd be happy to help add features to this code if that aligns with what the core devs want.