Release 4.4.0 and flash attention with python [WIP] #1775
Comments
Will bench in the near future when I have the time, hence the WIP in the title, and will let y'all know if my results are different. I previously benched and noticed significant benefits when solely the …
UPDATE: I don't have time to bench right now, but will try my best in the future. Closing for now.
It looks like Flash Attention was removed from the Python portion of release 4.4.0. I had a few questions:
Can you confirm that Flash Attention is still available in release 4.3.1? As far as I'm aware, no benchmarking was done on long-context QA with/without Flash Attention 2; only "relatively" short prompts/contexts were tested. I'd still like to benchmark FA on longer contexts to see whether there's a meaningful benefit (a rough sketch of the timing harness I have in mind follows below).
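This is only a minimal sketch of the kind of comparison I'm planning, under the assumption that I can install one build with FA and one without; the generation callables are placeholders, and names like make_generator / flash_attention here are illustrative rather than the library's actual API.

```python
# Minimal long-context benchmark sketch. The generation callables are
# placeholders: wire them up to whichever builds (e.g. 4.3.1 with FA vs.
# 4.4.0 without) are actually installed. make_generator / flash_attention
# below are illustrative names, not the library's real API.
import time
import statistics

def bench(generate_fn, prompts, warmup=1, repeats=3):
    """Run generate_fn over the prompts several times and return per-run seconds."""
    for _ in range(warmup):
        generate_fn(prompts)  # warm up caches / kernel selection
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        generate_fn(prompts)
        timings.append(time.perf_counter() - start)
    return timings

def report(label, timings):
    print(f"{label}: median {statistics.median(timings):.2f}s over {len(timings)} runs")

if __name__ == "__main__":
    # Long-context QA style prompts (purely illustrative sizing).
    long_prompts = [("some long retrieved context " * 2000) + "\nQuestion: ...\nAnswer:"] * 4

    # generate_with_fa = make_generator(flash_attention=True)      # placeholder
    # generate_without_fa = make_generator(flash_attention=False)  # placeholder
    # report("with FA", bench(generate_with_fa, long_prompts))
    # report("without FA", bench(generate_without_fa, long_prompts))
```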
Is it possible to compile version 4.4.0 so that Flash Attention is included in a wheel file, even though such a wheel won't be uploaded to pypi.org? If my benchmarking indicates it's advantageous, I'd like to use version 4.4.0's improvements AND Flash Attention. I'm not very familiar with compiling in general, so forgive the question, but basically: if I compile from source, will the build include the relevant "python" portions that you say are not omitted? (I've sketched my guess at the build steps below.)
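My rough understanding is that the wheel would need to be built from source with the native library's flash-attention option enabled. The sketch below is only a guess at that workflow: the -DWITH_FLASH_ATTN=ON option name and the ./python wheel layout are assumptions I'd need to verify against the project's CMakeLists.txt and build docs.

```python
# Rough sketch of a from-source build that keeps flash attention enabled in
# the wheel. The -DWITH_FLASH_ATTN=ON option name and the ./python wheel
# directory are assumptions; check CMakeLists.txt and the build docs first.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # 1. Configure and build the native library with the (assumed) FA flag on.
    run(["cmake", "-S", ".", "-B", "build",
         "-DCMAKE_BUILD_TYPE=Release", "-DWITH_FLASH_ATTN=ON"])
    run(["cmake", "--build", "build", "--parallel"])

    # 2. Build the Python wheel against the library built above.
    run(["pip", "wheel", "./python", "-w", "dist", "--no-deps"])
```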
Thanks again for the great work.